This page gives an overview of how the ground truth data for each task is represented in the downloaded dataset. If anything is unclear, please feel free to contact us so that we can extend this page with additional information.

Images can be downloaded as RGB, grayscale (simulated active infrared camera system) or depth maps. However, sceneries can appear quite dark when RGB images are used. The ground truth data is identical for all three types of images.

The depth images are provided as uint16 .png files whose values are given in millimeters.
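As a minimal sketch of how such a depth map could be read, the following assumes that Pillow and NumPy are installed. The function name load_depth_in_meters is hypothetical and not part of any official dataset tooling; it only illustrates the millimeter-to-meter conversion.

```python
import numpy as np
from PIL import Image


def load_depth_in_meters(path):
    """Load a uint16 depth .png (values in millimeters) as a float32 array in meters."""
    depth_mm = np.asarray(Image.open(path))
    # Convert to float before dividing so that no integer truncation occurs.
    return depth_mm.astype(np.float32) / 1000.0
```

Working in float32 avoids the overflow and rounding issues that can occur when arithmetic is done directly on uint16 values.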

File names

The file names of all ground truth data and images are constructed as follows, so that most of the important information is already contained in the name:


Because the rectangular images contain information about a single seat only, their naming is slightly different. The imageID still refers to the whole image, and we additionally specify which seat is shown (0 to 2, representing left to right):



Classification

There are seven different classes for classification, as introduced in the overview.

Classification is performed on each individual seat. Consequently, the images need to be split into three rectangles such that each seat can be classified individually. For cars with only two seats, the middle one is not used. The rectangles slightly overlap with the neighboring ones, because objects are not limited to their seat position.

From each whole image, the rectangular ones are cropped according to the following start positions, each one having a width of 250 pixels and a height of 550 pixels:

X_START = [130, 364, 582]
Y_START = [70, 70, 70]
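The cropping described above could be sketched as follows, assuming Pillow is installed. The start positions, width and height are taken from the text; the function names seat_boxes and crop_seats are hypothetical.

```python
from PIL import Image

X_START = [130, 364, 582]
Y_START = [70, 70, 70]
WIDTH, HEIGHT = 250, 550


def seat_boxes():
    """Return one (left, upper, right, lower) box per seat, ordered left to right."""
    return [(x, y, x + WIDTH, y + HEIGHT) for x, y in zip(X_START, Y_START)]


def crop_seats(image):
    """Crop a whole PIL image into the three seat rectangles."""
    return [image.crop(box) for box in seat_boxes()]
```

The resulting boxes span x = 130 to 380, 364 to 614 and 582 to 832, which shows the slight overlap between neighboring rectangles.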

When downloading the dataset for classification, you will find that the images are organized according to the following directory tree:

├──────────── 0
├──────────── 1
├──────────── 2
├──────────── 3
├──────────── 4
├──────────── 5
└──────────── 6

Each subfolder contains only images for a specific class. The above folder structure can then immediately be used in combination with the torchvision ImageFolder class to start your training.

Semantic and instance segmentation

We provide a position- and class-based instance segmentation mask, which can be used as ground truth data for both semantic and instance segmentation. Both tasks use five classes; infants and children are separated from their seats and classified as persons.

Each pixel is colored according to which object it belongs to, but also according to which seat (left, middle or right) the object is placed on.

The masks need to be converted to integer values, because most loss functions and deep learning frameworks expect integer ground truth labels. A simple way to obtain a label between 0 and 4 is to first convert the mask to grayscale. For example, one can use PIL as follows (path_to_mask is a placeholder for the path of the mask file):

from PIL import Image
mask = Image.open(path_to_mask).convert('L')

After that, each grayscale value can be converted to an integer class label (and seat position) using the following function:

def get_class_by_gray_pixel_value(gray_pixel_value):
    """
    Return the ground truth label (an integer from 0 to 4) for a grayscale pixel value.
    The function also returns the seat on which the object is placed (left, middle or right).

    The relationship between gray_pixel_value and the class integer depends on the grayscale transformation function used.
    This function should work fine with PIL and opencv.

    0 = background
    1 = infant seat
    2 = child seat
    3 = person
    4 = everyday object

    Keyword arguments:
    gray_pixel_value -- grayscale pixel value between 0 and 255

    Return:
    class_label, position
    """

    # background
    if gray_pixel_value == 226 or gray_pixel_value == 225:
        return 0, None

    # infant seat
    if gray_pixel_value == 76:
        return 1, "left"

    if gray_pixel_value == 179 or gray_pixel_value == 178:
        return 1, "middle"

    if gray_pixel_value == 167 or gray_pixel_value == 166:
        return 1, "right"

    # child seat
    if gray_pixel_value == 150 or gray_pixel_value == 149:
        return 2, "left"

    if gray_pixel_value == 196 or gray_pixel_value == 195:
        return 2, "middle"

    if gray_pixel_value == 78 or gray_pixel_value == 77:
        return 2, "right"

    # person
    if gray_pixel_value == 29:
        return 3, "left"

    if gray_pixel_value == 227:
        return 3, "middle"

    if gray_pixel_value == 132 or gray_pixel_value == 131:
        return 3, "right"

    # everyday objects
    if gray_pixel_value == 105:
        return 4, "left"

    if gray_pixel_value == 208:
        return 4, "middle"

    if gray_pixel_value == 0:
        return 4, "right"

    return None, None

Notice: depending on which function and library you use to convert the mask to grayscale, different values need to be used in the above function, because different libraries might use different grayscale transformations. The above function should work with PIL and opencv.
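Calling the function above for every pixel can be slow on full masks. As a sketch under the assumption that NumPy is available, the same grayscale values can be placed in a lookup table and applied to a whole mask at once. The names GRAY_TO_CLASS, IGNORE_LABEL and gray_mask_to_labels are hypothetical, and the marker value 255 for unmapped gray values is an arbitrary choice. The seat position is dropped here, so the result is suited to semantic segmentation.

```python
import numpy as np

# Grayscale values copied from get_class_by_gray_pixel_value.
GRAY_TO_CLASS = {
    226: 0, 225: 0,                                  # background
    76: 1, 179: 1, 178: 1, 167: 1, 166: 1,           # infant seat
    150: 2, 149: 2, 196: 2, 195: 2, 78: 2, 77: 2,    # child seat
    29: 3, 227: 3, 132: 3, 131: 3,                   # person
    105: 4, 208: 4, 0: 4,                            # everyday object
}
IGNORE_LABEL = 255  # hypothetical marker for gray values without a class


def gray_mask_to_labels(gray_mask):
    """Map a grayscale mask (H x W, values 0 to 255) to class labels 0 to 4."""
    lut = np.full(256, IGNORE_LABEL, dtype=np.uint8)
    for gray, label in GRAY_TO_CLASS.items():
        lut[gray] = label
    # Indexing the table with the mask applies the mapping to every pixel.
    return lut[np.asarray(gray_mask, dtype=np.uint8)]
```

A grayscale PIL image can be passed in directly, since np.asarray converts it to an array first.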

Bounding boxes

For each scenery, we provide a single text file which contains the bounding boxes of all objects in the scene. We assume that the origin of the images is at the upper left corner. Each line in the text file corresponds to a different object and each line contains the following information:

[class_label], [x_upper_left_corner], [y_upper_left_corner], [x_lower_right_corner], [y_lower_right_corner]

As an example, such a file could look as follows:


In this scenery, we would have an empty infant seat (label 1) at the left seat and a child (label 3) in a child seat (label 2) at the right seat.
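A bounding box file in the format above could be parsed along the following lines. This is a sketch that assumes the values on each line are separated by commas and written without literal brackets; the function name parse_bounding_boxes and the returned dictionary keys are hypothetical.

```python
def parse_bounding_boxes(text):
    """Parse lines of 'class_label, x1, y1, x2, y2' into a list of dicts."""
    boxes = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        parts = [p.strip() for p in line.split(",")]
        label = int(parts[0])
        # Coordinates are parsed as floats so that integer and decimal values both work.
        x1, y1, x2, y2 = (float(p) for p in parts[1:5])
        boxes.append({
            "label": label,
            "box": (x1, y1, x2, y2),
            # The origin is the upper left corner, so x2 >= x1 and y2 >= y1.
            "width": x2 - x1,
            "height": y2 - y1,
        })
    return boxes
```

The function takes the file content as a string, for example the result of reading the text file with open(...).read().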


Keypoints

For each scenery, we provide a single .json file which includes the poses of all people (babies included) in the scene. We store the human poses as keypoints, as in the COCO dataset, but our skeleton is defined using partially different joints. The .json file contains the 2D pixel coordinates of the keypoints of all people, together with the visibility flag, the bone names and the seat position of each person. It is constructed as follows:

{
    [name_of_person_1_in_the_scene]: {
        "bones": {
            [name_of_bone_1]: [ ... ],
            [name_of_bone_2]: [ ... ],
            ...
        },
        "position": [seat_position_of_person_1]
    },
    [name_of_person_2_in_the_scene]: {
        "bones": { ... },
        "position": [seat_position_of_person_2]
    }
}

Here is a list of all available bone names:


For our benchmark, we only use 17 of the provided bones, namely the ones named:

["head", "clavicle_r", "clavicle_l", "upperarm_r", "upperarm_l",  "lowerarm_r", "lowerarm_l", "hand_r", "hand_l", "thigh_r", "thigh_l", "calf_r", "calf_l", "pelvis", "neck_01", "spine_02", "spine_03"]
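Restricting a loaded pose file to the 17 benchmark bones could be sketched as follows. The bone names are taken from the list above and the "bones" and "position" keys from the file structure described on this page; the function name select_benchmark_bones is hypothetical. The per-bone entries are passed through unchanged, so no assumption is made about their internal layout.

```python
import json

BENCHMARK_BONES = [
    "head", "clavicle_r", "clavicle_l", "upperarm_r", "upperarm_l",
    "lowerarm_r", "lowerarm_l", "hand_r", "hand_l", "thigh_r", "thigh_l",
    "calf_r", "calf_l", "pelvis", "neck_01", "spine_02", "spine_03",
]


def select_benchmark_bones(poses):
    """Keep only the 17 benchmark bones for every person in the scene."""
    selected = {}
    for person, data in poses.items():
        bones = data["bones"]
        selected[person] = {
            "position": data["position"],
            # Bones missing from the file are skipped instead of raising an error.
            "bones": {name: bones[name] for name in BENCHMARK_BONES if name in bones},
        }
    return selected
```

The input is the dictionary obtained by reading a pose file with json.load.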

The visibility of each keypoint is set to

  • 0, if the keypoint is outside the image,
  • 1, if it is occluded by an object or neighboring human,
  • 2, if it is visible or occluded by the person itself.

For the benchmark evaluation, only the bones with a visibility of 2 are considered. In practice, this usually means that visibility values of 0 and 1 are mapped to 0, and a visibility of 2 is mapped to 1.
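The visibility remapping for the benchmark can be written as a small helper. This is a sketch; the function names are hypothetical, and the keypoint layout [x, y, visibility] used in binarize_keypoint is an assumption about the .json entries, not something confirmed on this page.

```python
def to_benchmark_visibility(visibility):
    """Map the dataset flag (0, 1 or 2) to the benchmark flag (0 or 1)."""
    # Only keypoints that are visible or self-occluded (flag 2) are evaluated.
    return 1 if visibility == 2 else 0


def binarize_keypoint(keypoint):
    """Remap the flag of one keypoint, ASSUMING the layout [x, y, visibility]."""
    x, y, visibility = keypoint
    return [x, y, to_benchmark_visibility(visibility)]
```

If the entries in the pose files use a different ordering, only binarize_keypoint needs to be adapted.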