SVIRO was created to investigate and benchmark machine learning approaches for application in the passenger compartment regarding common challenges of realistic engineering applications. In particular, SVIRO can be used to evaluate the generalization and robustness of machine learning models when trained on a limited number of variations.

The sceneries in the different vehicle interiors were generated randomly. We partitioned the available human models, child seats and backgrounds such that one part is only used for the training images (for all the vehicles) and the other part is used for the test images. Consequently, the dataset has an intrinsic dominant background, object and texture bias: all of the images are taken in a few passenger compartments, but generalization to new, unseen, passenger compartments and child seats should be achieved.
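The asset split described above can be sketched as follows. This is a minimal illustration and not the generation code used for SVIRO: the asset names, the `sample_scenery` helper and the seat count are hypothetical placeholders.

```python
import random

# Hypothetical asset identifiers; the real dataset uses its own 3D assets.
TRAIN_ASSETS = {
    "child_seats": ["child_seat_a", "child_seat_b"],
    "backgrounds": ["background_1", "background_2"],
}
TEST_ASSETS = {
    "child_seats": ["child_seat_c", "child_seat_d"],
    "backgrounds": ["background_3", "background_4"],
}


def assets_are_disjoint(train, test):
    """Return True if no asset appears in both splits."""
    return all(set(train[key]).isdisjoint(test[key]) for key in train)


def sample_scenery(assets, num_seats, rng):
    """Randomly pick a background and, per seat, a child seat or nothing (None)."""
    choices = assets["child_seats"] + [None]
    return {
        "background": rng.choice(assets["backgrounds"]),
        "seats": [rng.choice(choices) for _ in range(num_seats)],
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
train_scenery = sample_scenery(TRAIN_ASSETS, num_seats=3, rng=rng)
```

Because every training scenery draws only from the training pool, a model evaluated on test sceneries never sees a familiar child seat or background.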

The dataset consists of 10 different vehicle interiors and 25,000 sceneries in total.

Ground Truth

A detailed description of the ground truth data is given on this page. For each scenery, we randomly selected what kind of object is placed at each seat position. We used the following different categories (images are examples for the different categories available):

Infant seat

Child seat


Everyday object

The child and infant seats can either be empty, or occupied by a baby or child respectively.

The labeling of the objects for the different tasks varies slightly, because we wanted to treat the infant/child and the infant/child seat as two different instances for segmentation and object detection. In the table below, you find the ground truth labels associated with the different objects for the different tasks.

Object                  Classification   Segmentation / Object detection   Keypoints
Infant in infant seat   1                -                                 1
Child in child seat     2                -                                 1
Everyday object         4                4                                 0
Empty infant seat       5                1                                 0
Empty child seat        6                2                                 0
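The table above can be turned into a small lookup, sketched here under the assumption that each row lists the classification, segmentation / object detection and keypoints values in that order, and that "-" means no single label is given for the combined object. The names `LABELS` and `label_for` are illustrative and not part of the dataset tooling; only the rows shown in the table are included.

```python
# Column order assumed from the table header.
TASKS = ("classification", "segmentation", "keypoints")

# One tuple per table row; None stands for the "-" entry.
LABELS = {
    "infant in infant seat": (1, None, 1),
    "child in child seat": (2, None, 1),
    "everyday object": (4, 4, 0),
    "empty infant seat": (5, 1, 0),
    "empty child seat": (6, 2, 0),
}


def label_for(obj, task):
    """Look up the table value of an object for the given task."""
    return LABELS[obj][TASKS.index(task)]


# An empty infant seat is class 5 for classification but class 1 for segmentation.
print(label_for("empty infant seat", "classification"))
print(label_for("empty infant seat", "segmentation"))
```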


At the moment, our dataset consists of ten different car models. The number of windows varies, which causes different lighting conditions, and some cars have only two rear seats instead of three. Further, the camera position and orientation vary, which results in different perspectives.

Hyundai – Tucson

BMW – X5

Renault – Zoe

Lexus – GS F

Toyota – Hilux

Tesla – Model 3

VW – Tiguan

BMW – i3

Mercedes – A Class

Ford – Escape


We used the same people and child seats for the training set of each vehicle and the remaining ones for the test sets. This results in two child seats and one infant seat per data split. We did the same for the backgrounds: five were selected for the training set and five different ones for the test set. For the everyday objects, we used two bags, a cardboard box and a cup for the training dataset and a different bag, a paper bag, pillows and a box of bottles for the test set. The number of people and the distribution of gender, age and ethnicity for the training and test sets can be found in the following table:

  Train Test

The number of images generated for each vehicle is identical, both for the training sets and for the test sets. In total, this results in 20,000 training and 5,000 test sceneries. The number and constellation of appearances vary between the different vehicles, because all the sceneries were generated randomly. The distribution of the different classes across the different vehicles and data splits is summarized in the following table. For each cell, the left number is for the training split and the right one for the test split. IS stands for infant seat and CS for child seat. We mark a randomized dataset by (R): for it, we randomly selected the environments and textures from a large pool of available assets and changed the colors randomly. Empty seats are dominant, which causes an imbalanced distribution across the different classes.

Vehicle   Empty        IS          CS          Adult       Object      Empty IS    Empty CS
A Class   2134 / 614   457 / 126   611 / 121   884 / 191   755 / 179   486 / 124   673 / 145
Escape    2079 / 569   489 / 133   581 / 143   940 / 215   742 / 187   443 / 108   726 / 145
GS F      2127 / 565   465 / 121   579 / 140   907 / 219   791 / 195   468 / 113   663 / 147
Hilux     2218 / 553   457 / 116   560 / 130   847 / 232   769 / 194   510 / 125   639 / 150
i3        884 / 180    372 / 117   496 / 98    919 / 223   442 / 129   363 / 113   524 / 140
Model 3   2507 / 613   449 / 121   537 / 107   909 / 224   565 / 196   439 / 105   594 / 134
Tiguan    2196 / 592   458 / 112   645 / 128   944 / 227   650 / 180   461 / 112   646 / 149
Tucson    2202 / 565   458 / 103   608 / 139   900 / 231   658 / 204   481 / 119   693 / 139
X5        2400 / 610   371 / 109   569 / 100   892 / 234   767 / 195   418 / 124   583 / 128
X5 (R)    2392 / -     397 / -     525 / -     896 / -     754 / -     429 / -     607 / -
Zoe       909 / 195    380 / 125   518 / 115   816 / 189   438 / 131   392 / 119   547 / 126
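To make the class imbalance concrete, the training counts from the table above can be summed and converted into per-class shares. The following is a minimal sketch: the names `CLASSES`, `TRAIN_COUNTS` and `class_shares` are illustrative, and the numbers are the left-hand (training) values of each cell as read from the table.

```python
CLASSES = ("Empty", "IS", "CS", "Adult", "Object", "Empty IS", "Empty CS")

# Training-split counts per vehicle, copied from the table above.
TRAIN_COUNTS = {
    "A Class": (2134, 457, 611, 884, 755, 486, 673),
    "Escape": (2079, 489, 581, 940, 742, 443, 726),
    "GS F": (2127, 465, 579, 907, 791, 468, 663),
    "Hilux": (2218, 457, 560, 847, 769, 510, 639),
    "i3": (884, 372, 496, 919, 442, 363, 524),
    "Model 3": (2507, 449, 537, 909, 565, 439, 594),
    "Tiguan": (2196, 458, 645, 944, 650, 461, 646),
    "Tucson": (2202, 458, 608, 900, 658, 481, 693),
    "X5": (2400, 371, 569, 892, 767, 418, 583),
    "X5 (R)": (2392, 397, 525, 896, 754, 429, 607),
    "Zoe": (909, 380, 518, 816, 438, 392, 547),
}


def class_shares(counts):
    """Return the fraction of seat positions that falls into each class."""
    total = sum(counts)
    return {name: n / total for name, n in zip(CLASSES, counts)}


for vehicle, counts in TRAIN_COUNTS.items():
    shares = class_shares(counts)
    print(f"{vehicle:8s} total={sum(counts)} empty={shares['Empty']:.1%}")
```

With these values, each row sums to 6000 or 4000 seat positions, which is consistent with 2,000 training sceneries per vehicle and three or two rear seats respectively.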


Infrared imitation

Many applications in the passenger compartment require an active infrared camera system to work in the dark. We decided to imitate such a system with a simple approach: we placed an active red lamp (R=100%, G=0%, B=0%) next to the camera inside the car so that it illuminates the rear seat while overlapping with the illumination from the HDR background image. We then took only the red channel from the resulting RGB image. We refer to these images as grayscale images. This is, however, not a physically accurate simulation of a real active infrared camera system. Nevertheless, the images become less dependent on the environmental lighting, which facilitates the tasks. See the figure below for a comparison between a standard RGB image and our grayscale image for a dark scenery, where a lot of information would otherwise be lost.
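The channel extraction step can be sketched in a few lines of NumPy. This is only an illustration of taking the red channel from an already rendered image; it assumes the array is stored in RGB channel order (some libraries load images as BGR), and the function name `red_channel_as_grayscale` is hypothetical.

```python
import numpy as np


def red_channel_as_grayscale(rgb):
    """Return the red channel of an H x W x 3 RGB array as an H x W image."""
    rgb = np.asarray(rgb)
    if rgb.ndim != 3 or rgb.shape[-1] != 3:
        raise ValueError("expected an array of shape (height, width, 3)")
    # Index 0 is the red channel under the assumed RGB ordering.
    return rgb[..., 0].copy()


# Tiny synthetic 2 x 2 RGB image, used only for illustration.
image = np.array(
    [[[200, 10, 10], [50, 80, 90]],
     [[0, 0, 255], [120, 120, 120]]],
    dtype=np.uint8,
)
gray = red_channel_as_grayscale(image)
```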

Validation on real infrared images

We tested the transferability of a model trained on SVIRO to real infrared images for instance segmentation. We fine-tuned all layers of a pre-trained Mask R-CNN model with a ResNet-50 backbone. The synthetic images were blurred to be closer to real infrared images. We combined the training images of the i3, Tucson and Model 3 and compared the results on synthetic and real images of the X5. Only bounding boxes and masks with a confidence of at least 0.5 are plotted. The model performs similarly on real (bottom row) and synthetic (top row) images and sometimes fails to detect objects. This is expected, as the model has only seen a limited amount of variation. However, a similar child seat is detected in the real images, but not in the synthetic ones. We believe that investigations on SVIRO are transferable to real applications, as the resulting model behaves similarly on real and synthetic images.
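The confidence threshold used for plotting can be sketched as a simple filter over predictions. The `Detection` container and the example values below are hypothetical and only illustrate the "at least 0.5" rule; they are not the output format of any particular Mask R-CNN implementation.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: int      # predicted class index
    score: float    # model confidence in [0, 1]
    box: tuple      # (x_min, y_min, x_max, y_max) in pixels


def filter_by_confidence(detections, threshold=0.5):
    """Keep only detections whose confidence is at least the threshold."""
    return [d for d in detections if d.score >= threshold]


# Made-up predictions for illustration.
predictions = [
    Detection(label=1, score=0.91, box=(10, 20, 110, 220)),
    Detection(label=2, score=0.32, box=(150, 30, 260, 230)),
    Detection(label=4, score=0.50, box=(300, 40, 380, 120)),
]
kept = filter_by_confidence(predictions)
```

Raising the threshold would hide more uncertain predictions, at the cost of also hiding correct low-confidence ones.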

Design Choices

During the data generation process we tried to simulate the conditions of a realistic application. We decided to partition the available human models, child seats and backgrounds such that one part is only used for the training images (for all the vehicles) and the other part is used for the test images. For each of the ten different vehicle passenger compartments and available child seats, we fixed the texture as if real images had been taken. Consequently, the machine learning models need to generalize to previously unknown variations of humans, child seats and environments. The facial expression for all human models is identical and neutral and the seat belts were not attached.

We can create images under defined conditions (e.g. the same scenery under different lighting conditions) so that additional investigations can be performed in future work. Since our goal was to provide a versatile dataset, it can also be used to test additional challenges. For example, one can train models only on infant seats with the handle down and test them on seats with the handle up.

We also generated a training dataset with textures and backgrounds randomly selected from a large pool of available images in order to test the influence of the texture on the different tasks.


We used the free and open source 3D computer graphics software Blender 2.79 and its Python API to construct and render the synthetic 3D sceneries. For our dataset, we selected a subset of the seats available on the market and created a 3D model of each one so that it could be used in our simulation. The 3D models were generated using depth cameras (Kinect v1) and precise structured light scanners (Artec Eva). We used textures (Albedo, Normal and Roughness images) from (with permission) for all the objects in the scene. The environmental background and lighting were created by means of High Dynamic Range Images (HDRI) from HDRI Haven. The human models (adults, children and babies) and their clothing (additional clothes were downloaded from the community assets) were randomly generated using the open source 3D graphics software MakeHuman 1.2.0. The 3D models of the cars were purchased from Hum3D, and everyday objects (e.g. backpacks, boxes, pillows) were downloaded from Sketchfab.