Synthetic Data Collection

We need sufficient training data for our models to generalize across different kinds of aircraft and different hangar environments.

To acquire this data, we exploit ShapeNet’s rich repository of aircraft models. We first manually annotate S models with 9 to 11 3D keypoints in the world coordinate system. Then we render each model from N different views. For each view, we compare the depth of each 3D keypoint in the camera coordinate system with the depth value at its 2D projection on the corresponding depth map; the keypoint is visible in that view only if the two depths agree, i.e., no surface lies in front of it.
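The depth comparison above can be sketched as follows. This is a minimal illustration, not the original pipeline: the function name, the pinhole intrinsics matrix `K`, and the tolerance parameter are our own assumptions.

```python
import numpy as np

def keypoint_visibility(kp_cam, K, depth_map, tol=0.01):
    """Mark each 3D keypoint (camera coordinates) as visible if its depth
    agrees with the rendered depth map at its 2D projection.

    kp_cam    : (N, 3) keypoints in the camera frame (z > 0 toward the scene)
    K         : (3, 3) pinhole camera intrinsics (illustrative assumption)
    depth_map : (H, W) rendered depth image
    tol       : depth agreement tolerance, in the depth map's units
    """
    H, W = depth_map.shape
    visible = np.zeros(len(kp_cam), dtype=bool)
    for i, (x, y, z) in enumerate(kp_cam):
        if z <= 0:
            continue  # behind the camera: never visible
        # Project the 3D point to pixel coordinates.
        u = int(round(K[0, 0] * x / z + K[0, 2]))
        v = int(round(K[1, 1] * y / z + K[1, 2]))
        if 0 <= u < W and 0 <= v < H:
            # Visible only if no surface sits in front of the keypoint.
            visible[i] = abs(depth_map[v, u] - z) < tol
    return visible
```

A keypoint on the far side of the fuselage projects inside the silhouette, but the depth map there records the nearer surface, so the depths disagree and the keypoint is marked occluded.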

The rendered images have no background. Therefore, we blend each rendering with M random aircraft hangar background images downloaded from the internet. This way, we acquire a dataset of S × N × M images.
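The blending step amounts to standard alpha compositing of the rendering over a background. A minimal sketch, assuming the renderer outputs an RGBA image whose alpha channel is zero outside the aircraft:

```python
import numpy as np

def composite(render_rgba, background_rgb):
    """Alpha-blend an RGBA rendering over a background image.

    render_rgba    : (H, W, 4) float array in [0, 1]; alpha = 0 outside the aircraft
    background_rgb : (H, W, 3) float array in [0, 1]
    Returns the blended (H, W, 3) image.
    """
    alpha = render_rgba[..., 3:4]  # keep the last axis for broadcasting
    return alpha * render_rgba[..., :3] + (1.0 - alpha) * background_rgb
```

Pixels inside the aircraft keep the rendered color; pixels outside take the background, and partially transparent edge pixels are blended smoothly.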

Model Training


We train a model to predict keypoints given an image containing an aircraft.


For initial training, we use images from the aeroplane class of the PASCAL3D+ dataset. There are 1906 images in the training set and 477 images in the validation set. We augment the images at training time to artificially enlarge the dataset. The figure below shows some example images from the dataset.
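One subtlety of augmenting keypoint data is that geometric transforms must be applied to the annotations as well. A minimal sketch of a horizontal flip, one common augmentation (the function, its parameters, and the `swap_pairs` idea are our own illustration, not the original training code):

```python
import numpy as np

def random_hflip(image, keypoints, swap_pairs, rng, p=0.5):
    """With probability p, mirror the image and its keypoints horizontally.

    image      : (H, W, 3) array
    keypoints  : (K, 2) array of (x, y) pixel coordinates
    swap_pairs : list of (i, j) index pairs of left/right symmetric keypoints
                 (e.g. the two wingtips) that must be exchanged after mirroring
    rng        : numpy random Generator
    """
    if rng.random() < p:
        W = image.shape[1]
        image = image[:, ::-1]                    # mirror the columns
        keypoints = keypoints.copy()
        keypoints[:, 0] = W - 1 - keypoints[:, 0]  # mirror x coordinates
        for i, j in swap_pairs:
            keypoints[[i, j]] = keypoints[[j, i]]  # keep left/right semantics
    return image, keypoints
```

Without the index swap, a mirrored "left wingtip" label would land on the right wingtip and corrupt the supervision.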

Fig : Sample images from the dataset

Eight keypoints are annotated for each image at the locations shown in the figure below.

Fig : Locations of keypoints

Training Configuration

We train a network with a simple encoder-decoder architecture to predict the heatmaps corresponding to the keypoints. The training configuration is described in the table below:

Decoder       | UNet-like decoder with 8 output channels
Loss Function | L2 loss
Learning Rate | 1e-3 with step decay after each epoch
Batch Size    | 16

Table : Training configuration
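Training a heatmap network with an L2 loss requires per-keypoint target maps, conventionally 2D Gaussians centered on the annotated locations. A minimal sketch of such target generation (the function name and the sigma value are our own assumptions; the source does not specify how targets are built):

```python
import numpy as np

def gaussian_heatmaps(keypoints, height, width, sigma=2.0):
    """Build one target heatmap per keypoint: a 2D Gaussian centered on the
    keypoint, matching the 8 output channels of the decoder.

    keypoints : (K, 2) array of (x, y) pixel coordinates
    Returns a (K, height, width) float32 array with peak value 1 at each keypoint.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.empty((len(keypoints), height, width), dtype=np.float32)
    for k, (x, y) in enumerate(keypoints):
        # Unnormalized Gaussian: value 1 exactly at the keypoint pixel.
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps
```

The network then regresses these maps, and the L2 loss is simply the mean squared difference between the predicted and target heatmaps; at inference, each keypoint is read off as the argmax of its channel.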