Synthetic Data Collection
Sufficient training data is essential for our models to generalize to different kinds of aircraft and to different hangar environments.
To acquire this data, we exploit ShapeNet's rich repository of aircraft models. We first manually annotate S models, each with 9 to 11 3D keypoints in the world coordinate system. We then render each model from N different views. For each view, we compare the depth of each 3D keypoint in the camera coordinate system with the depth value at its 2D projection on the corresponding depth map; if the two agree, the keypoint is visible in that view, otherwise it is occluded.
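The visibility check is a straightforward depth comparison. Below is a minimal sketch assuming a pinhole camera model; the function name, argument layout, and tolerance are illustrative rather than taken from the actual pipeline.

```python
import numpy as np

def keypoint_visibility(kps_world, R, t, K, depth_map, tol=1e-2):
    """Return a boolean mask marking which 3D keypoints are visible in a view.

    kps_world: (n, 3) keypoints in world coordinates
    R, t:      camera extrinsics (3x3 rotation, translation 3-vector)
    K:         3x3 camera intrinsics
    depth_map: (H, W) depth rendering of the model from this view
    """
    kps_cam = kps_world @ R.T + t        # world -> camera coordinates
    z = kps_cam[:, 2]                    # keypoint depth in the camera frame
    uvw = kps_cam @ K.T                  # apply intrinsics
    uv = uvw[:, :2] / uvw[:, 2:3]        # perspective divide -> pixel coordinates
    h, w = depth_map.shape
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    in_image = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (z > 0)
    surface_z = depth_map[v.clip(0, h - 1), u.clip(0, w - 1)]  # nearest rendered surface
    # Visible iff the keypoint projects into the image and is not occluded,
    # i.e. its own depth (approximately) matches the surface depth there.
    return in_image & (np.abs(z - surface_z) < tol)
```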
The rendered images do not contain a background, so we blend each rendering with M random aircraft-hangar background images downloaded from the internet. This yields a dataset of S × N × M images.
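Since each rendering carries an alpha channel, the blending reduces to an alpha composite. A minimal sketch with PIL follows; the file handling and the way backgrounds are supplied are assumptions.

```python
import random
from PIL import Image

def composite(render_path, background_paths):
    """Blend an RGBA aircraft rendering over a randomly chosen hangar background."""
    render = Image.open(render_path).convert("RGBA")
    bg = Image.open(random.choice(background_paths)).convert("RGB")
    bg = bg.resize(render.size)
    bg.paste(render, (0, 0), mask=render)  # the alpha channel drives the blend
    return bg
```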
Model Training
Overview
We train a model to predict keypoints given an image containing an aircraft.
Dataset
For initial training, we use images from the aeroplane class of the PASCAL3D+ dataset: 1906 images in the training set and 477 images in the validation set. We augment the images at training time to artificially enlarge the dataset (a sketch of one possible pipeline appears after the figure). The figure below shows some example images from the dataset.

Fig: Sample images from the dataset
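The exact transforms we apply are not listed here; as one possible sketch, a keypoint-aware pipeline built with albumentations (the transforms and probabilities below are illustrative) keeps the 2D annotations consistent with the augmented image:

```python
import albumentations as A

# Hypothetical augmentation pipeline; albumentations transforms the keypoint
# coordinates together with the image.
transform = A.Compose(
    [
        A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=15, p=0.5),
        A.ColorJitter(p=0.3),
        A.HorizontalFlip(p=0.5),
    ],
    keypoint_params=A.KeypointParams(format="xy", remove_invisible=False),
)

out = transform(image=image, keypoints=keypoints)  # image: HxWx3 array, keypoints: [(x, y), ...]
aug_image, aug_keypoints = out["image"], out["keypoints"]
```

Note that a horizontal flip swaps semantically left/right keypoints (e.g. the two wingtips), so the keypoint order has to be remapped after flipping; the library does not do this automatically.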
Eight keypoints are annotated for each image at the locations shown in the figure below.

Fig: Locations of keypoints
Training Configuration
We train a network with a simple encoder-decoder architecture to predict one heatmap per keypoint. The training configuration is summarized in the table below; a minimal code sketch of this setup follows it.
| Setting | Configuration |
| --- | --- |
| Encoder | ResNet-34 |
| Decoder | UNet-like decoder with 8 output channels |
| Loss function | L2 loss |
| Learning rate | 1e-3 with step decay after each epoch |
| Batch size | 16 |
| Epochs | 100 |
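The table translates fairly directly into PyTorch. In the sketch below, the decoder widths, the pretrained weights, the choice of Adam, the step-decay factor, and `train_loader` (assumed to yield image batches with Gaussian heatmap targets, one channel per keypoint) are all illustrative assumptions, and the skip connections of a full UNet are omitted for brevity.

```python
import torch
import torch.nn as nn
import torchvision

class KeypointHeatmapNet(nn.Module):
    """ResNet-34 encoder with a lightweight upsampling decoder that outputs
    one heatmap per keypoint (8 channels)."""

    def __init__(self, n_keypoints=8):
        super().__init__()
        resnet = torchvision.models.resnet34(weights="IMAGENET1K_V1")
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool/fc
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, n_keypoints, kernel_size=1),  # 8 output channels
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = KeypointHeatmapNet()
criterion = nn.MSELoss()                                   # L2 loss on heatmaps
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Step decay after each epoch; the decay factor is an assumption.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)

for epoch in range(100):
    for images, target_heatmaps in train_loader:  # batch size 16; Gaussian targets
        optimizer.zero_grad()
        loss = criterion(model(images), target_heatmaps)
        loss.backward()
        optimizer.step()
    scheduler.step()
```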