Results

Qualitative Analysis

The figure below shows input images that were unseen during training alongside 3D models rendered from the poses predicted by the trained network. The close alignment between the input images and the rendered models indicates that the network is learning to predict poses close to the actual poses from which the images were captured.

Figure: The top row shows the input images to the pose prediction network, and the bottom row shows RGB images of the 3D model rendered from the predicted poses.

Quantitative Analysis

We evaluate the model by measuring the error in the translational and rotational components of the predicted pose. The translational error is measured as the RMSE between ground-truth and predicted positions, and the rotational error is derived from the cosine similarity between ground-truth and predicted quaternions. The metrics are defined as follows.

Figure: Metrics used to quantify our model
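The figure with the exact metric definitions is not reproduced here, so the snippet below is only a minimal sketch of how these two metrics are commonly computed. The function names, the use of NumPy, and the factor of 2 that converts the quaternion dot product into a rotation angle are illustrative assumptions, not details taken from our implementation.

```python
import numpy as np

def position_rmse(t_true, t_pred):
    """Root-mean-square error between ground-truth and predicted translations (meters)."""
    # t_true, t_pred: arrays of shape (N, 3)
    return float(np.sqrt(np.mean(np.sum((t_true - t_pred) ** 2, axis=1))))

def angular_error_deg(q_true, q_pred):
    """Rotation error in degrees derived from the cosine similarity of unit quaternions."""
    # q_true, q_pred: arrays of shape (N, 4); q and -q encode the same rotation,
    # so the absolute value of the dot product is used.
    q_true = q_true / np.linalg.norm(q_true, axis=1, keepdims=True)
    q_pred = q_pred / np.linalg.norm(q_pred, axis=1, keepdims=True)
    cos_sim = np.clip(np.abs(np.sum(q_true * q_pred, axis=1)), 0.0, 1.0)
    # 2 * arccos(|<q, q_hat>|) is the geodesic angle between the two orientations.
    return np.degrees(2.0 * np.arccos(cos_sim))
```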
Metric            Value
Position Error    1.44 meters
Angular Error     5.38 degrees
Table: Metrics on the unseen validation dataset

The table above shows the metrics computed on the unseen validation dataset. The position error of 1.44 meters is a small fraction (roughly 3%) of the length of the C17 aircraft, which is more than 50 m long, and the angular error is only a few degrees.

Inference

Since the model is a prototype intended for use in commercial applications, we also measured metrics relevant to deploying it. The table below shows these metrics.

Metric              Value
No. of parameters   21.3 million
Inference time      2.7 ms (at full precision on an RTX 3090 Ti)
FLOPs               38.4 billion
Table: Metrics relevant for model inference
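As a rough sketch of how such deployment metrics can be collected for a PyTorch model, the snippet below counts parameters and times full-precision inference on the GPU. The stand-in architecture, the 224x224 input resolution, and the warm-up/iteration counts are assumptions for illustration and do not reflect the actual network or measurement setup.

```python
import torch
import torch.nn as nn

# Stand-in for the trained pose network; the real architecture is not reproduced here.
pose_net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 7),          # 3 translation + 4 quaternion components
).eval().cuda()
dummy = torch.randn(1, 3, 224, 224, device="cuda")   # input resolution is an assumption

# Parameter count in millions
n_params = sum(p.numel() for p in pose_net.parameters())
print(f"Parameters: {n_params / 1e6:.1f} M")

# Full-precision (FP32) latency, averaged over repeated runs after a warm-up
start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
with torch.no_grad():
    for _ in range(10):        # warm-up
        pose_net(dummy)
    torch.cuda.synchronize()
    start.record()
    for _ in range(100):
        pose_net(dummy)
    end.record()
    torch.cuda.synchronize()
print(f"Inference time: {start.elapsed_time(end) / 100:.2f} ms per image")
```

FLOPs per forward pass can be estimated with a profiler such as fvcore's FlopCountAnalysis; the choice of profiler here is an assumption, while the 38.4 billion figure in the table comes from the original measurement.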