Fall ’23 progress

Visual odometry

We rework our visual odometry module so that it can also be used in unseen environments. To this end, we take 2 RGB frames from different timesteps and estimate the relative pose of the glasses between them.

We follow the general Structure from Motion pipeline, starting with finding robust correspondences. We use SuperPoint (with SuperGlue as the matcher) to find reliable correspondences between the 2 image frames. From these matched correspondences, we estimate the Essential matrix E. E is then decomposed to recover the relative rotation and translation between the 2 frames.
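The decomposition step can be sketched as follows. This is a minimal NumPy illustration of the textbook SVD-based decomposition, not our exact implementation; the function names `skew` and `decompose_essential` are hypothetical. An essential matrix yields four candidate (R, t) pairs, and it fixes the translation only up to scale, so `t` here is a unit direction.

```python
import numpy as np

def skew(v):
    """Return the 3x3 cross-product (skew-symmetric) matrix of a 3-vector."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def decompose_essential(E):
    """Return the four (R, t) candidates encoded by an essential matrix E.

    t is a unit vector: the scale of the translation is not observable from E.
    """
    U, _, Vt = np.linalg.svd(E)
    # Flip signs if needed so that the candidates are proper rotations (det = +1).
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t = U[:, 2]  # left null vector of E, i.e. the translation direction
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```

In practice, the correct candidate is the one that places the triangulated correspondences in front of both cameras. OpenCV's `cv2.findEssentialMat` and `cv2.recoverPose` cover the estimation and this selection step.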

We observe that the rotation and translation errors grow as the timestep gap between the 2 frames increases.

Inertial odometry

We use the IDOL model, extended to predict a 6-D pose instead of a 5-D pose, so that its output can be used together with the visual odometry predictions.


We aim to use the visual odometry module only when the inertial odometry module has drifted significantly, and we use a classifier to detect this. The classifier is trained on the ground truth poses and IDOL’s pose predictions, with a threshold hyperparameter deciding what counts as major drift. The model architecture is a simple 3-layer MLP.
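One way the training labels and the classifier could look is sketched below. Everything specific here is an assumption for illustration: drift is measured as the Euclidean position error, the hidden width is 16, the activations are ReLU with a sigmoid output, and the names `drift_labels`, `init_params`, and `mlp_forward` are hypothetical. The input features of the actual classifier are not specified above.

```python
import numpy as np

def drift_labels(io_positions, gt_positions, threshold):
    """Label 1 where the IO position error exceeds the threshold, else 0."""
    io_positions = np.asarray(io_positions, dtype=float)
    gt_positions = np.asarray(gt_positions, dtype=float)
    err = np.linalg.norm(io_positions - gt_positions, axis=1)
    return (err > threshold).astype(np.int64)

def init_params(in_dim, hidden=16, seed=0):
    """Random weights and zero biases for a 3-layer MLP (assumed sizes)."""
    rng = np.random.default_rng(seed)
    dims = [in_dim, hidden, hidden, 1]
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def mlp_forward(x, params):
    """Forward pass: two ReLU layers, then a sigmoid drift probability."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h = np.maximum(0.0, x @ w1 + b1)
    h = np.maximum(0.0, h @ w2 + b2)
    logits = h @ w3 + b3
    return 1.0 / (1.0 + np.exp(-logits))
```

The labels from `drift_labels` would serve as binary targets when training the MLP; the training loop is omitted here.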

Visual-Inertial odometry

When we combine the VO and IO predictions in a weighted fashion and turn on the camera only when the classifier calls for it, we obtain accurate pose predictions with very low power consumption.
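A minimal sketch of the gated, weighted combination for the position part of the pose is given below. The weight `w_vo = 0.8` and the function name `fuse_position` are assumptions for illustration, not values from our system. Orientation would need a rotation-aware blend (for example, quaternion interpolation) rather than a plain weighted sum.

```python
import numpy as np

def fuse_position(io_pos, vo_pos, camera_on, w_vo=0.8):
    """Blend VO and IO positions when the camera is on, else keep the IO estimate."""
    io_pos = np.asarray(io_pos, dtype=float)
    if not camera_on or vo_pos is None:
        # Camera stays off: no VO estimate is available, so IO is used as is.
        return io_pos
    vo_pos = np.asarray(vo_pos, dtype=float)
    return w_vo * vo_pos + (1.0 - w_vo) * io_pos
```

Here `camera_on` stands for the classifier's decision at the current timestep.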

Spring ’23 results

Visual Odometry

The InceptionNet-based model produces results similar to the ResNet-based model with nearly half the number of FLOPs. We therefore chose the InceptionNet architecture to train on the Smith Hall dataset, which was collected using Aria glasses.

The metrics used here are the RMSE (Root Mean Square Error) of the position and the orientation error, measured as the difference in degrees between the predicted and ground truth orientations. As we can see, both the position and orientation errors of Visual Odometry are low.
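These two metrics can be computed as in the sketch below. It assumes that orientations are stored as quaternions in (w, x, y, z) order and that the angular difference is the angle of the relative rotation; the function names are hypothetical.

```python
import numpy as np

def position_rmse(pred, gt):
    """RMSE of the Euclidean distance between predicted and true positions."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.sqrt(np.mean(dist ** 2)))

def orientation_error_deg(q_pred, q_gt):
    """Per-sample angle (degrees) of the rotation between two quaternions."""
    q_pred = np.asarray(q_pred, dtype=float)
    q_gt = np.asarray(q_gt, dtype=float)
    q_pred = q_pred / np.linalg.norm(q_pred, axis=1, keepdims=True)
    q_gt = q_gt / np.linalg.norm(q_gt, axis=1, keepdims=True)
    # abs() handles the double cover: q and -q describe the same rotation.
    dot = np.abs(np.sum(q_pred * q_gt, axis=1))
    return np.degrees(2.0 * np.arccos(np.clip(dot, -1.0, 1.0)))
```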

The model has been tested on 2 completely different datasets. The King’s College dataset is an outdoor dataset with a variety of difficult scenarios such as motion blur, clutter from pedestrians, and varying lighting conditions. The Smith Hall dataset, on the other hand, is an indoor dataset with dim lighting and very few keypoint features in certain areas. The Visual-only odometry module thus performs robustly across both settings.

Inertial Odometry

These are the results from our Inertial-only module, which has been tested on 3 datasets to showcase its robustness.

The metric used for measuring orientation error is the RMSE between the predicted quaternions and the ground truth quaternions.

For position error, we use Absolute Trajectory Error (ATE) which is the RMSE between corresponding points in the estimated and ground truth trajectories. This is a measure of global consistency and usually increases with trajectory length.
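Both inertial metrics are sketched below on synthetic data. The sign alignment in `quaternion_rmse` is an assumption we add because q and -q describe the same rotation, and the trajectory with a linearly growing lateral drift is made up purely to illustrate why ATE tends to grow with trajectory length. No trajectory alignment step is applied here.

```python
import numpy as np

def quaternion_rmse(q_pred, q_gt):
    """RMSE between predicted and true quaternions after sign alignment."""
    q_pred = np.asarray(q_pred, dtype=float)
    q_gt = np.asarray(q_gt, dtype=float)
    dots = np.sum(q_pred * q_gt, axis=1, keepdims=True)
    sign = np.where(dots < 0.0, -1.0, 1.0)
    diff = sign * q_pred - q_gt
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

def absolute_trajectory_error(est, gt):
    """RMSE between corresponding points of two trajectories."""
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.sqrt(np.mean(np.sum((est - gt) ** 2, axis=1))))

# Synthetic straight-line trajectory with a slowly growing lateral drift.
steps = np.arange(1000)
gt_traj = np.stack([0.1 * steps, np.zeros(1000), np.zeros(1000)], axis=1)
est_traj = gt_traj.copy()
est_traj[:, 1] += 0.001 * steps

ate_short = absolute_trajectory_error(est_traj[:100], gt_traj[:100])
ate_long = absolute_trajectory_error(est_traj, gt_traj)
print(ate_short, ate_long)  # the longer trajectory accumulates more error
```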

Combined Low-Power Visual-Inertial Odometry

Plot showing the increase in error and the decrease in FLOPS as the VO module is used less frequently

To capitalize on the strengths of each component, we use the accurate Visual Odometry (VO) module at fixed time intervals to reset the low-power Inertial Odometry (IO) module.

We gradually reduce the frequency at which VO is used for pose prediction. Every i seconds, we use the VO prediction instead of the IO prediction and reset the IO module with it. As expected, the position and orientation errors keep increasing as we reduce the dependency on VO. These experiments were performed on the Smith Hall dataset.
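The reset schedule can be illustrated with the toy simulation below. It is a sketch under strong assumptions: the IO module is modelled as integrating per-step position deltas with a constant 5% scale bias, VO is modelled as returning the true position, and each step stands for one second. The helper names `run_with_resets` and `rmse` are hypothetical.

```python
import numpy as np

def run_with_resets(io_deltas, vo_positions, interval):
    """Integrate IO deltas, replacing the estimate with VO every `interval` steps."""
    estimate = np.zeros_like(vo_positions)
    pos = vo_positions[0].copy()
    for k in range(len(vo_positions)):
        if k % interval == 0:
            pos = vo_positions[k].copy()  # reset the IO state with the VO pose
        else:
            pos = pos + io_deltas[k]      # dead-reckon with the IO delta
        estimate[k] = pos
    return estimate

def rmse(est, gt):
    """RMSE between corresponding positions."""
    return float(np.sqrt(np.mean(np.sum((est - gt) ** 2, axis=1))))

# Toy data: true motion of 1 unit per step along x, IO deltas biased by 5%.
n = 20
gt = np.stack([np.arange(n, dtype=float), np.zeros(n), np.zeros(n)], axis=1)
io_deltas = np.tile([1.05, 0.0, 0.0], (n, 1))

err_every_5 = rmse(run_with_resets(io_deltas, gt, 5), gt)
err_every_10 = rmse(run_with_resets(io_deltas, gt, 10), gt)
print(err_every_5, err_every_10)  # fewer resets leave more accumulated drift
```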

The plot also reports the Floating Point Operations Per Second (FLOPS) of our system. We use FLOPS as a proxy for how power-hungry a module is and how much latency it adds. As we reduce the usage frequency of the VO module, the FLOPS also decrease. This is because the VO module requires 2 G FLOPS, while the IO module requires only 471 K. The VO module also has higher latency than the IO module.
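The trend can be reproduced with simple arithmetic. The sketch below assumes one pose prediction per second and that VO replaces IO on every i-th second; both are simplifications for illustration, and `average_cost` is a hypothetical helper. Only the 2 G and 471 K figures come from the text above.

```python
VO_COST = 2e9    # cost of the VO module, as reported above
IO_COST = 471e3  # cost of the IO module, as reported above

def average_cost(interval):
    """Average cost per prediction when VO runs once every `interval` predictions."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return (VO_COST + (interval - 1) * IO_COST) / interval

for i in (1, 2, 5, 10):
    print(i, average_cost(i))
```

Under these assumptions the average cost is dominated by the VO term and falls roughly as 1/i, which matches the downward trend in the plot.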

Thus, we conclude that using VO intermittently improves the accuracy of the IO method.



Our task is to develop a low-power state estimation algorithm for Aria glasses using a history of Inertial Measurement Unit (IMU) measurements combined with a sparse number of RGB images as input.


The IMU combines accelerometer, gyroscope, and magnetometer sensors. The accelerometer measures linear acceleration, while the gyroscope measures angular velocity. The magnetometer measures the magnetic flux density, which captures both the strength and the direction of the magnetic field.

Each of the 3 sensors takes measurements along 3 axes relative to the device.


Fusing the camera’s relative pose with the inertial sensor measurements can increase the accuracy and robustness of the state estimates compared with inertial-only methods such as IDOL, with only a small increase in power consumption.