We tested our end-to-end image/video-to-3D-pose pipeline on a few rigid and non-rigid object categories. Some results from these experiments are shown below.
Image-to-2D

Head and leg keypoints of the chimpanzee being tracked across frames in a video
Failure cases

A non-object-centric keypoint (keypoint 9) being detected.

Keypoints on the right and left legs of the chimpanzee being swapped between frames
The method described above extracts 2D keypoints, but it suffers from the failure cases shown: spurious, non-object-centric detections and left/right swaps. We add first-frame supervision and temporal-consistency constraints to the model to eliminate these, and we also make 3D-LFM robust to this kind of noise, as explained below.
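The left/right swap failure above can be corrected with a simple temporal-consistency heuristic: for each symmetric joint pair, keep whichever left/right assignment moves least relative to the previous frame. This is a minimal sketch of that idea, not our exact implementation; the function name and joint-index layout are hypothetical.

```python
import numpy as np

def fix_left_right_swaps(keypoints, left_ids, right_ids):
    """Heuristically undo left/right keypoint swaps between consecutive frames.

    keypoints: (T, K, 2) array of 2D detections over T frames.
    left_ids / right_ids: paired joint indices (hypothetical layout).
    """
    fixed = keypoints.copy()
    for t in range(1, len(fixed)):
        for l, r in zip(left_ids, right_ids):
            # total displacement if we keep the current assignment...
            keep = (np.linalg.norm(fixed[t, l] - fixed[t - 1, l])
                    + np.linalg.norm(fixed[t, r] - fixed[t - 1, r]))
            # ...versus if we swap the left and right detections
            swap = (np.linalg.norm(fixed[t, r] - fixed[t - 1, l])
                    + np.linalg.norm(fixed[t, l] - fixed[t - 1, r]))
            if swap < keep:  # swapped assignment is more temporally consistent
                fixed[t, [l, r]] = fixed[t, [r, l]]
    return fixed
```

A per-pair greedy check like this assumes the first frame is correct (which the first-frame supervision provides) and that motion between frames is small relative to the left/right separation.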
Robust 2D-to-3D

Since the 2D-to-3D model takes in 2D keypoints from a detector rather than ground truth, it has to be robust to noise and occlusions. As shown above, the detected keypoints can be noisy, with a Gaussian sigma of up to 30 in pixel space. We therefore re-train the model to make it more robust.
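To simulate detector errors during training and evaluation, the 2D inputs can be perturbed with Gaussian pixel noise and random occlusion masking. The sketch below shows one way to do this under our stated settings (sigma up to 30 px, a fraction of keypoints masked); the function name and the convention of zeroing out occluded joints are assumptions, not the exact recipe we use.

```python
import numpy as np

def perturb_keypoints(kps, sigma=10.0, mask_ratio=0.1, rng=None):
    """Simulate detector errors: Gaussian pixel noise plus random occlusion.

    kps: (K, 2) 2D keypoints in pixel space.
    Returns the perturbed keypoints and a boolean visibility mask;
    occluded joints are zeroed out (one common convention).
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = kps + rng.normal(0.0, sigma, size=kps.shape)
    visible = rng.random(len(kps)) >= mask_ratio
    noisy[~visible] = 0.0
    return noisy, visible
```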

2D-to-3D lifting on keypoints with sigma=10 gaussian noise

2D-to-3D lifting on keypoints with 10% of keypoints masked
Shown above is the behaviour of 3D-LFM, measured as MPJPE (Mean Per Joint Position Error), under input noise and occlusions. Performance degrades as the Gaussian noise sigma and the number of occluded keypoints increase, whereas we would like it to degrade gracefully. We therefore train a robust variant of 3D-LFM, which we call 3D-LFM++.
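For reference, MPJPE is the standard 3D pose metric: the Euclidean distance between each predicted joint and its ground-truth position, averaged over joints (and frames). A minimal implementation:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error.

    pred, gt: (..., K, 3) arrays of 3D joint positions in the same units
    (typically millimetres). Returns the mean Euclidean per-joint distance.
    """
    return np.linalg.norm(pred - gt, axis=-1).mean()
```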

The first table reports performance on the Human3.6M validation set with and without noise. We train one version of 3D-LFM (model 2) from scratch on noisy data only, and another version (model 3) by fine-tuning a pretrained model on the noisy data. The noise is introduced into the data following MotionBERT, as explained previously. Model 3 gives the best trade-off between performance on noisy and clean data.
The second table shows that 3D-LFM++ achieves performance close to the state of the art (MotionBERT) even without using temporal context!

Thanks to this robust training, some challenges such as depth ambiguity are resolved, as shown above.
Mesh reconstructions
Shown below are some outputs from our image-to-mesh model, built on the tokenHMR decoder.

