Motivation
Inferring the 3D pose and structure of non-rigid, deformable objects is a central problem in computer vision research. Extracting 3D structure from 2D images or videos is hard and ambiguous, especially for deformable objects such as animals and humans. Traditionally, 3D pose is obtained with multi-camera rigs or capture domes that record multi-view images of the objects inside them, from which the 3D structure is inferred. However, this setup requires accurate camera calibration, is expensive, and is rarely feasible for categories other than humans. With the availability of large datasets and deep learning techniques, we intend to build an end-to-end pipeline that recovers 3D structure from just an image or video as input, across multiple categories of rigid and non-rigid objects.

Applications
- AR/VR
- Animal/Human Behaviour Modelling
- 3D Activity Recognition
- Dense 3D Object Reconstruction
- Dynamic Deformable Object Splatting

Towards a Universal 3D Lifting Pipeline

The goal is to build a pipeline that takes an image or a video of most rigid objects (boat, airplane, bottle, vase, etc.) and non-rigid objects (humans, animals, faces, hands, etc.) and extracts the 3D structure from it. It should have three key properties:
- End-to-end (E2E) nature – minimal human supervision
- Out-of-distribution (OOD) generalization – behaves as a foundation model
- Temporal consistency – semantics preserved across frames
Additionally, can we also reconstruct a mesh?


As shown above, the universal image-to-mesh pipeline we intend to build has three stages: first, 2D keypoints are extracted from the input image; second, the 2D keypoints are lifted to 3D; and finally, a dense 3D mesh is reconstructed from the 3D keypoints.
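The three stages can be sketched as a simple chain of functions. This is a minimal illustration only: the function names are hypothetical, and each stage body is a trivial stand-in (a real system would run learned models at each step).

```python
from typing import List, Tuple

Keypoint2D = Tuple[float, float]
Keypoint3D = Tuple[float, float, float]
Triangle = Tuple[int, int, int]  # indices into a vertex list


def extract_keypoints_2d(image: List[List[int]]) -> List[Keypoint2D]:
    """Stage 1: 2D keypoint extraction.

    A real system would run a learned 2D keypoint detector here;
    this stub simply returns the four image corners.
    """
    h, w = len(image), len(image[0])
    return [(0.0, 0.0), (w - 1.0, 0.0), (0.0, h - 1.0), (w - 1.0, h - 1.0)]


def lift_to_3d(kps: List[Keypoint2D]) -> List[Keypoint3D]:
    """Stage 2: 2D-to-3D lifting.

    A real system would regress a depth value per keypoint;
    this stub assigns z = 0 to every point.
    """
    return [(x, y, 0.0) for (x, y) in kps]


def reconstruct_mesh(
    kps: List[Keypoint3D],
) -> Tuple[List[Keypoint3D], List[Triangle]]:
    """Stage 3: dense mesh reconstruction from sparse 3D keypoints.

    A real system would fit a dense surface; this stub triangulates
    the keypoints as a fan around vertex 0.
    """
    faces = [(0, i, i + 1) for i in range(1, len(kps) - 1)]
    return kps, faces


def image_to_mesh(image: List[List[int]]) -> Tuple[List[Keypoint3D], List[Triangle]]:
    """The full pipeline: image -> 2D keypoints -> 3D keypoints -> mesh."""
    return reconstruct_mesh(lift_to_3d(extract_keypoints_2d(image)))
```

The point of the sketch is the interface, not the stub bodies: each stage consumes the previous stage's output, so the pipeline composes end-to-end and each stage can be swapped for a learned model independently.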