3D-LFM: Lifting Foundation Model

- Utilizes Graph Transformer with Procrustean alignment to learn non-rigid deformations
- Robust to the order and number of input keypoints – exhibits permutation equivariance
- Shows OOD generalization on unseen categories
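The Procrustean alignment the model relies on can be sketched with the classical SVD-based orthogonal Procrustes solution (a minimal numpy sketch; the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def procrustes_align(X, Y):
    """Rigidly align point set Y to X (orthogonal Procrustes).

    X, Y: (N, 3) arrays of corresponding 3D keypoints (hypothetical shapes).
    Returns the aligned copy of Y and the recovered rotation.
    """
    # Center both point sets so the alignment reduces to a pure rotation.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # SVD of the cross-covariance yields the optimal rotation.
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)
    # Correct a possible reflection so R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return Yc @ R.T + X.mean(axis=0), R
```

Factoring out the rigid pose this way lets the network spend its capacity on the non-rigid deformation alone.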
Unsupervised Keypoints from Pretrained Diffusion Models

- Detects semantically meaningful 2D keypoints in an unsupervised manner
- Uses emergent knowledge within pretrained Stable Diffusion model
- A randomly initialized text embedding is optimized against the frozen diffusion model so that its attention maps localize the relevant keypoints in the image
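Once a token's attention map peaks at a keypoint, a coordinate can be read off differentiably with a soft-argmax. A minimal sketch, assuming a single (H, W) attention map as input (in practice it would be aggregated from the diffusion model's cross-attention layers):

```python
import numpy as np

def soft_argmax_2d(attn, temperature=0.1):
    """Differentiable keypoint location from one token's attention map.

    attn: (H, W) attention map (hypothetical input format).
    Returns (row, col) as the attention-weighted expected location.
    """
    H, W = attn.shape
    # Sharpen and normalize the map into a probability distribution.
    p = np.exp(attn / temperature)
    p /= p.sum()
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Expected coordinates under the attention distribution.
    return float((p * rows).sum()), float((p * cols).sum())
```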
MotionBERT: A Unified Perspective on Learning Human Motion Representations

- Motion encoder learns human motion patterns using pretraining on noisy, occluded inputs
- Uses dual-stream spatio-temporal attention blocks
- A lightweight MLP head is finetuned for downstream tasks – 3D pose estimation, mesh reconstruction, and action recognition
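The dual-stream idea can be sketched on a (T, J, C) motion tensor: one stream lets joints attend to each other within a frame, the other lets each joint attend to itself across frames. A toy numpy version (single-head attention, averaged fusion as a simplification; MotionBERT learns an adaptive fusion):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Plain single-head self-attention over the rows of X: (N, C)."""
    d = X.shape[-1]
    A = softmax(X @ X.T / np.sqrt(d))  # (N, N) attention weights
    return A @ X

def dst_block(motion):
    """Toy spatio-temporal block over a (T, J, C) motion tensor.

    Spatial stream: joints attend to each other within every frame.
    Temporal stream: each joint attends to itself across frames.
    """
    T, J, C = motion.shape
    spatial = np.stack([self_attention(motion[t]) for t in range(T)])
    temporal = np.stack([self_attention(motion[:, j]) for j in range(J)], axis=1)
    # Simple average fusion (a stand-in for the learned fusion weights).
    return 0.5 * (spatial + temporal)
```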
TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation

- A feature extractor with a vision-transformer backbone and separate MLP heads predicts camera parameters, pose, and shape from an input image
- A VQ-VAE is pretrained to act as a tokenizer codebook: SMPL poses are encoded into discrete tokens and reconstructed by the decoder
- Instead of regressing SMPL pose directly, the model predicts a pose token class, which is then decoded into a continuous pose using the pretrained codebook
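The core VQ step above is a nearest-neighbor lookup into the learned codebook. A minimal sketch, with hypothetical shapes (the real tokenizer operates on SMPL pose latents):

```python
import numpy as np

def quantize(z, codebook):
    """Map continuous latents to discrete codebook tokens (VQ step).

    z: (N, D) encoder outputs; codebook: (K, D) learned code vectors
    (both shapes are illustrative). Returns the token indices and the
    quantized vectors the decoder would reconstruct poses from.
    """
    # Squared L2 distance from every latent to every code vector.
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = d.argmin(axis=1)        # discrete pose tokens
    return tokens, codebook[tokens]  # vectors fed to the decoder
```

Predicting a class over this finite vocabulary, rather than regressing pose directly, constrains outputs to the manifold of valid poses captured by the codebook.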