Related Work

3D-LFM: Lifting Foundation Model

  • Utilizes Graph Transformer with Procrustean alignment to learn non-rigid deformations
  • Robust to the order and number of input keypoints – displays permutation equivariance
  • Shows OOD generalization on unseen categories
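The Procrustean alignment used here is the classical orthogonal Procrustes / Kabsch solution: recover the rigid rotation and translation that best maps one point set onto another via SVD. A minimal NumPy sketch (names and shapes are illustrative, not 3D-LFM's implementation):

```python
import numpy as np

def procrustes_align(X, Y):
    """Rigidly align point set X to Y (rotation R, translation t) via SVD.
    X, Y: (N, 3) arrays of corresponding 3D points."""
    mu_x, mu_y = X.mean(0), Y.mean(0)
    Xc, Yc = X - mu_x, Y - mu_y
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)
    # Correct for reflection so R is a proper rotation (det = +1)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_y - R @ mu_x
    return R, t

# Align a rotated + translated copy of a point cloud back onto the original
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
Y = X @ R_true.T + np.array([1.0, -2.0, 0.5])
R, t = procrustes_align(X, Y)
err = np.abs(X @ R.T + t - Y).max()  # should be near machine precision
```

Aligning predictions to ground truth this way makes the training loss invariant to rigid pose, so the network only has to learn the non-rigid deformation.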
Unsupervised Keypoints from Pretrained Diffusion Models

  • Detects semantically meaningful 2D keypoints in an unsupervised way
  • Uses emergent knowledge within pretrained Stable Diffusion model
  • A randomly initialized text embedding is optimized against the frozen diffusion model so that its cross-attention maps attend to the relevant keypoints in the image
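The core mechanism – optimizing an embedding so that its attention map peaks at a desired location while the backbone stays frozen – can be sketched with a toy cross-attention in PyTorch. Everything here (feature grid, target index, step counts) is illustrative and stands in for Stable Diffusion's cross-attention layers:

```python
import torch

torch.manual_seed(0)
H = W = 8; d = 16
feats = torch.randn(H * W, d)                   # frozen "image" features
query = torch.randn(1, d, requires_grad=True)   # learnable text-like embedding
target_idx = 27                                 # desired keypoint cell

opt = torch.optim.Adam([query], lr=0.1)
for _ in range(300):
    opt.zero_grad()
    # Cross-attention of the learnable query over the frozen feature grid
    attn = torch.softmax(query @ feats.T / d ** 0.5, dim=-1)  # (1, H*W)
    loss = -torch.log(attn[0, target_idx] + 1e-9)  # sharpen peak at target
    loss.backward()
    opt.step()

attn = torch.softmax(query @ feats.T / d ** 0.5, dim=-1).detach()
```

After optimization the attention map's argmax lands on the target cell; in the actual method the "target" emerges from unsupervised objectives rather than a fixed index.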

MotionBERT: A Unified Perspective on Learning Human Motion Representations

  • Motion encoder learns human motion patterns using pretraining on noisy, occluded inputs
  • Uses dual-stream spatio-temporal attention blocks
  • A finetuned MLP head adapts the representation to downstream tasks – pose estimation, mesh reconstruction, action recognition
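Spatio-temporal attention over a motion tensor means attending across joints within each frame and across frames for each joint. A simplified sequential variant (MotionBERT's DSTformer fuses two parallel streams; dimensions and layout here are illustrative):

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Toy block: spatial self-attention over joints, then temporal
    self-attention over frames, on a (batch, frames, joints, dim) tensor."""
    def __init__(self, dim=32, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                   # x: (B, T, J, D)
        B, T, J, D = x.shape
        s = x.reshape(B * T, J, D)          # attend across joints per frame
        s = self.norm1(s + self.spatial(s, s, s)[0]).reshape(B, T, J, D)
        t = s.permute(0, 2, 1, 3).reshape(B * J, T, D)  # across time per joint
        t = self.norm2(t + self.temporal(t, t, t)[0])
        return t.reshape(B, J, T, D).permute(0, 2, 1, 3)

x = torch.randn(2, 16, 17, 32)  # 2 clips, 16 frames, 17 joints, dim 32
y = SpatioTemporalBlock()(x)    # same shape out as in
```

Factorizing attention this way keeps the cost at O(J²) + O(T²) per block instead of O((TJ)²) for full joint-time attention.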

TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation

  • A feature extractor + vision transformer backbone with separate MLP heads to predict camera parameters, pose, and shape from an input image
  • Instead of regressing SMPL pose parameters directly, it predicts pose token classes, which are then reconstructed into a continuous pose via the pretrained codebook
  • A VQ-VAE is pretrained to act as the tokenizer: its codebook encodes SMPL poses into discrete tokens, and its decoder reconstructs continuous poses from them
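The discretization step of such a pose tokenizer is the standard VQ-VAE nearest-codebook lookup: each continuous feature is snapped to its closest codebook entry, whose index is the discrete "pose token". A minimal sketch with illustrative sizes (not TokenHMR's actual codebook):

```python
import torch

torch.manual_seed(0)
codebook = torch.randn(512, 64)    # 512 learned code vectors of dim 64
pose_feats = torch.randn(24, 64)   # e.g. one encoder feature per body part

dists = torch.cdist(pose_feats, codebook)  # (24, 512) pairwise L2 distances
tokens = dists.argmin(dim=1)               # discrete pose token ids, (24,)
quantized = codebook[tokens]               # decoder input, (24, 64)
```

Predicting a class over this finite codebook (rather than regressing raw parameters) restricts outputs to the manifold of plausible poses the VQ-VAE was trained on.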