Video-Based Online Human Mesh Recovery

Group Members: Yiwen Zhao, Liting Wen, Aniket Agarwal, Ce Zheng, Laszlo Jeni

Introduction

Task:

Estimate the parametric human mesh from an input video.

Our focus:

  • Integrate spatial-temporal information from video
    • Boost motion smoothness and temporal consistency.
  • Online inference
    • Minimize scale variation.
    • Use scene information without pre-extracted pose.
  • Reduce long-window dependency (see the streaming sketch after this list)
    • Rely on previous frames only to support online inference.
    • Shorten the temporal window to recover quickly from frequent camera view shifts.
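
As a concrete illustration of the short-window online setting above, here is a minimal Python sketch of a causal streaming loop: the estimator only ever sees the current frame plus a small buffer of past frames, so stale views fall out of scope quickly after a camera shift. The `estimate_mesh` regressor, the window size, and the parameter layout are all hypothetical placeholders, not our actual model.

```python
from collections import deque

import numpy as np

WINDOW = 8  # short temporal window (hypothetical size)


def estimate_mesh(window: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the per-window mesh regressor: maps a
    (t, H, W, 3) stack of frames to SMPL-style parameters for the newest frame."""
    return np.zeros(85)  # e.g., 72 pose + 10 shape + 3 translation (placeholder)


def online_stream(frame_source):
    """Causal streaming loop: one estimate per incoming frame, conditioned
    only on the current frame and at most WINDOW - 1 past frames."""
    buffer = deque(maxlen=WINDOW)  # oldest frames fall out automatically
    for frame in frame_source:
        buffer.append(frame)
        window = np.stack(list(buffer))  # (t, H, W, 3) with t <= WINDOW
        yield estimate_mesh(window)      # no future frames are ever touched


# Usage: random frames standing in for a decoded video stream.
frames = (np.random.rand(64, 64, 3) for _ in range(30))
for params in online_stream(frames):
    pass  # feed params to rendering / downstream smoothing
```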

Related Works

WHAM

  • Two-Stage Training: Stage 1 learns a (frame-based) motion prior from large-scale synthetic data generated from AMASS; stage 2 uses video to learn temporal correlation.
  • Global Trajectory & Local Pose: Estimate the global trajectory from camera angular velocity and motion features, then refine it using predicted foot-contact labels (a simplified sketch follows this list).
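
The foot-contact refinement can be illustrated with a small sketch: a foot labeled as in contact should have near-zero world velocity, so any residual foot motion can be read as trajectory drift and subtracted from the root translation. This is a deliberately simplified, hypothetical version of the idea; WHAM's actual refinement network works differently.

```python
import numpy as np


def refine_trajectory(root_pos, foot_pos, contact_prob, thresh=0.5):
    """root_pos, foot_pos: (T, 3) world positions; contact_prob: (T,) in [0, 1].
    Treat any foot motion during contact as drift and subtract it from the root."""
    root = root_pos.copy()
    drift = np.zeros(3)
    for t in range(1, len(root)):
        if contact_prob[t] > thresh:
            # A foot in contact should be static; its displacement is drift.
            drift += foot_pos[t] - foot_pos[t - 1]
        root[t] -= drift  # corrections accumulate causally over time
    return root


# Usage: a drifting trajectory with the foot nominally planted the whole time.
T = 10
root = np.cumsum(0.01 * np.random.randn(T, 3), axis=0)
foot = root + np.array([0.1, -0.9, 0.0])  # foot rigidly offset from the root
contact = np.ones(T)                      # contact throughout (toy labels)
print(refine_trajectory(root, foot, contact)[-1])  # collapses to the start pose
```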

TRAM

  • Uses the scene background to derive the metric scale of the motion (a scale-fitting sketch follows this list).
  • Global Trajectory & Local Pose: Recover camera motion with SLAM, then compose it with the camera-frame human pose to obtain the global trajectory.
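
One common way to derive metric scale from the scene, sketched below under assumptions: monocular SLAM recovers camera motion and scene depth only up to an unknown scale s, while a metric-depth network predicts absolute depth, so a one-dimensional least-squares fit over background pixels recovers s. This illustrates the general idea rather than TRAM's exact pipeline.

```python
import numpy as np


def solve_scale(d_slam: np.ndarray, d_metric: np.ndarray) -> float:
    """Closed-form 1-D least squares over background-pixel depths:
    s* = argmin_s ||s * d_slam - d_metric||^2 = <d_slam, d_metric> / <d_slam, d_slam>."""
    return float(d_slam @ d_metric / (d_slam @ d_slam))


# Usage: synthetic depths where the true scale is 2.5.
d_slam = np.random.rand(1000) + 0.1               # up-to-scale SLAM depths
d_metric = 2.5 * d_slam + 0.01 * np.random.randn(1000)  # noisy metric depths
print(solve_scale(d_slam, d_metric))              # ~2.5
```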

MotionStreamer

  • Streaming Inputs + Online Outputs.
  • Models a temporal causal dependency between the current and historical motion latents for more accurate online decoding (a causal-attention sketch follows this list).
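
A minimal sketch of the causal-dependency idea: restrict attention so each motion latent can attend only to itself and earlier latents, keeping decoding strictly online. The dimensions and the plain multi-head attention layer are stand-ins; MotionStreamer's diffusion-based autoregressive model is more involved.

```python
import torch
import torch.nn as nn

T, D = 16, 64  # window length and latent dimension (hypothetical)

attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
latents = torch.randn(1, T, D)  # current + historical motion latents

# Upper-triangular boolean mask: position t is blocked from attending to t' > t.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

out, _ = attn(latents, latents, latents, attn_mask=causal_mask)
assert out.shape == (1, T, D)  # one decoded latent per time step, causally
```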

References

Shin S., Kim J., Halilaj E., et al. WHAM: Reconstructing World-Grounded Humans with Accurate 3D Motion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024: 2070-2080.

Wang Y., Wang Z., Liu L., et al. TRAM: Global Trajectory and Motion of 3D Humans from In-the-Wild Videos. In: European Conference on Computer Vision (ECCV), Springer, 2024: 467-487.

Xiao L., Lu S., Pi H., et al. MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space. arXiv preprint arXiv:2503.15451, 2025.

Methodology

  • Short temporal windows, to support online inference and fast recovery from camera view shifts.
  • Spatial-Temporal Modulation (one plausible form is sketched after this list).
  • Reconstruction, estimation, and prediction of motion.
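
Since the design is still at the outline stage, here is one plausible reading of "Spatial-Temporal Modulation", sketched under assumptions: a causal temporal summary of the short window predicts FiLM-style per-channel scale and shift parameters that modulate the current frame's spatial features. All module choices (GRU, feature sizes) are hypothetical, not the final design.

```python
import torch
import torch.nn as nn


class SpatialTemporalModulation(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # GRU summarizes the short window of per-frame spatial features.
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # Predict per-channel scale (gamma) and shift (beta) from the summary.
        self.to_film = nn.Linear(feat_dim, 2 * feat_dim)

    def forward(self, window_feats):
        """window_feats: (B, T, C) spatial features for the last T frames."""
        _, h = self.temporal(window_feats)       # h: (1, B, C) temporal context
        gamma, beta = self.to_film(h[-1]).chunk(2, dim=-1)
        current = window_feats[:, -1]            # spatial features at time t
        return (1 + gamma) * current + beta      # temporally modulated features


# Usage over a short window (T = 8), e.g., for online per-frame regression.
mod = SpatialTemporalModulation()
feats = torch.randn(2, 8, 256)
print(mod(feats).shape)  # torch.Size([2, 256])
```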