Video-Based Online Human Mesh Recovery
Group Members: Yiwen Zhao, Liting Wen, Aniket Agarwal, Ce Zheng, Laszlo Jeni
Introduction
Task:
Estimate a parametric human mesh from an input video.
Our focus:
- Integrate spatial-temporal information from video
  - Minimize scale variation.
  - Boost motion smoothness and temporal consistency.
- Online inference
  - Use scene information without relying on pre-extracted poses.
- Reduce long-window dependency
  - Rely on previous frames only to support online inference.
  - Shorten the temporal window to recover quickly from frequent camera view shifts.

Related Work
WHAM
- Two-Stage Training: learn a motion prior from the large-scale synthetic AMASS dataset in stage 1, then learn temporal correlation from video data in stage 2.
- Global Trajectory & Local Pose: combine per-frame motion features with camera angular velocity to decode the global trajectory, then refine it using predicted foot-ground contact (sketched below).
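To make the decoupled decoding concrete, here is a minimal PyTorch sketch of the idea, assuming a GRU-based causal decoder; the module names, sizes, and output parameterization are our illustrative choices, not WHAM's released code.

```python
# Minimal sketch of WHAM-style causal trajectory decoding (illustrative only).
import torch
import torch.nn as nn

def so3_exp(v):
    # Axis-angle vector (..., 3) -> rotation matrix (..., 3, 3).
    zero = torch.zeros_like(v[..., 0])
    K = torch.stack([
        torch.stack([zero, -v[..., 2], v[..., 1]], dim=-1),
        torch.stack([v[..., 2], zero, -v[..., 0]], dim=-1),
        torch.stack([-v[..., 1], v[..., 0], zero], dim=-1),
    ], dim=-2)
    return torch.linalg.matrix_exp(K)

class TrajectoryDecoder(nn.Module):
    def __init__(self, feat_dim=512, hidden=512):
        super().__init__()
        # Uni-directional GRU: each step sees past frames only,
        # which is what makes online decoding possible.
        self.rnn = nn.GRU(feat_dim + 3, hidden, batch_first=True)
        # Per frame: root rotation delta (3), root velocity (3), foot contact (4).
        self.head = nn.Linear(hidden, 10)

    def forward(self, motion_feats, cam_ang_vel):
        # motion_feats: (B, T, feat_dim); cam_ang_vel: (B, T, 3) from SLAM/gyro.
        h, _ = self.rnn(torch.cat([motion_feats, cam_ang_vel], dim=-1))
        out = self.head(h)
        rot_delta, root_vel = out[..., :3], out[..., 3:6]
        contact = out[..., 6:].sigmoid()  # usable downstream to refine the trajectory
        return rot_delta, root_vel, contact

def rollout(rot_delta, root_vel):
    # Integrate per-frame increments into a world-frame root trajectory.
    B, T, _ = root_vel.shape
    R, pos, traj = torch.eye(3).expand(B, 3, 3), torch.zeros(B, 3), []
    for t in range(T):
        R = R @ so3_exp(rot_delta[:, t])                   # accumulate heading
        pos = pos + (R @ root_vel[:, t, :, None])[..., 0]  # step in world frame
        traj.append(pos)
    return torch.stack(traj, dim=1)  # (B, T, 3)
```

The uni-directional recurrence is the key design choice here: because no future frame is ever read, the same decoder supports online inference.
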
TRAM
- Uses the scene background to recover camera motion and derive the metric scale of the human trajectory.
- Global Trajectory & Local Pose: composes the scaled camera trajectory with the camera-frame human motion to obtain the global trajectory, while a video transformer regresses the local body pose (see the sketch below).
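The scale idea reduces to one rigid-body composition per frame. Below is a NumPy sketch of that step; all variable names and the scalar `scale` are illustrative assumptions, not TRAM's actual interface.

```python
# World-frame human pose = (scaled) camera-to-world pose composed with the
# camera-frame human pose. Illustrative sketch, not TRAM's code.
import numpy as np

def world_human_trajectory(R_wc, t_wc, scale, R_ch, t_ch):
    # R_wc, t_wc: (T, 3, 3), (T, 3) camera-to-world poses from SLAM; the SLAM
    #             translation is up-to-scale, and `scale` (derived from the
    #             scene background) makes it metric.
    # R_ch, t_ch: (T, 3, 3), (T, 3) human root pose in the camera frame,
    #             e.g., from a per-frame mesh-recovery network.
    R_wh = R_wc @ R_ch                                   # batched rotation compose
    t_wh = (R_wc @ t_ch[..., None])[..., 0] + scale * t_wc
    return R_wh, t_wh                                    # world-frame human poses
```
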

MotionStreamer
- Streaming Inputs + Online Outputs.
- Models a temporally causal dependency between current and historical motion latents for more accurate online decoding (sketched below).
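The dependency structure can be illustrated with a standard causal attention mask. The sketch below is a generic causal transformer over motion latents, standing in for (not reproducing) MotionStreamer's diffusion-based autoregressive model.

```python
# Generic causal-attention sketch: latent z_t may attend only to z_1..z_t,
# so decoding never waits for future frames.
import torch.nn as nn

class CausalLatentDecoder(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)

    def forward(self, latents):
        # latents: (B, T, dim) motion latents in arrival order.
        mask = nn.Transformer.generate_square_subsequent_mask(latents.size(1))
        return self.blocks(latents, mask=mask)  # causal: no future attention
```
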

References
Shin S, Kim J, Halilaj E, et al. WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 2070-2080.
Wang Y, Wang Z, Liu L, et al. TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024: 467-487.
Xiao L, Lu S, Pi H, et al. MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space[J]. arXiv preprint arXiv:2503.15451, 2025.
Methodology
- Short temporal windows.
- Spatial-temporal modulation.
- Reconstruction, estimation, and prediction (see the sketch below).
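As a sketch of how these pieces are meant to fit together at inference time (our planned design; the model interface and window size are placeholders, not a finished implementation):

```python
# Online short-window loop: keep a small buffer of past frames, process it
# jointly, and emit the mesh for the newest frame only.
from collections import deque

WINDOW = 8  # short temporal window -> fast recovery after camera view shifts

def online_mesh_stream(frames, model):
    buffer = deque(maxlen=WINDOW)  # past frames only: supports online inference
    for frame in frames:
        buffer.append(frame)
        # The model reconstructs past frames, estimates the current one,
        # and may predict ahead; only the current estimate is emitted.
        meshes = model(list(buffer))
        yield meshes[len(buffer) - 1]
```
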
