Video-Based Online Human Mesh Recovery
Group Members: Yiwen Zhao, Liting Wen, Aniket Agarwal, Ce Zheng, Laszlo Jeni
Introduction
Task:
Estimate a parametric human mesh from an input video.
Our focus:
- Integrate spatial-temporal information from video
  - Minimize scale variation.
  - Boost motion smoothness and temporal consistency.
- Online inference
  - Use scene information without relying on pre-extracted poses.
- Reduce long-window dependency
  - Rely on previous frames only to support online inference.
  - Shorten the temporal window to recover quickly from frequent camera view shifts.

Related Work
WHAM
- Two-Stage Training: learn a motion prior from the large-scale synthetic AMASS dataset in stage 1, then learn temporal correlation from video data in stage 2.
- Global Trajectory & Local Pose: combine per-frame motion features with camera angular velocity to decode the global trajectory, then refine it using predicted foot-ground contact (sketched below).
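To make the decoupled decoding concrete, here is a minimal PyTorch sketch of the idea, assuming a GRU-based causal decoder; the module names, sizes, and output parameterization are our illustrative choices, not WHAM's released code.

```python
# Minimal sketch of WHAM-style causal trajectory decoding (illustrative only).
import torch
import torch.nn as nn

def so3_exp(v):
    # Axis-angle vector (..., 3) -> rotation matrix (..., 3, 3).
    zero = torch.zeros_like(v[..., 0])
    K = torch.stack([
        torch.stack([zero, -v[..., 2], v[..., 1]], dim=-1),
        torch.stack([v[..., 2], zero, -v[..., 0]], dim=-1),
        torch.stack([-v[..., 1], v[..., 0], zero], dim=-1),
    ], dim=-2)
    return torch.linalg.matrix_exp(K)

class TrajectoryDecoder(nn.Module):
    def __init__(self, feat_dim=512, hidden=512):
        super().__init__()
        # Uni-directional GRU: each step sees past frames only,
        # which is what makes online decoding possible.
        self.rnn = nn.GRU(feat_dim + 3, hidden, batch_first=True)
        # Per frame: root rotation delta (3), root velocity (3), foot contact (4).
        self.head = nn.Linear(hidden, 10)

    def forward(self, motion_feats, cam_ang_vel):
        # motion_feats: (B, T, feat_dim); cam_ang_vel: (B, T, 3) from SLAM/gyro.
        h, _ = self.rnn(torch.cat([motion_feats, cam_ang_vel], dim=-1))
        out = self.head(h)
        rot_delta, root_vel = out[..., :3], out[..., 3:6]
        contact = out[..., 6:].sigmoid()  # usable downstream to refine the trajectory
        return rot_delta, root_vel, contact

def rollout(rot_delta, root_vel):
    # Integrate per-frame increments into a world-frame root trajectory.
    B, T, _ = root_vel.shape
    R, pos, traj = torch.eye(3).expand(B, 3, 3), torch.zeros(B, 3), []
    for t in range(T):
        R = R @ so3_exp(rot_delta[:, t])                   # accumulate heading
        pos = pos + (R @ root_vel[:, t, :, None])[..., 0]  # step in world frame
        traj.append(pos)
    return torch.stack(traj, dim=1)  # (B, T, 3)
```

The uni-directional recurrence is the key design choice here: because no future frame is ever read, the same decoder supports online inference.
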
TRAM
- Uses the scene background to recover camera motion and derive the metric scale of the human trajectory.
- Global Trajectory & Local Pose: composes the scaled camera trajectory with the camera-frame human motion to obtain the global trajectory, while a video transformer regresses the local body pose (see the sketch below).
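The scale idea reduces to one rigid-body composition per frame. Below is a NumPy sketch of that step; all variable names and the scalar `scale` are illustrative assumptions, not TRAM's actual interface.

```python
# World-frame human pose = (scaled) camera-to-world pose composed with the
# camera-frame human pose. Illustrative sketch, not TRAM's code.
import numpy as np

def world_human_trajectory(R_wc, t_wc, scale, R_ch, t_ch):
    # R_wc, t_wc: (T, 3, 3), (T, 3) camera-to-world poses from SLAM; the SLAM
    #             translation is up-to-scale, and `scale` (derived from the
    #             scene background) makes it metric.
    # R_ch, t_ch: (T, 3, 3), (T, 3) human root pose in the camera frame,
    #             e.g., from a per-frame mesh-recovery network.
    R_wh = R_wc @ R_ch                                   # batched rotation compose
    t_wh = (R_wc @ t_ch[..., None])[..., 0] + scale * t_wc
    return R_wh, t_wh                                    # world-frame human poses
```
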

MotionStreamer
- Streaming Inputs + Online Outputs.
- Models a temporally causal dependency between current and historical motion latents for more accurate online decoding (sketched below).
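The dependency structure can be illustrated with a standard causal attention mask. The sketch below is a generic causal transformer over motion latents, standing in for (not reproducing) MotionStreamer's diffusion-based autoregressive model.

```python
# Generic causal-attention sketch: latent z_t may attend only to z_1..z_t,
# so decoding never waits for future frames.
import torch.nn as nn

class CausalLatentDecoder(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)

    def forward(self, latents):
        # latents: (B, T, dim) motion latents in arrival order.
        mask = nn.Transformer.generate_square_subsequent_mask(latents.size(1))
        return self.blocks(latents, mask=mask)  # causal: no future attention
```
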

References
Shin S, Kim J, Halilaj E, et al. WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 2070-2080.
Wang Y, Wang Z, Liu L, et al. TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024: 467-487.
Xiao L, Lu S, Pi H, et al. MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space[J]. arXiv preprint arXiv:2503.15451, 2025.
Methodology
- Short temporal windows.
- Spatial-temporal modulation.
- Reconstruction, estimation, and prediction (see the sketch below).
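As a sketch of how these pieces are meant to fit together at inference time (our planned design; the model interface and window size are placeholders, not a finished implementation):

```python
# Online short-window loop: keep a small buffer of past frames, process it
# jointly, and emit the mesh for the newest frame only.
from collections import deque

WINDOW = 8  # short temporal window -> fast recovery after camera view shifts

def online_mesh_stream(frames, model):
    buffer = deque(maxlen=WINDOW)  # past frames only: supports online inference
    for frame in frames:
        buffer.append(frame)
        # The model reconstructs past frames, estimates the current one,
        # and may predict ahead; only the current estimate is emitted.
        meshes = model(list(buffer))
        yield meshes[len(buffer) - 1]
```
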
