Proposed Solution

1. Spatially consistent 

  • Our proposed solution begins with a fundamental question: how can we ensure spatial consistency across frames?
  • When processing video, it’s not enough to predict accurate poses in each frame independently. We need the entire motion to stay coherent in space, without drifting, sudden jumps, or misaligned directions.
  • To achieve this, we propose accumulating frame-to-frame transformations so that every per-frame prediction is expressed in a shared world coordinate system. As the person moves, their global trajectory then remains smooth and stable (a minimal sketch follows this list).
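
Below is a minimal sketch of this idea, assuming the relative rotations and translations between consecutive frames come from an upstream module such as visual odometry or SLAM; the function and variable names are illustrative, not part of any specific library.

```python
import numpy as np

def accumulate_world_poses(R_rel, t_rel):
    """Chain relative camera transforms into per-frame world poses.

    R_rel: (N, 3, 3) rotation of frame k+1 expressed in frame k
    t_rel: (N, 3)    translation of frame k+1 expressed in frame k
    Returns (N+1, 4, 4) world-from-frame transforms; frame 0 defines the world.
    """
    world = [np.eye(4)]
    for R, t in zip(R_rel, t_rel):
        T = np.eye(4)                  # homogeneous frame-to-frame transform
        T[:3, :3] = R
        T[:3, 3] = t
        world.append(world[-1] @ T)    # compose with the previous world pose
    return np.stack(world)

def to_world(points_cam, T_world_from_cam):
    """Map camera-frame joints/vertices (M, 3) into the shared world frame."""
    homo = np.concatenate([points_cam, np.ones((len(points_cam), 1))], axis=1)
    return (T_world_from_cam @ homo.T).T[:, :3]
```

Expressing every frame in one world frame is what keeps per-frame predictions from drifting or jumping relative to each other.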

2. Gravity-aligned

  • Our prior survey suggests that aligning the predictions with the gravity direction is an effective way to enforce this spatial consistency, since it promotes both stability and coherence in the results.
  • In many monocular human mesh recovery systems, the predicted motion can appear unnaturally rotated or floating in space, especially when the camera moves.
  • Therefore, we will use a gravity-view coordinate system, which aligns its vertical axis with the gravity direction so that the reconstructed motion remains grounded and physically meaningful regardless of how the camera moves (a minimal sketch follows this list).
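
A minimal sketch of such a frame is given below, under assumed conventions: the y-axis points opposite to gravity and the z-axis is the camera's viewing direction projected onto the horizontal plane. The gravity direction itself would come from an IMU or a learned estimator; the names are illustrative, not part of any specific implementation.

```python
import numpy as np

def gravity_view_rotation(gravity_cam, cam_forward=np.array([0.0, 0.0, 1.0])):
    """Rotation that expresses camera-frame points in a gravity-view frame.

    gravity_cam: (3,) gravity direction in camera coordinates
                 (e.g., from an IMU or a learned gravity estimator).
    """
    up = -gravity_cam / np.linalg.norm(gravity_cam)   # y-axis: opposite to gravity
    # z-axis: camera forward projected onto the horizontal plane, so the
    # frame's yaw follows where the camera is looking.
    fwd = cam_forward - np.dot(cam_forward, up) * up
    fwd = fwd / np.linalg.norm(fwd)                   # degenerate if looking straight down
    right = np.cross(up, fwd)                         # x-axis = y cross z
    # Rows are the gravity-view basis vectors written in camera coordinates,
    # so R @ p_cam gives p_cam expressed in the gravity-view frame.
    return np.stack([right, up, fwd])

# Usage: joints_gv = (gravity_view_rotation(g_cam) @ joints_cam.T).T
```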

3. Perspective-aware 

  • To achieve gravity alignment, we observe that the reconstruction should be minimally affected by perspective distortion in the image. This highlights the importance of being perspective-aware.
  • In many existing pipelines, the camera is either ignored or approximated with a weak-perspective model. This often leads to distorted body shapes or inaccurate trajectories, especially when the camera moves significantly.
  • We achieve this by decoupling human and camera motion and explicitly estimating perspective camera parameters, as sketched below.
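
The sketch below contrasts the weak-perspective approximation with a full perspective camera, assuming the focal length f and principal point c are estimated explicitly; the function and parameter names are illustrative.

```python
import numpy as np

def weak_perspective(joints_cam, scale, trans_xy):
    """Weak perspective: one shared scale, no depth-dependent foreshortening."""
    return scale * joints_cam[:, :2] + trans_xy

def full_perspective(joints_cam, f, c):
    """Full perspective: each joint is divided by its own depth, so
    foreshortening depends on that joint's actual distance to the camera."""
    x = joints_cam[:, 0] / joints_cam[:, 2]
    y = joints_cam[:, 1] / joints_cam[:, 2]
    return np.stack([f * x + c[0], f * y + c[1]], axis=1)
```

Because the full model uses each joint's true depth, body parts close to the camera are no longer flattened or stretched, which is what keeps the recovered shape and trajectory consistent when the camera itself moves.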