Given a monocular video as input, how can we recover a 3D human mesh that is spatially consistent, gravity-aligned, and perspective-aware in the world coordinate system?

- Our solution begins with a fundamental question: how can we ensure spatial consistency across frames?
- To ensure spatial consistency, our survey suggests aligning predictions with gravity, which promotes both stability and coherence in the results.
- To achieve gravity alignment, we observe that the image should be minimally affected by perspective distortion, which highlights the importance of being perspective-aware.
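
To make the notion of gravity alignment concrete, the following is a minimal sketch and not our actual pipeline. It assumes a gravity direction has already been estimated in the camera frame and that the world frame is y-up, so "down" is `(0, -1, 0)`. The names `rotation_aligning`, `apply`, `gravity_cam`, and `world_down` are hypothetical and chosen for illustration only. The sketch builds the rotation that maps the estimated gravity direction onto the world "down" axis using Rodrigues' formula; the same rotation could then be applied to camera-frame mesh vertices.

```python
import math

def _unit(v):
    # Normalize a 3-vector; reject the zero vector, which has no direction.
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return [x / n for x in v]

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rotation_aligning(src, dst):
    """Return a 3x3 rotation R (nested lists) such that R @ unit(src) = unit(dst)."""
    a, b = _unit(src), _unit(dst)
    v = _cross(a, b)   # rotation axis scaled by sin(theta)
    c = _dot(a, b)     # cos(theta)
    s2 = _dot(v, v)    # sin(theta)^2
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    if s2 < 1e-12:
        if c > 0.0:
            return eye  # already aligned
        # Opposite directions: rotate 180 degrees about any axis perpendicular to a.
        i = min(range(3), key=lambda k: abs(a[k]))
        e = [1.0 if k == i else 0.0 for k in range(3)]
        u = _unit(_cross(a, e))
        return [[2.0 * u[r] * u[col] - eye[r][col] for col in range(3)]
                for r in range(3)]
    # Rodrigues' formula: R = I + K + K^2 / (1 + c), with K^2 = v v^T - |v|^2 I.
    K = [[0.0, -v[2], v[1]],
         [v[2], 0.0, -v[0]],
         [-v[1], v[0], 0.0]]
    k = 1.0 / (1.0 + c)
    return [[eye[r][col] + K[r][col] + k * (v[r] * v[col] - s2 * eye[r][col])
             for col in range(3)] for r in range(3)]

def apply(R, p):
    # Rotate a single 3D point or direction.
    return [_dot(row, p) for row in R]

# Hypothetical inputs: gravity as seen by a tilted camera, and a y-up world frame.
gravity_cam = [0.0, 0.8, 0.6]
world_down = [0.0, -1.0, 0.0]
R = rotation_aligning(gravity_cam, world_down)
aligned = apply(R, gravity_cam)
```

Aligning a single direction leaves the rotation about the gravity axis (the heading) unconstrained, so this sketch fixes only which way is down in the world frame.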
