Problem Statement

Given a monocular video as input, how can we recover a 3D human mesh that is spatially consistent, gravity-aligned, and perspective-aware in the world coordinate system?

  • Our proposed solution begins with a fundamental question: how can we ensure spatial consistency across frames?
  • To ensure spatial consistency, our survey suggests aligning predictions with gravity, which promotes both stability and coherence in the results.
  • We observe that achieving gravity alignment requires the image to be minimally affected by perspective distortion, which highlights the importance of being perspective-aware.
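
To make "gravity-aligned" concrete, the following is a minimal sketch and not the proposed method. It assumes that a gravity direction in the camera frame is already available (the value of `g_cam` here is made up for illustration) and that the world frame is Y-up, so gravity points along -Y. The helper `rotation_aligning` is a hypothetical name; it builds the rotation with Rodrigues' formula and applies it to a few placeholder mesh vertices.

```python
import numpy as np

def rotation_aligning(src, dst):
    """Return a 3x3 rotation R such that R @ src points along dst."""
    a = np.asarray(src, dtype=float)
    b = np.asarray(dst, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    c = float(np.dot(a, b))          # cosine of the angle between a and b
    if np.isclose(c, -1.0):
        # Antiparallel vectors: rotate 180 degrees about any axis perpendicular to a.
        helper = np.array([1.0, 0.0, 0.0]) if abs(a[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
        axis = np.cross(a, helper)
        axis /= np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    v = np.cross(a, b)               # rotation axis scaled by sin(angle)
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    # Rodrigues' formula written in terms of the cross-product matrix K.
    return np.eye(3) + K + (K @ K) / (1.0 + c)

# Assumed inputs: an illustrative gravity direction in the camera frame and
# a Y-up world convention in which gravity points along -Y.
g_cam = np.array([0.1, 0.95, 0.2])
g_world = np.array([0.0, -1.0, 0.0])

R = rotation_aligning(g_cam, g_world)

# Placeholder vertices of shape (N, 3) in the camera frame.
verts_cam = np.array([[0.0, 0.0, 2.0],
                      [0.1, 0.5, 2.1],
                      [-0.1, 1.0, 2.0]])
verts_world = verts_cam @ R.T        # rotate every vertex into the gravity-aligned frame
```

Aligning gravity fixes only two of the three rotational degrees of freedom: the heading about the vertical axis and the translation remain undetermined and would have to come from elsewhere.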