Problem Statement

Given a monocular video as input, how can we recover a 3D human mesh that is spatially consistent, gravity-aligned, and perspective-aware in the world coordinate system?