Problem Statement

Given a monocular video as input, how can we recover a 3D human mesh that is spatially consistent in the world coordinate system?
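
To make "consistent in the world coordinate system" concrete, here is a minimal sketch, not a method from this document. It assumes each frame has a known camera-to-world pose (rotation `R_wc`, translation `t_wc`) and uses the standard rigid transform x_w = R_wc x_c + t_wc. All names (`rot_y`, `cam_to_world`, `world_to_cam`, the example poses) are hypothetical and chosen only for illustration. A static point on the body has different camera-frame coordinates as the camera moves, but should map to the same world-frame coordinates in every frame.

```python
import math

def rot_y(theta):
    """3x3 rotation about the y-axis by theta radians (illustrative helper)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def cam_to_world(vertices_cam, R_wc, t_wc):
    """Map camera-frame vertices to the world frame: x_w = R_wc @ x_c + t_wc."""
    return [
        tuple(sum(R_wc[i][j] * v[j] for j in range(3)) + t_wc[i] for i in range(3))
        for v in vertices_cam
    ]

def world_to_cam(vertices_world, R_wc, t_wc):
    """Inverse transform: x_c = R_wc^T @ (x_w - t_wc)."""
    out = []
    for v in vertices_world:
        d = [v[j] - t_wc[j] for j in range(3)]
        out.append(tuple(sum(R_wc[j][i] * d[j] for j in range(3)) for i in range(3)))
    return out

# One static mesh vertex in the world, observed from two assumed camera poses.
world_point = (0.0, 0.0, 3.0)
poses = [
    (rot_y(0.0), (0.0, 0.0, 0.0)),   # frame 0: camera at the world origin
    (rot_y(0.3), (0.5, 0.0, 0.2)),   # frame 1: camera has rotated and moved
]

cam_per_frame = []        # what a per-frame, camera-centric estimate would see
recovered_per_frame = []  # the same vertex after lifting back to world coordinates
for R_wc, t_wc in poses:
    v_cam = world_to_cam([world_point], R_wc, t_wc)[0]
    cam_per_frame.append(v_cam)
    recovered_per_frame.append(cam_to_world([v_cam], R_wc, t_wc)[0])

print("camera-frame coords:", cam_per_frame)
print("world-frame coords: ", recovered_per_frame)
```

In this toy setup the camera poses are given. With real monocular video they are unknown, which is part of what makes the problem hard.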