Problem Statement - World-Grounded Human Mesh Recovery from Video

Given a monocular video as input, how can we recover a 3D human mesh that is spatially consistent in the world coordinate system?
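
To make "spatially consistent in the world coordinate system" concrete, the sketch below shows one way to read it: points expressed in each frame's camera coordinates are mapped into a single world frame via X_world = R * X_cam + t. This is an illustration under assumptions, not a method from this statement. It assumes the per-frame camera pose (R, t) is already known, which is itself part of the difficulty with monocular video, and the names `camera_to_world`, `R_wc`, and `t_wc` are hypothetical.

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def camera_to_world(points_cam: Sequence[Vec3], R_wc: Mat3, t_wc: Vec3) -> List[List[float]]:
    """Map camera-frame points to the world frame: X_world = R_wc * X_cam + t_wc.

    R_wc and t_wc are the (assumed known) camera-to-world rotation and
    translation for one video frame. The names are hypothetical.
    """
    world = []
    for p in points_cam:
        world.append(
            [sum(R_wc[i][k] * p[k] for k in range(3)) + t_wc[i] for i in range(3)]
        )
    return world


# Toy example: one mesh vertex of a person standing still, seen by a camera
# that translates 1 unit along x between frame 0 and frame 1.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

vertex_cam_frame0 = [[0.0, 0.0, 3.0]]   # camera at the world origin
vertex_cam_frame1 = [[-1.0, 0.0, 3.0]]  # camera moved to x = 1

world0 = camera_to_world(vertex_cam_frame0, identity, [0.0, 0.0, 0.0])
world1 = camera_to_world(vertex_cam_frame1, identity, [1.0, 0.0, 0.0])
```

In this toy case the vertex has different camera-frame coordinates in the two frames because the camera moved, yet both map to the same world position. A recovery that is only correct per frame in camera coordinates would not, by itself, distinguish camera motion from human motion.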