Related Work
Recent advances in 4D reconstruction and motion transfer have laid the groundwork for learning from monocular videos in robotic applications. We highlight the papers that most directly influenced our work.
SLoMo [3] introduced a three-stage pipeline that enables legged robots to imitate human and animal motions from casual monocular videos. By reconstructing keypoint trajectories and optimizing them into dynamically feasible reference motions, SLoMo demonstrated that in-the-wild movements can be transferred to robotic platforms.
DualPM [1] proposed a dual point map representation that predicts both posed and canonical 3D point maps from a single image. This approach facilitates the estimation of object deformation fields and supports amodal reconstruction, effectively handling occlusions and enabling generalization from synthetic to real-world images. DualPM's ability to model detailed deformations has inspired our use of similar representations to capture the nuances of animal and human motions.
3D-Fauna [2] developed a pan-category deformable 3D model capable of reconstructing over 100 animal species from single-view internet images. By introducing the Semantic Bank of Skinned Models (SBSM), it leverages semantic priors to improve generalization, particularly for rare species. This work underscores the potential of learning diverse 3D structures from unstructured 2D data, informing our strategy of utilizing web-scale video data for robotic learning.
Collectively, these works demonstrate the potential of enabling robots to learn from unstructured visual data. Our project draws on their methodologies to develop a framework that bridges the gap between rich visual observations and actionable robotic behaviors.
References
[1] Ben Kaye, Tomas Jakab, Shangzhe Wu, Christian Rupprecht, and Andrea Vedaldi. 2024. DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose Reconstruction. arXiv:2412.04464 [cs.CV] https://arxiv.org/abs/2412.04464
[2] Zizhang Li, Dor Litvak, Ruining Li, Yunzhi Zhang, Tomas Jakab, Christian Rupprecht, Shangzhe Wu, Andrea Vedaldi, and Jiajun Wu. 2024. Learning the 3D Fauna of the Web. arXiv:2401.02400 [cs.CV] https://arxiv.org/abs/2401.02400
[3] John Z. Zhang, Shuo Yang, Gengshan Yang, Arun L. Bishop, Deva Ramanan, and Zachary Manchester. 2023. SLoMo: A General System for Legged Robot Motion Imitation from Casual Videos. arXiv:2304.14389 [cs.RO] https://arxiv.org/abs/2304.14389
