Introduction

Problem Formulation


  • Goal: Text-driven human motion synthesis
  • Input: Text specifying the human motion 
  • Output: Synthesized 3D human motion matching the input text

Our objective is to develop a model for synthesizing human motion conditioned on text input. The model will take as input a text description of a human motion and produce a corresponding synthesized 3D human motion as output. The model should be able to generate accurate and realistic 3D human motion even for motions it hasn’t encountered during training.

Why Challenging?


Text-driven human motion synthesis is challenging for three reasons: human motion has many degrees of freedom, natural-language descriptions are ambiguous, and natural-looking motion is difficult to achieve.

Dataset: HumanML3D


Statistics: 14,616 motions and 44,970 descriptions composed of 5,371 distinct words. The motions total 28.59 hours in length.
Human Body Representation: Skinned Multi-Person Linear Model (SMPL)
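
To put the dataset statistics in perspective, the short calculation below derives two averages from the reported totals: descriptions per motion and mean motion length. It uses only the figures quoted above; the variable names are illustrative.

```python
# Totals as reported for HumanML3D in the statistics above.
num_motions = 14_616
num_descriptions = 44_970
total_hours = 28.59

# Average number of text descriptions attached to each motion.
descriptions_per_motion = num_descriptions / num_motions

# Average motion length in seconds (hours converted to seconds).
mean_motion_seconds = total_hours * 3600 / num_motions

print(f"{descriptions_per_motion:.2f} descriptions per motion")  # 3.08
print(f"{mean_motion_seconds:.2f} s per motion on average")      # 7.04
```

On average, each motion therefore has about three descriptions and lasts roughly seven seconds.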

Our Contribution


  1. We present a novel text-driven framework for synthesizing high-fidelity 3D human motion.
  2. By using a large language model to paraphrase text descriptions at both coarse and fine granularity, our framework generalizes to novel text descriptions.
  3. By enforcing physical constraints in an end-to-end differentiable manner, our framework synthesizes more natural human motion.
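
The outline does not say which physical constraint is enforced. As one illustration only, the sketch below computes a foot-skating penalty, a common example of a physically motivated loss: a foot that is close to the ground should not slide horizontally. The function name, the y-up axis convention, and the parameters `fps`, `contact_height`, and `sharpness` are all assumptions made for this example, not part of the proposed framework. The contact test uses a smooth sigmoid instead of a hard threshold, so the same expression would stay differentiable if written in an autodiff framework; plain Python is used here only to show the forward computation.

```python
import math


def _sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def foot_skating_penalty(foot_positions, fps=20.0, contact_height=0.05, sharpness=50.0):
    """Hypothetical penalty on horizontal foot motion while the foot is near the ground.

    foot_positions: list of (x, y, z) tuples, one per frame, for a single foot
    joint, in metres with y pointing up (an assumed convention).
    """
    if len(foot_positions) < 2:
        return 0.0
    total = 0.0
    for (x0, y0, z0), (x1, _y1, z1) in zip(foot_positions, foot_positions[1:]):
        # Horizontal velocity between consecutive frames, in m/s.
        vx = (x1 - x0) * fps
        vz = (z1 - z0) * fps
        # Soft contact weight in (0, 1): close to 1 when the foot is below contact_height.
        contact = _sigmoid(sharpness * (contact_height - y0))
        total += contact * (vx * vx + vz * vz)
    # Mean over frame transitions.
    return total / (len(foot_positions) - 1)


planted = foot_skating_penalty([(0.0, 0.0, 0.0)] * 5)
sliding = foot_skating_penalty([(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)])
airborne = foot_skating_penalty([(0.0, 1.0, 0.0), (0.1, 1.0, 0.0)])
```

A planted foot incurs no penalty, a foot sliding along the ground incurs a large one, and the same horizontal motion with the foot well above the ground is almost free. A term of this kind would typically be added to the training objective with a weighting coefficient.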