Dataset:
During pretraining, we use “high-level” ground truth texts from the HumanML3D dataset and “low-level” ground truth text labels from the BABEL dataset.
HumanML3D
![](https://mscvprojects.ri.cmu.edu/f23team11/wp-content/uploads/sites/88/2023/05/humanml3d-1024x360.png)
- Statistics: 14,616 motions paired with 44,970 descriptions drawn from a vocabulary of 5,371 distinct words. The motions total 28.59 hours. This dataset provides the “high-level” conditioning in our model.
BABEL
- Contains 28k sequences with per-frame action labels, for a total of 63k frame labels. This dataset provides the “low-level” conditioning in our model.
- Over 250 unique action categories
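To make the two levels of conditioning concrete, here is a minimal sketch of how a single training sample could pair one HumanML3D-style sequence description with BABEL-style per-segment action labels. All names and the segment boundaries are illustrative assumptions, not the project's actual data-loading code.

```python
# Hypothetical sketch: one motion sample carrying both a "high-level"
# sequence description and "low-level" per-frame-segment action labels.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MotionSample:
    high_level: str                         # whole-sequence description (HumanML3D-style)
    low_level: List[Tuple[int, int, str]]   # (start_frame, end_frame, action label) (BABEL-style)

sample = MotionSample(
    high_level="a man does a push-up and then uses his arms to balance "
               "himself back to his feet",
    low_level=[(0, 60, "push-up"), (60, 120, "stand up")],  # illustrative segments
)

def low_level_label_at(sample: MotionSample, frame: int) -> str:
    """Return the low-level action label covering a given frame index."""
    for start, end, label in sample.low_level:
        if start <= frame < end:
            return label
    return "transition"  # fallback for frames outside labelled segments

print(low_level_label_at(sample, 30))  # push-up
```

During training, the model can then be conditioned on the high-level text for the whole sequence and on the per-frame labels within each segment.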
Quantitative Results
Qualitative Results
Comparison with Baseline Model
Text Prompt: “a man does a push-up and then uses his arms to balance himself back to his feet”
Our Result:
![](https://mscvprojects.ri.cmu.edu/f23team11/wp-content/uploads/sites/88/2023/05/mdm_results_1-1024x287.png)
Baseline:
![](https://mscvprojects.ri.cmu.edu/f23team11/wp-content/uploads/sites/88/2023/05/ours_results_1-1024x208.png)
This text prompt is unseen in the training set and moderately complex. The baseline fails to generate the “push-up” motion correctly. With a Large Language Model further explaining “push-up” and “balance himself” to the model, our framework successfully generates the correct push-up motion and the subsequent transition back to the feet.
Motion Synthesis with Same High-Level, Different Low-Level Text Prompts
![](https://mscvprojects.ri.cmu.edu/f23team11/wp-content/uploads/sites/88/2023/12/image-15-1024x364.png)
Additional Qualitative Results
![](https://mscvprojects.ri.cmu.edu/f23team11/wp-content/uploads/sites/88/2023/12/image-14.png)
GPT-3 Prompting Examples
![](https://mscvprojects.ri.cmu.edu/f23team11/wp-content/uploads/sites/88/2023/12/image-22-1024x439.png)
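The prompting step above can be sketched as a simple prompt-construction function that asks GPT-3 to decompose a high-level motion description into ordered low-level actions. The wording, the few-shot example, and the function name are illustrative assumptions; the actual prompts used are those shown in the figure.

```python
# Hypothetical sketch of the GPT-3 prompting step: build a few-shot prompt
# that asks the LLM to expand a high-level description into low-level actions.
def build_decomposition_prompt(high_level: str) -> str:
    """Return a text prompt requesting an ordered list of simple body actions."""
    instruction = (
        "Decompose the following motion description into a short, ordered "
        "list of simple body actions, one per line."
    )
    # One illustrative few-shot example to fix the output format.
    example = (
        "Description: a person jumps over an obstacle and lands\n"
        "Actions:\n1. run forward\n2. jump\n3. land\n"
    )
    return f"{instruction}\n\n{example}\nDescription: {high_level}\nActions:\n"

prompt = build_decomposition_prompt(
    "a man does a push-up and then uses his arms to balance himself back to his feet"
)
print(prompt)
```

The resulting string would then be sent to the language model, and the returned numbered actions would serve as the low-level conditioning text for motion synthesis.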