Project Summary

Model Pipeline:

The figure illustrates a two-stage framework for real-time co-speech video generation.

The first stage, Audio2Motion, takes audio as input and uses a Wav2Vec-based encoder combined with self-attention, cross-attention, and feed-forward layers to generate a sequence of human poses synchronized with the speech.

The second stage, Motion2Video, uses the predicted pose sequence, together with a reference image and injected noise, to guide a distilled student model for fast video synthesis. The model is conditioned on both VAE and CLIP features to preserve visual fidelity and identity consistency.

By separating motion prediction from video generation and integrating multi-modal information, the framework generates realistic, personalized co-speech videos in a computationally efficient manner.
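
The following is a minimal sketch of how the two stages might be wired together, assuming a PyTorch implementation. The module names, layer counts, feature dimensions, and pose representation are illustrative assumptions rather than the project's actual configuration; the pretrained Wav2Vec encoder is stubbed with a linear projection, and the distilled Motion2Video student is reduced to its conditioning interface (poses, reference-image VAE/CLIP features, noise).

```python
import torch
import torch.nn as nn


class Audio2Motion(nn.Module):
    """Stage 1 (sketch): map frame-level speech features to a synchronized pose sequence."""

    def __init__(self, audio_dim=768, d_model=512, n_heads=8, pose_dim=134, n_layers=4):
        super().__init__()
        # Stand-in for a pretrained Wav2Vec-based encoder (assumed frozen in practice).
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.self_attn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers)]
        )
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers)]
        )
        self.ffn = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
             for _ in range(n_layers)]
        )
        # Learned motion query, broadcast over time, attending into the audio context.
        self.motion_query = nn.Parameter(torch.randn(1, 1, d_model))
        self.to_pose = nn.Linear(d_model, pose_dim)

    def forward(self, audio_feats):
        # audio_feats: (batch, T, audio_dim) frame-level Wav2Vec features.
        ctx = self.audio_proj(audio_feats)
        x = self.motion_query.expand(ctx.size(0), ctx.size(1), -1)
        for sa, ca, ff in zip(self.self_attn, self.cross_attn, self.ffn):
            x = x + sa(x, x, x)[0]       # self-attention over motion tokens
            x = x + ca(x, ctx, ctx)[0]   # cross-attention into the audio context
            x = x + ff(x)                # feed-forward refinement
        return self.to_pose(x)           # (batch, T, pose_dim) pose sequence


class Motion2Video(nn.Module):
    """Stage 2 (interface only): distilled student that synthesizes video frames
    from the pose sequence, reference-image VAE/CLIP features, and injected noise."""

    def forward(self, poses, ref_vae_latent, ref_clip_feat, noise):
        raise NotImplementedError("placeholder for the distilled video generator")


if __name__ == "__main__":
    a2m = Audio2Motion()
    audio_feats = torch.randn(1, 120, 768)   # roughly a few seconds of Wav2Vec features
    poses = a2m(audio_feats)
    print(poses.shape)                       # torch.Size([1, 120, 134])
```

In this sketch, the pose sequence produced by Audio2Motion is the only link between the two stages, which mirrors the paper's separation of motion prediction from video generation: the student model never sees raw audio, only poses plus the reference-image conditioning.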