Pipeline
The figure illustrates a two-stage framework for real-time co-speech video generation. The first stage, Audio2Motion, takes audio as input and uses a Wav2Vec-based encoder combined with self-attention, cross-attention, and feed-forward layers to generate a sequence of human poses synchronized with the speech. The second stage, Motion2Video, uses the predicted pose sequence, together with a reference image and injected noise, to guide a distilled student model for fast video synthesis; this model is conditioned on both VAE and CLIP features to preserve visual fidelity and identity consistency. By separating motion prediction from video generation and integrating multi-modal information, the framework generates realistic and personalized co-speech videos in a computationally efficient manner.
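As a concrete illustration, the sketch below instantiates the two stages as PyTorch modules. The class names, layer sizes, pose query inputs, and the placeholder convolutional denoiser are hypothetical stand-ins; in the described system the audio encoder is Wav2Vec-based and the student is a distilled video generator, so this should be read as a structural sketch rather than the actual implementation.

```python
# Structural sketch of the two-stage pipeline (module and argument names are illustrative).
import torch
import torch.nn as nn

class Audio2Motion(nn.Module):
    """Stage 1: audio features -> pose sequence via self-attn, cross-attn, and FFN blocks."""
    def __init__(self, audio_dim=768, d_model=256, n_keypoints=18, n_layers=4):
        super().__init__()
        # Stand-in projection for the Wav2Vec-based encoder output.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, dim_feedforward=1024,
                                           batch_first=True)  # self-attn + cross-attn + FFN
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.pose_head = nn.Linear(d_model, n_keypoints * 2)  # 2-D keypoints per frame

    def forward(self, audio_feats, pose_queries):
        # audio_feats: (B, T_audio, audio_dim); pose_queries: (B, T_frames, d_model)
        memory = self.audio_proj(audio_feats)
        hidden = self.decoder(pose_queries, memory)            # cross-attends to speech features
        return self.pose_head(hidden)                          # (B, T_frames, n_keypoints * 2)

class Motion2Video(nn.Module):
    """Stage 2: distilled student conditioned on poses, reference VAE latent, and CLIP features."""
    def __init__(self, latent_ch=4, pose_dim=36, clip_dim=512):
        super().__init__()
        self.pose_proj = nn.Linear(pose_dim, latent_ch)
        self.clip_proj = nn.Linear(clip_dim, latent_ch)
        # Placeholder for the student denoising network.
        self.denoiser = nn.Conv3d(latent_ch, latent_ch, kernel_size=3, padding=1)

    def forward(self, noise, poses, ref_vae_latent, ref_clip_feat):
        # noise: (B, C, T, H, W); poses: (B, T, pose_dim)
        # ref_vae_latent: (B, C, H, W); ref_clip_feat: (B, clip_dim)
        pose_cond = self.pose_proj(poses).permute(0, 2, 1)[..., None, None]   # (B, C, T, 1, 1)
        clip_cond = self.clip_proj(ref_clip_feat)[:, :, None, None, None]     # (B, C, 1, 1, 1)
        x = noise + pose_cond + clip_cond + ref_vae_latent[:, :, None]        # broadcast conditioning
        return self.denoiser(x)                                               # video latent prediction

# Usage sketch: poses = Audio2Motion(...)(audio, queries); frames = Motion2Video(...)(noise, poses, vae, clip)
```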

Input-Aware Global Attention
To ensure that our temporal attention mechanism focuses on relevant past information while maintaining causality, we introduce an input-aware global attention mask, denoted M_global. This mask guides the attention computation by identifying, for each current frame t_q, a relevant subset of historical frames.
For each frame t in {1, …, T}, we represent its pose using B upper-body keypoints, denoted as P_t in R^{B×2}. To measure the pose similarity between the current frame t_q and a previous frame t_k < t_q, we apply a global transformation matrix tau to P_{t_k} to compensate for the subject’s overall motion. The similarity S(t_q, t_k) is then computed as the minimum alignment error after the transformation.
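A minimal sketch of how M_global could be built from this similarity is shown below. It assumes tau is a 2-D similarity transform fitted in closed form (Procrustes alignment) and that the top_k most similar past frames are retained per query frame; both the transform class and the selection rule are illustrative assumptions, not the method's exact formulation.

```python
# Sketch: pose-similarity-driven global mask (tau approximated by a least-squares
# similarity transform; top-k retention is an assumed selection rule).
import numpy as np

def alignment_error(P_q, P_k):
    """S(t_q, t_k): residual after aligning P_k to P_q with a fitted similarity transform."""
    Pq = P_q - P_q.mean(axis=0)                 # remove translation
    Pk = P_k - P_k.mean(axis=0)
    U, S, Vt = np.linalg.svd(Pk.T @ Pq)         # orthogonal Procrustes solution
    R = U @ Vt                                  # rotation aligning P_k to P_q
    scale = S.sum() / (Pk ** 2).sum()           # optimal isotropic scale
    aligned = scale * (Pk @ R)
    return np.linalg.norm(aligned - Pq)         # lower error = more similar pose

def global_mask(poses, top_k=8):
    """poses: (T, B, 2) keypoints per frame -> boolean causal mask M_global of shape (T, T)."""
    T = poses.shape[0]
    M = np.zeros((T, T), dtype=bool)
    for t_q in range(T):
        errors = [(alignment_error(poses[t_q], poses[t_k]), t_k) for t_k in range(t_q)]
        keep = [t_k for _, t_k in sorted(errors)[:top_k]]    # most similar past frames only
        M[t_q, keep] = True
        M[t_q, t_q] = True                                   # always attend to the current frame
    return M
```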
Input-Aware Local Attention
To further enforce local consistency and direct temporal attention toward the relevant body parts of the human subject, we introduce an input-aware local mask, denoted M_local. This mask partitions tokens into coherent local regions defined by keypoint locations estimated using the rigid moving least squares transformation (Schaefer et al., 2006). This formulation allows dense attention within each frame while constraining inter-frame attention to correspondences between homologous local regions.
Let R = {faces, hands, arms, bodies, shoulders} be the set of local regions. Each region r in R corresponds to a fixed subset of token indices I_r ⊆ {1, …, N}.
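The sketch below shows one way such a mask could be assembled from the region index sets I_r: attention is dense among tokens of the same frame and restricted to same-region tokens across frames. The concrete index sets in the example and the flat token layout (N tokens per frame, frames concatenated) are assumptions for illustration; in the method the sets are derived from keypoints via the rigid moving least squares transformation.

```python
# Sketch: region-restricted local mask M_local (index sets and token layout are illustrative).
import numpy as np

def local_mask(region_indices, n_tokens_per_frame, n_frames):
    """
    region_indices: dict mapping region name -> token indices in {0, ..., N-1} within a frame.
    Returns a boolean mask of shape (T*N, T*N): dense within each frame,
    cross-frame attention allowed only between tokens of the same region.
    """
    N, T = n_tokens_per_frame, n_frames
    # Token -> region id; tokens outside every region get -1 and only attend within their frame.
    region_id = np.full(N, -1, dtype=int)
    for rid, idx in enumerate(region_indices.values()):
        region_id[list(idx)] = rid

    same_region = (region_id[:, None] == region_id[None, :]) & (region_id[:, None] >= 0)
    M = np.zeros((T * N, T * N), dtype=bool)
    for t_q in range(T):
        q = slice(t_q * N, (t_q + 1) * N)
        for t_k in range(T):
            k = slice(t_k * N, (t_k + 1) * N)
            if t_q == t_k:
                M[q, k] = True           # dense attention within the frame
            else:
                M[q, k] = same_region    # cross-frame: homologous regions only
    return M

# Example with hypothetical index sets for R = {faces, hands, arms, bodies, shoulders}:
# regions = {"faces": range(0, 16), "hands": range(16, 32), "arms": range(32, 44),
#            "bodies": range(44, 56), "shoulders": range(56, 64)}
# M_local = local_mask(regions, n_tokens_per_frame=64, n_frames=4)
```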

