Comparisons
As shown, our method achieves approximately 3× faster inference compared to existing audio-driven and pose-driven baselines. In addition to speed, our approach produces higher-quality and more realistic results. Compared with audio-driven methods, our model not only maintains high generation quality but also substantially improves Sync-C and HKC. In particular, the lip synchronization confidence increases from 4.36 to 7.26. Moreover, compared with S2G-MDD, our method improves HKC from 0.956 to 0.968 on the test set.
Compared with pose-driven methods, our approach outperforms all baselines in both lip synchronization and overall motion quality. Notably, compared with its teacher model MimicMotion, our student model achieves a 13.1× inference speed-up without sacrificing generation quality, and further improves motion and synchronization metrics. Specifically, our method improves HKC from 0.928 to 0.948 and Sync-C from 4.56 to 7.28, demonstrating stronger hand motion confidence and lip synchronization.


Ablation Studies
We conduct an ablation study to analyze the contribution of each component. Using the teacher model as the baseline, we observe strong motion quality and lip synchronization after fine-tuning on the co-speech dataset; however, its inference speed remains a bottleneck at 1.93 FPS. Both input-aware global attention and input-aware local attention preserve generation quality without degradation, but on their own they yield only limited speed improvements. Directly applying DMD distillation provides a substantial speedup but introduces noticeable quality degradation, especially artifacts on faces and hands. By incorporating our input-aware distillation strategy, the model achieves real-time performance at 25.31 FPS while maintaining generation quality comparable to the fine-tuned teacher model.
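To make the attention sparsification concrete, the sketch below builds a simple global-plus-local attention mask and applies it in a dense attention computation. This is a minimal illustration under our own assumptions (the window size, number of global tokens, and dense masking are placeholders); it does not reproduce the paper's input-aware global and local attention, and a practical implementation would rely on block-sparse kernels rather than masking a dense score matrix.

```python
import torch

def sparse_attention_mask(seq_len: int, window: int, num_global: int) -> torch.Tensor:
    """Boolean mask (True = attend) combining windowed local attention with a
    few global tokens; an illustrative stand-in for global/local sparsification."""
    idx = torch.arange(seq_len)
    # Local band: each query attends to keys within +/- `window` positions.
    mask = (idx[:, None] - idx[None, :]).abs() <= window
    # Global tokens (here, simply the first `num_global` positions) attend to
    # every position and are attended to by every query.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    return mask

def masked_attention(q, k, v, mask):
    # Standard scaled dot-product attention with the sparse pattern applied
    # as a -inf bias before the softmax (dense compute, for illustration only).
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: 64 tokens, 8-dim heads, window of 4, and 2 global tokens.
q = k = v = torch.randn(1, 64, 8)
mask = sparse_attention_mask(64, window=4, num_global=2)
out = masked_attention(q, k, v, mask)
print(out.shape)  # torch.Size([1, 64, 8])
```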

We present a detailed breakdown of the inference time across different architectural variants by decomposing the total runtime into four components: Attention, Linear, Norm, and Others. The baseline teacher model requires 103.6 seconds to process an 8-second video. By introducing global attention, we reduce the processing time to 60.9 seconds, primarily due to reductions in attention and linear computation. Adding local attention further decreases the time to 45.2 seconds. Finally, applying distillation reduces the runtime to 7.9 seconds, achieving a 13.1× speedup compared to the teacher model, largely driven by the substantial reduction in attention cost. This demonstrates the effectiveness of our sparse attention strategy in enabling real-time co-speech video generation.
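As a rough illustration of how such a breakdown can be obtained, the sketch below times each leaf module with forward hooks and buckets the result into Attention, Linear, Norm, and Others. The module-to-category mapping and the synchronization-based timing are our own assumptions, not the paper's profiling setup. Given such a breakdown, the reported speedup follows directly from the totals: 103.6 s / 7.9 s ≈ 13.1×.

```python
import time
from collections import defaultdict

import torch
import torch.nn as nn

# Hypothetical mapping from module classes to the four reported categories;
# the actual model's layer types would be substituted here.
CATEGORIES = {
    nn.MultiheadAttention: "Attention",
    nn.Linear: "Linear",
    nn.LayerNorm: "Norm",
    nn.GroupNorm: "Norm",
}

def profile_components(model: nn.Module, inputs: torch.Tensor) -> dict:
    """Accumulate per-category forward time over one inference pass."""
    totals = defaultdict(float)
    starts = {}

    def pre_hook(module, _inputs):
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # flush queued kernels for accurate timing
        starts[module] = time.perf_counter()

    def post_hook(module, _inputs, _output):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        cat = CATEGORIES.get(type(module), "Others")
        totals[cat] += time.perf_counter() - starts.pop(module)

    handles = []
    for m in model.modules():
        if len(list(m.children())) == 0:  # time leaf modules only
            handles.append(m.register_forward_pre_hook(pre_hook))
            handles.append(m.register_forward_hook(post_hook))
    with torch.no_grad():
        model(inputs)
    for h in handles:
        h.remove()
    return dict(totals)
```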

