Introduction

We introduce a conditional video distillation method for real-time co-speech video generation that leverages human pose conditioning for input-aware sparse attention and for the distillation loss. Our student model runs at 25.3 FPS, a 13.1× speedup over its teacher, while preserving visual quality. Compared with a leading few-step causal student model, our method substantially improves motion coherence and lip synchronization and reduces the visual degradation commonly seen in the speaker’s face and hands (see yellow box). Example frames are from our curated YouTube Talking Video dataset © TED Conferences, LLC.
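
To make the idea of pose-conditioned, input-aware sparse attention concrete, the following is a minimal sketch of one possible realization, not the paper's implementation: 2D pose keypoints mark salient tokens on a flattened H×W token grid, salient tokens attend densely, and the remaining tokens attend only to the salient set. The function names (`pose_region_mask`, `input_aware_sparse_attention`), the keypoint format, and the specific sparsity pattern are assumptions for illustration.

```python
# Hedged sketch: pose-conditioned sparse attention mask (hypothetical design,
# not the authors' API). Assumes normalized (x, y) keypoints and an H x W
# latent token grid.
import torch
import torch.nn.functional as F


def pose_region_mask(keypoints, grid_h, grid_w, radius=2):
    """Boolean mask over an H x W token grid; True for tokens near any keypoint.

    keypoints: (K, 2) tensor of (x, y) in [0, 1] image coordinates
    (an assumed format; real pose inputs may differ).
    """
    ys = torch.arange(grid_h).view(-1, 1).float()
    xs = torch.arange(grid_w).view(1, -1).float()
    mask = torch.zeros(grid_h, grid_w, dtype=torch.bool)
    for x, y in keypoints:
        cy, cx = y * (grid_h - 1), x * (grid_w - 1)
        mask |= ((ys - cy) ** 2 + (xs - cx) ** 2) <= radius ** 2
    return mask  # (H, W)


def input_aware_sparse_attention(q, k, v, token_mask):
    """One possible sparsity pattern: pose-salient queries see all keys,
    other queries see only the salient keys.

    q, k, v: (B, heads, N, d); token_mask: (N,) bool, True = salient token.
    """
    n = token_mask.numel()
    # Row i of attn_mask lists the keys query i may attend to.
    attn_mask = token_mask.view(1, n).expand(n, n).clone()
    attn_mask[token_mask] = True  # salient queries keep dense attention
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)


# Toy usage: face + hand keypoints (illustrative values only)
B, H, gh, gw, d = 1, 4, 16, 16, 64
kpts = torch.tensor([[0.5, 0.3], [0.35, 0.7], [0.65, 0.7]])
tok_mask = pose_region_mask(kpts, gh, gw).flatten()
q = k = v = torch.randn(B, H, gh * gw, d)
out = input_aware_sparse_attention(q, k, v, tok_mask)  # (B, H, N, d)
```

In this sketch the mask concentrates full attention on the face and hand regions that the paper reports as most prone to degradation, while background tokens take a cheaper, restricted attention path; the actual mask construction and how it interacts with the distillation loss follow the paper, not this example.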

Background

Video generation is a fundamental task in computer vision that aims to synthesize realistic, temporally coherent video sequences from inputs such as text, audio, poses, or static images. It holds significant potential for numerous applications, including content creation, virtual reality, digital humans, and simulation. Despite recent progress, generating long, high-quality, and controllable videos remains highly challenging: existing models often struggle with temporal consistency, motion realism, and computational efficiency, particularly when conditioned on multi-modal inputs such as speech or gestures. Moreover, most approaches focus on offline generation, which limits their usability in real-time or interactive scenarios.