Introduction

Co-speech gesture video generation holds the potential to transform human-computer interaction by enabling virtual agents, avatars, and social robots to communicate in a more natural, expressive, and human-like manner. However, existing methods often suffer from high computational latency, limited interactivity, and poor generalization to unseen speakers or diverse conversational contexts. In particular, they either rely on offline autoregressive models that preclude real-time deployment or fail to synchronize gestures with speech dynamics in a temporally coherent way. To bridge this gap, ChatAnyone aims to deliver real-time, interactive co-speech gesture video generation, enabling users to engage with virtual agents that respond instantly with synchronized facial expressions, body movements, and hand gestures. This capability paves the way for applications in live streaming, virtual meetings, digital companions, and beyond, where responsiveness and realism are both crucial for effective communication.

Background

Video generation is a fundamental task in computer vision that aims to synthesize realistic and temporally coherent video sequences from inputs such as text, audio, poses, or static images. It has significant potential in applications including content creation, virtual reality, digital humans, and simulation. Despite recent progress, generating long, high-quality, and controllable videos remains highly challenging. Existing models often struggle with temporal consistency, motion realism, and computational efficiency, particularly when conditioned on multimodal inputs such as speech or gestures. Moreover, most approaches focus on offline generation, which limits their usability in real-time or interactive scenarios.