Recent advances in video generation have demonstrated impressive visual quality, but current models remain difficult to scale, control, and adapt to real-world use cases. Large video diffusion models are computationally expensive to train and slow to run, while still offering limited controllability over the generated content. Most systems rely on coarse global conditioning and support only text or image inputs, making it challenging to perform precise, temporally consistent video editing or to incorporate richer, time-varying guidance. These gaps motivate the need for a scalable video generation framework that supports fine-grained control and efficient large-scale deployment.
Key challenges:
- Restricted input modalities: Most models support only text and image conditioning, limiting their ability to leverage richer, per-frame or structured control signals.
- High computational cost: State-of-the-art video generation models are large and slow, requiring significant GPU resources for both training and inference.
- Limited controllability: Existing models lack fine-grained, frame-level control, making it difficult to enforce temporal constraints or perform precise video edits.

Goal
Our goal is to build a scalable and controllable video generation system based on VACE, a state-of-the-art video diffusion framework that enables per-frame control signals to be provided as context during generation. By supporting fine-grained, time-varying conditioning such as masks, reference frames, and depth maps, our system aims to unlock more precise and expressive video editing and generation capabilities.
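To make the notion of per-frame, time-varying conditioning concrete, the sketch below shows one way such control signals could be packaged for a clip of `T` frames. This is purely illustrative and not VACE's actual interface; `FrameConditioning` and its tensor shapes are hypothetical placeholders.

```python
# Hypothetical illustration of per-frame control context (not VACE's API):
# each signal is aligned with the T frames of the clip being generated.
from dataclasses import dataclass
import torch

@dataclass
class FrameConditioning:
    """Time-varying control signals for a clip of T frames at H x W resolution."""
    masks: torch.Tensor      # (T, 1, H, W) binary regions to edit in each frame
    reference: torch.Tensor  # (T, 3, H, W) reference frames guiding appearance
    depth: torch.Tensor      # (T, 1, H, W) per-frame depth maps guiding geometry

# Example: a 16-frame clip at 256x256 with empty (all-zero) conditioning.
T, H, W = 16, 256, 256
cond = FrameConditioning(
    masks=torch.zeros(T, 1, H, W),
    reference=torch.zeros(T, 3, H, W),
    depth=torch.zeros(T, 1, H, W),
)
```

Because every signal carries an explicit frame dimension, the generator can be steered differently at each timestep, which is what enables temporally precise edits rather than a single global condition.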
A central focus of this project is efficient large-scale training and inference. We implement VACE using the NVIDIA Megatron Core library to leverage advanced parallelism strategies, including tensor, pipeline, context, and data parallelism. This allows the model to scale seamlessly to thousands of GPUs, making it feasible to train and deploy large video generation models with long temporal context while maintaining high throughput and memory efficiency. Together, these contributions bridge cutting-edge video modeling with production-ready distributed systems design.
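As a minimal sketch of what combining these parallelism strategies looks like in practice, the snippet below initializes Megatron Core's parallel state with tensor, pipeline, and context parallelism; the data-parallel size is then implied by the remaining factor of the world size. It assumes a recent Megatron Core release that exposes `context_parallel_size`, and a standard launcher such as `torchrun` setting up the distributed environment; exact argument names may vary across versions.

```python
# Minimal sketch: setting up Megatron Core process groups for
# tensor / pipeline / context / data parallelism (run under torchrun).
import torch
from megatron.core import parallel_state


def init_parallelism(tp: int, pp: int, cp: int) -> None:
    """Initialize process groups; data-parallel size = world_size / (tp * pp * cp)."""
    torch.distributed.init_process_group(backend="nccl")
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=tp,    # shard attention/MLP weights within a layer
        pipeline_model_parallel_size=pp,  # split transformer layers into pipeline stages
        context_parallel_size=cp,         # split the long video token sequence across GPUs
    )


# Example: 8 GPUs per node, tp=2, pp=2, cp=2 -> data-parallel size fills the rest.
if __name__ == "__main__":
    init_parallelism(tp=2, pp=2, cp=2)
```

Context parallelism is the piece most specific to video: it partitions the very long spatio-temporal token sequence across devices, which keeps per-GPU activation memory bounded as the temporal context grows.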
