Dataset
The primary dataset we use for training is the Mixkit subset of the Open-Sora-Plan¹ dataset, a high-quality open-source video dataset created as part of the Open-Sora-Plan project to reproduce OpenAI’s Sora model. It consists of 1,234 videos with a total duration of about 6h 19m 32s and a total of 570,815 frames.
Input data consists of sequences of:
- RGB video frames at 1920×1080 or 1080×1920 resolution.
- Text captions generated as descriptions of the videos (a minimal loading sketch follows the list).
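As a rough illustration of how such video–caption pairs might be wrapped for training, the sketch below assumes a hypothetical layout of one .mp4 file per clip plus a captions.json file mapping filenames to captions; it is not the actual Open-Sora-Plan loader.

```python
# Hypothetical video-caption dataset wrapper (assumed layout: one .mp4 per
# clip plus a captions.json mapping filename -> caption string).
import json
from pathlib import Path

from torch.utils.data import Dataset
from torchvision.io import read_video


class VideoCaptionDataset(Dataset):
    def __init__(self, root: str, captions_file: str, num_frames: int = 16):
        self.root = Path(root)
        self.captions = json.loads(Path(captions_file).read_text())
        self.clips = sorted(self.captions.keys())
        self.num_frames = num_frames

    def __len__(self):
        return len(self.clips)

    def __getitem__(self, idx):
        name = self.clips[idx]
        # Decode RGB frames as (T, H, W, C) uint8, then move channels first.
        frames, _, _ = read_video(str(self.root / name), pts_unit="sec")
        frames = frames[: self.num_frames].permute(0, 3, 1, 2).float() / 255.0
        return {"video": frames, "caption": self.captions[name]}
```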


Inference
To validate the correctness of our Megatron-based implementation of the VACE-Wan model, we first perform inference without any parallelism using both the Hugging Face VACE checkpoint and the Megatron Core–compatible checkpoint. Using the same random seed, both implementations produce identical output videos. We then conduct inference using tensor parallelism alone and context parallelism alone to evaluate and compare the memory usage and runtime characteristics of these different parallelization strategies.
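The equivalence check amounts to seeding both runs identically and comparing the decoded frames. In the sketch below, `generate_hf` and `generate_megatron` are placeholder callables standing in for the two inference entry points, not actual APIs of either codebase.

```python
# Seed-matched equivalence check between the two checkpoints. The generate_*
# callables are placeholders for the HF and Megatron inference entry points.
import torch


def seeded_generate(generate_fn, prompt: str, seed: int = 42) -> torch.Tensor:
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    return generate_fn(prompt)  # assumed to return a (T, C, H, W) video tensor


def outputs_match(generate_hf, generate_megatron, prompt: str) -> bool:
    ref = seeded_generate(generate_hf, prompt)
    out = seeded_generate(generate_megatron, prompt)
    # Identical weights plus identical RNG state should yield identical frames.
    return torch.equal(ref.cpu(), out.cpu())
```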
Finetuning
After converting the Hugging Face VACE checkpoint from its original PyTorch format into a Megatron Core–compatible checkpoint, we fine-tune the model using our custom Megatron training recipe. We enable efficient distributed training across multiple GPUs by leveraging Megatron Core’s support for tensor, pipeline, context, and data parallelism, which allows us to scale sequence length and model size without exceeding memory limits. We also use mixed-precision computation and optimized attention kernels to maximize throughput and stability. During fine-tuning, we freeze the pretrained DiT backbone and train only the context-adapter layer, adapting the model to support per-frame control signals and task-specific objectives. Using this recipe, we fine-tune the model primarily for two tasks:
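A minimal sketch of the freeze-and-train setup is shown below; the `context_adapter` attribute name and the optimizer hyperparameters are illustrative assumptions rather than the exact recipe.

```python
# Illustrative freezing logic: keep the pretrained DiT backbone fixed and
# train only the context-adapter parameters. Attribute names are placeholders.
import torch


def configure_trainable(model: torch.nn.Module) -> torch.optim.Optimizer:
    for p in model.parameters():
        p.requires_grad = False                    # freeze everything first
    for p in model.context_adapter.parameters():   # hypothetical submodule
        p.requires_grad = True                     # unfreeze only the adapter

    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-5, weight_decay=0.01)
```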
Reference-Image2Video Generation
For image-to-video (I2V) generation, we condition the model on a single reference image provided as the first frame, with the remaining frames initialized as noise. The reference image is encoded into the latent space and injected through the context-adapter layer into the DiT backbone.
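A rough sketch of how that conditioning input could be assembled, assuming a hypothetical `vae_encode` helper and a (T, C, h, w) latent layout:

```python
# Build the I2V conditioning sequence: frame 0 is the encoded reference image,
# the remaining frames are pure noise. vae_encode is a placeholder helper.
import torch


def build_i2v_condition(ref_image: torch.Tensor, vae_encode, num_frames: int):
    ref_latent = vae_encode(ref_image.unsqueeze(0))      # (1, C, h, w)
    noise = torch.randn(num_frames - 1, *ref_latent.shape[1:],
                        device=ref_latent.device)
    # Per-frame control sequence that is injected through the context adapter.
    return torch.cat([ref_latent, noise], dim=0)         # (T, C, h, w)
```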
Video Inpainting
For video inpainting, we train the model to reconstruct missing or corrupted regions using spatio-temporal masks provided at each frame. The masked video latents, along with their corresponding binary masks, are used as per-frame control inputs through the context-adapter. The model learns to jointly reason over spatial structure and temporal continuity, filling in occluded regions while maintaining consistency across frames. The loss is computed selectively on masked regions, encouraging reconstruction of missing content without altering unmasked areas.
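A minimal sketch of such a mask-weighted reconstruction loss, assuming binary masks broadcastable to the latent shape:

```python
# Mask-weighted MSE: the error is averaged only over masked (missing) latent
# positions, so unmasked regions do not contribute to the gradient.
import torch
import torch.nn.functional as F


def masked_mse(pred: torch.Tensor, target: torch.Tensor,
               mask: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # mask is 1 where content was removed and must be reconstructed.
    err = F.mse_loss(pred, target, reduction="none")
    mask = mask.expand_as(err)          # broadcast over channels if needed
    return (err * mask).sum() / (mask.sum() + eps)
```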
Results
Qualitative Results
Inference
The source video used for demonstration, shown below, depicts two cats boxing on a stage. The text prompt is “Two dogs fight each other during boxing”.
When conditioned on the depth video, as shown below, the VACE model performs decently, not only aligning with the text description but also maintaining strong temporal consistency and fluid motion.
When conditioned on the flow video, as shown below, the VACE model achieves the best performance, producing a clearer background and reduced distortion of the gloves.
However, when conditioned on the pose video, as shown below, the VACE model fails to achieve comparable quality to the depth and flow conditions, exhibiting deformed or missing gloves and inconsistent motion. Upon closer inspection, we observe that the pose video deviates significantly from the actual poses in the source video, indicating the need for a more accurate off-the-shelf pose estimation model. This also suggests that the VACE model is sensitive to the quality of the conditioning video.
When additionally conditioned on a reference image of a golden retriever, the VACE model successfully replaces the dog on the right with one exhibiting golden fur.

Finetuning
Reference-Image2Video generation results
Inpainting results
Quantitative Results
| Mode | Model | Setup | GPU Memory (per GPU) | Runtime |
|---|---|---|---|---|
| Inference | VACE-Base | 1× GPU (HuggingFace) | 28 GB | 3m 13s |
| Inference | VACE-Base | 1× GPU (Megatron) | 24.8 GB | 2m 19s |
| Inference | VACE-TP | 2× GPUs (Megatron) | 31.5 GB | 1m 41s |
| Inference | VACE-CP | 2× GPUs (Megatron) | 34.4 GB | 1m 19s |
| Training | VACE-Base | 1× GPU (Megatron) | 28.31 GB | 56.68s |
| Training | VACE-Base | 2× GPUs (Megatron) | 23.95 GB | 27.15s |
| Training | VACE-TP | 2× GPUs (Megatron) | 18.15 GB | 55.9s |
| Training | VACE-CP | 2× GPUs (Megatron) | 18.26 GB | 56.02s |
The results demonstrate that converting VACE to Megatron Core yields substantial efficiency gains in both inference and training. For inference on a single GPU, the Megatron implementation reduces peak memory usage (28 GB → 24.8 GB) while also improving runtime by about 1.4× compared to the Hugging Face baseline. Scaling to two GPUs further highlights the benefits of distributed execution: tensor parallelism (TP) and context parallelism (CP) significantly reduce end-to-end inference time, with CP achieving the fastest runtime (1m 19s) by sharding long video sequences across GPUs, albeit at the cost of higher per-GPU memory usage.
During training, Megatron enables efficient multi-GPU scaling even when fine-tuning only the context-adapter layers. Moving from one to two GPUs roughly halves the runtime while reducing per-GPU memory consumption, demonstrating effective data and model parallelism. TP and CP configurations further lower memory usage by distributing activations and sequence context, trading off longer runtimes due to increased communication overhead. Overall, these results show that Megatron Core provides a flexible set of parallelism strategies that allow VACE to balance memory efficiency, throughput, and scalability across different training and inference regimes.
References
- Open-Sora-Plan v1.0.0 dataset, LanguageBind. https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.0.0 ↩︎
