Model
Backbone – WAN 2.1

Wan 2.1 is a state-of-the-art video generation model capable of producing high-quality, temporally coherent videos from text prompts alone or from text prompts along with reference images. It leverages large-scale video–language pretraining, diffusion-based generative modeling, and advanced spatiotemporal attention to ensure both visual realism and motion consistency.

VACE – WAN
VACE (All-in-One Video Creation and Editing) is a unified multimodal framework designed to handle all major video generation and editing tasks within a single architecture. It covers tasks including reference-to-video generation (R2V), video-to-video editing (V2V), and masked video-to-video editing (MV2V), and allows users to compose these tasks freely. This composability streamlines workflows and unlocks a range of capabilities, such as Move-Anything, Swap-Anything, Reference-Anything, Expand-Anything, Animate-Anything, and more.

The figure above illustrates controllable video generation using the VACE framework. At the bottom, the model receives content inputs such as text and a reference image. At the top, it receives structural control signals—including depth, surface normals, bounding boxes, and pose—which enforce geometry, motion, and spatial layout. All inputs are processed together by the Wan model to produce a video output that follows both the content description and the specified structural constraints.

The figure above illustrates controllable video editing using the VACE framework. At the bottom, the model receives content inputs such as text and original video frames. At the top, it receives editing control signals—including reference images, inpainting masks, and outpainting regions—which specify how portions of the video should be modified. All inputs are processed together by the Wan model to produce a video output that reflects both the original content and the applied editing constraints.
WAN DiT Diagram

The diagram illustrates the core architecture of the Wan model. Video inputs are converted into latent representations using the VAE encoder, while text prompts are converted into tokens through a text tokenizer. These latent video tokens and text tokens are then jointly processed by a stack of Transformer blocks, which iteratively refine and denoise the video latents. After passing through all Transformer layers, the model produces clean video latents, which are subsequently decoded by the VAE decoder to generate the final video frames. This design—latent-space diffusion combined with a transformer denoising backbone—enables Wan to perform efficient, high-quality video generation conditioned on text.
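To make this concrete, here is a minimal sketch of such a latent-diffusion generation loop. The `dit`, `vae`, and `text_encoder` callables, the flow-matching-style Euler update, and the linear time schedule are illustrative assumptions, not Wan's actual implementation.

```python
import torch

@torch.no_grad()
def generate_video(dit, vae, text_encoder, prompt, latent_shape, num_steps=50):
    """Illustrative Wan-style text-to-video generation in latent space."""
    text_tokens = text_encoder(prompt)       # text prompt -> conditioning tokens
    latents = torch.randn(latent_shape)      # start from Gaussian noise in the VAE latent space

    # The DiT stack iteratively refines/denoises the video latents.
    for step in range(num_steps):
        t = 1.0 - step / num_steps                               # simple linear time schedule
        velocity = dit(latents, timestep=t, context=text_tokens)
        latents = latents - velocity / num_steps                  # Euler-style update

    return vae.decode(latents)               # clean latents -> pixel-space video frames
```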
VACE WAN Diagram

The VACE framework extends the Wan DiT backbone by introducing context tokens to incorporate fine-grained control into video generation and editing. These context tokens are derived from additional conditioning inputs, such as control videos and editing masks, and are designed to explicitly encode information beyond the natural video content.
To stabilize training and handle heterogeneous visual inputs, the framework introduces concept decoupling to explicitly separate natural video content from control signals. Using a mask, the input frames are decomposed into two spatiotemporally aligned sequences: reactive frames, which contain the pixels to be modified, and inactive frames, which store the pixels to be preserved.
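A minimal sketch of this decomposition, assuming binary masks and a simple (T, C, H, W) frame layout (the function name and tensor layout are assumptions, not VACE's actual code):

```python
import torch

def concept_decouple(frames: torch.Tensor, mask: torch.Tensor):
    """Split input frames into reactive (to be modified) and inactive (to be preserved) parts.

    frames: (T, C, H, W) video frames
    mask:   (T, 1, H, W) binary mask, 1 where pixels should be edited
    """
    reactive = frames * mask          # pixels the model is allowed to change
    inactive = frames * (1 - mask)    # pixels that must be preserved
    return reactive, inactive
```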
For context latent encoding, the reactive frames, the inactive frames, and the mask are mapped into the same latent space as the noisy video latents used by DiT. The reactive and inactive frames are encoded using the video VAE to preserve spatiotemporal consistency, while reference images are encoded separately and concatenated along the temporal dimension with appropriate handling during decoding. The mask is reshaped and interpolated to match the latent resolution. As a result, all context components are aligned in a unified latent representation with the same spatiotemporal shape as the noisy video latents.
Finally, the Context Embedder is introduced by extending the original embedding layer. The latent representations of the reactive frames, the inactive frames, and the mask are concatenated along the channel dimension and tokenized into context tokens. The embedding weights for the frame latents are reused from the original video embedder, while the weights for the mask channels are initialized to zero. This design enables efficient integration of context information without disrupting the pretrained video embedding structure.
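The following sketch illustrates the idea, assuming a Conv3d patch embedder, a single mask channel, and hypothetical module names; in the real model the frame-latent weight slices would be copied from the pretrained video embedder rather than randomly initialized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextEmbedder(nn.Module):
    """Illustrative VACE-style context embedder over concatenated context latents."""

    def __init__(self, latent_channels: int, dim: int, patch_size=(1, 2, 2)):
        super().__init__()
        in_channels = 2 * latent_channels + 1            # reactive + inactive latents + mask
        self.proj = nn.Conv3d(in_channels, dim, kernel_size=patch_size, stride=patch_size)
        # Zero-initialize the weight slice that reads the mask channel, so the embedder
        # initially behaves like the pretrained video embedder it extends.
        with torch.no_grad():
            self.proj.weight[:, 2 * latent_channels:].zero_()

    def forward(self, reactive_latent, inactive_latent, mask):
        # (B, C, T', H', W') latents; resize the mask to the latent resolution.
        mask = F.interpolate(mask, size=reactive_latent.shape[-3:], mode="nearest")
        context = torch.cat([reactive_latent, inactive_latent, mask], dim=1)
        return self.proj(context).flatten(2).transpose(1, 2)   # (B, N, dim) context tokens
```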
VACE Training

To incorporate the VCU (Video Condition Unit) as input, one straightforward approach is to fully fine-tune the entire DiT model. In this setup, context tokens are concatenated with the noisy video tokens, and all parameters—including those of the original DiT and the newly introduced Context Embedder—are updated during training. While effective, this approach is computationally expensive and tightly couples the context capability to the foundation model.
To enable faster convergence and a more modular, plug-and-play design, an alternative Context Adapter Tuning strategy is proposed. Instead of modifying the main DiT, several Transformer blocks are copied from the original DiT to form a set of lightweight, cascaded Context Blocks. The frozen DiT processes video and text tokens as usual, while the Context Blocks process context tokens together with text tokens. The outputs of the Context Blocks are then injected back into the DiT blocks as additive residual signals, guiding generation and editing. In this design, only the Context Embedder and Context Blocks are trainable, allowing efficient adaptation without altering the core foundation model.
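A simplified sketch of this adapter-style arrangement is shown below; the module names, the choice of which blocks are copied, and the injection points are illustrative assumptions rather than the official VACE implementation.

```python
import copy
import torch.nn as nn

class VACEAdapterDiT(nn.Module):
    """Illustrative Context Adapter Tuning: frozen DiT plus trainable Context Blocks."""

    def __init__(self, dit_blocks: nn.ModuleList, context_block_ids, context_embedder):
        super().__init__()
        self.dit_blocks = dit_blocks                 # pretrained DiT blocks, kept frozen
        for p in self.dit_blocks.parameters():
            p.requires_grad_(False)

        # Copy selected DiT blocks to serve as trainable, cascaded Context Blocks.
        self.context_blocks = nn.ModuleList(
            copy.deepcopy(dit_blocks[i]) for i in context_block_ids
        )
        self.context_block_ids = list(context_block_ids)
        self.context_embedder = context_embedder     # also trainable

    def forward(self, video_tokens, text_tokens, context_latents):
        ctx = self.context_embedder(*context_latents)         # context tokens
        hints = {}
        # Context Blocks process context tokens together with text tokens.
        for idx, block in zip(self.context_block_ids, self.context_blocks):
            ctx = block(ctx, text_tokens)
            hints[idx] = ctx

        x = video_tokens
        # The frozen DiT runs as usual; hints are injected as additive residuals.
        for i, block in enumerate(self.dit_blocks):
            x = block(x, text_tokens)
            if i in hints:
                x = x + hints[i]
        return x
```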
Megatron-LM
Megatron-LM is a large-scale deep learning framework developed by NVIDIA for efficiently training and serving very large Transformer models. It is designed to scale to models with billions to trillions of parameters by combining multiple forms of parallelism—data parallelism, tensor parallelism, context (sequence) parallelism, and pipeline parallelism—within a single unified system. Megatron-LM provides optimized Transformer building blocks (e.g., parallel linear layers, attention modules, fused kernels) and is widely used as the foundation for training state-of-the-art large language and multimodal models on multi-GPU and multi-node clusters.
Tensor Parallelism

Tensor parallelism for the MLP layer works by splitting the large matrix multiplications across multiple GPUs so they can be computed efficiently in parallel. The MLP contains two projections, Y = GeLU(XA) followed by Z = YB. Under tensor parallelism, the first weight matrix A is column-sharded as A = [A1, A2], dividing the intermediate hidden dimension across GPUs, while the second matrix B is row-sharded to match that split.
In the first projection, the input X is broadcast to all GPUs, and each GPU computes a partial activation such as Y1 = GeLU(XA1) and Y2 = GeLU(XA2). These partial results together form the full intermediate representation Y = [Y1, Y2]. In the second projection, each GPU multiplies its own hidden slice with the corresponding shard of B, producing partial outputs Z1 = Y1B1 and Z2 = Y2B2.
Finally, an All-Reduce operation is used to sum the partial outputs into the complete output Z = Z1 + Z2. This approach distributes the computational workload across devices while requiring only minimal synchronization, enabling extremely large MLP layers to be trained at scale.
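A minimal forward-pass sketch of this scheme using torch.distributed (not Megatron-LM's actual ColumnParallelLinear/RowParallelLinear classes, and omitting the custom autograd hooks a real implementation needs for backward-pass communication):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F

class TensorParallelMLP(nn.Module):
    """Sketch of a tensor-parallel MLP: Y = GeLU(XA), Z = YB.

    A is column-sharded and B is row-sharded across the tensor-parallel group.
    Assumes torch.distributed is already initialized.
    """

    def __init__(self, hidden: int, ffn_hidden: int, tp_group=None):
        super().__init__()
        self.tp_group = tp_group
        world = dist.get_world_size(tp_group)
        assert ffn_hidden % world == 0
        shard = ffn_hidden // world
        self.A_shard = nn.Parameter(torch.randn(hidden, shard) * 0.02)  # column shard of A
        self.B_shard = nn.Parameter(torch.randn(shard, hidden) * 0.02)  # row shard of B

    def forward(self, x):
        # The input x is replicated (broadcast) on every GPU in the group.
        y_local = F.gelu(x @ self.A_shard)        # partial activation Y_i = GeLU(X A_i)
        z_partial = y_local @ self.B_shard        # partial output Z_i = Y_i B_i
        # Sum partial outputs across GPUs: Z = Z_1 + Z_2 + ...
        dist.all_reduce(z_partial, op=dist.ReduceOp.SUM, group=self.tp_group)
        return z_partial
```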

Tensor parallelism for the attention layers works by splitting the attention heads across multiple GPUs so that each device computes attention for only a subset of heads. For self-attention, the input is first broadcast to all GPUs, where each GPU independently computes its own shard of the query, key, and value projections.
Once queries, keys, and values are computed, each GPU performs attention independently within its own head subset. This includes computing attention scores, applying softmax, performing dropout, and forming the partial attention output O1 or O2. After local attention outputs are produced, they are each passed through the corresponding shard of the row-sharded output projection matrix B, producing partial results Z1 = O1B1 and Z2 = O2B2.
To obtain the final output Z = Z1 + Z2, an All-Reduce operation is used to sum these partial outputs across GPUs. This mirrors the behavior of tensor parallelism in the MLP layer and introduces the only communication step in the attention layer’s forward pass.
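An analogous head-sharded sketch of the attention path, again forward-pass only, with assumed module names and dropout omitted for brevity:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F

class TensorParallelSelfAttention(nn.Module):
    """Sketch of head-sharded self-attention under tensor parallelism."""

    def __init__(self, hidden: int, num_heads: int, tp_group=None):
        super().__init__()
        self.tp_group = tp_group
        world = dist.get_world_size(tp_group)
        assert num_heads % world == 0 and hidden % num_heads == 0
        self.local_heads = num_heads // world
        self.head_dim = hidden // num_heads
        local_hidden = self.local_heads * self.head_dim
        self.qkv = nn.Linear(hidden, 3 * local_hidden, bias=False)   # column-sharded Q/K/V
        self.out_proj = nn.Linear(local_hidden, hidden, bias=False)  # row shard of B

    def forward(self, x):
        b, s, _ = x.shape
        # Each GPU projects the broadcast input into its own subset of heads.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, s, self.local_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, s, self.local_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, s, self.local_heads, self.head_dim).transpose(1, 2)
        # Attention is computed independently within the local head subset.
        o = F.scaled_dot_product_attention(q, k, v)
        o = o.transpose(1, 2).reshape(b, s, -1)
        z_partial = self.out_proj(o)              # partial result Z_i = O_i B_i
        # All-Reduce sums the partial outputs: the only communication step.
        dist.all_reduce(z_partial, op=dist.ReduceOp.SUM, group=self.tp_group)
        return z_partial
```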
Context Parallelism
Context parallelism scales attention to long sequences by partitioning the input along the sequence dimension across multiple GPUs, rather than splitting model parameters or attention heads. Each GPU is assigned a contiguous block of the sequence and is responsible only for computing attention outputs for its local queries. Accordingly, GPU 1 holds Q1, K1, and V1, GPU 2 holds Q2, K2, and V2, and so on, ensuring that memory usage per device grows only with a fraction of the full context length.
The attention computation begins locally. In the initial iteration, each GPU computes attention between its local queries and its local keys and values, producing a partial output. This establishes the first contribution to each query block's final attention result without requiring any cross-GPU communication.
To incorporate the full context, Ring Attention is used to rotate key–value blocks among GPUs in a fixed ring topology. At each iteration, every GPU sends its current K and V blocks to the next GPU in the ring and receives a new block from its neighbor. Upon receiving a new shard, the GPU computes attention between its local queries and the received keys and values, accumulating the result into its output. This process repeats until all key–value blocks have circulated through the ring.
After completing the full rotation, each GPU has attended to all tokens in the global context and holds the complete attention output for its own query block. No further synchronization or reduction is required. By distributing the sequence dimension and using efficient ring-based communication, context parallelism with Ring Attention enables exact attention computation for very long sequences while keeping per-GPU memory and communication costs manageable.
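A compact sketch of this procedure using torch.distributed point-to-point operations; per-block partial results are merged with running log-sum-exp statistics so the final output matches full attention. The function names are assumptions, causal masking is omitted, and communication is not overlapped with compute as an optimized kernel would do.

```python
import torch
import torch.distributed as dist

def ring_exchange(t):
    """Send this rank's tensor to the next rank in the ring; receive from the previous rank."""
    rank, world = dist.get_rank(), dist.get_world_size()
    recv = torch.empty_like(t)
    ops = [
        dist.P2POp(dist.isend, t.contiguous(), (rank + 1) % world),
        dist.P2POp(dist.irecv, recv, (rank - 1 + world) % world),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv

def ring_attention(q_local, k_local, v_local):
    """Exact attention for this rank's query block under ring-based context parallelism.

    q_local/k_local/v_local: (batch, heads, seq_block, head_dim) shards of the sequence.
    Assumes all ranks in the default process group form the context-parallel ring.
    """
    world = dist.get_world_size()
    scale = q_local.shape[-1] ** -0.5

    out = torch.zeros_like(q_local)
    lse = torch.full(q_local.shape[:-1], float("-inf"), device=q_local.device)

    k, v = k_local, v_local
    for step in range(world):
        # Attention between the local queries and the currently held K/V block.
        scores = (q_local @ k.transpose(-2, -1)) * scale
        blk_lse = torch.logsumexp(scores, dim=-1)
        blk_out = torch.softmax(scores, dim=-1) @ v

        # Numerically stable merge of the running output with the new partial result.
        new_lse = torch.logaddexp(lse, blk_lse)
        out = (out * (lse - new_lse).exp().unsqueeze(-1)
               + blk_out * (blk_lse - new_lse).exp().unsqueeze(-1))
        lse = new_lse

        if step < world - 1:
            k, v = ring_exchange(k), ring_exchange(v)   # rotate K/V to the next rank
    return out
```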
