Introduction
Motivation
High-quality video diffusion models like Wan 2.1 are promising, but their 50-step generation is slow and expensive: roughly 220 seconds to generate a 480p, 80-frame video in our setup. This makes them difficult to use for applications that need rapid iteration or large-scale sampling, where we want much faster inference without catastrophic quality loss.
In this project, in collaboration with NVIDIA and Pika, our core motivation is to build a practical distillation pipeline that Pika can deploy to cut inference costs for their video generation services, while maintaining the visual quality their users expect.
Naive attempts to simply compress the teacher’s denoising trajectory into fewer steps tend to produce over-smoothed, low-contrast results and still rely on costly teacher generation. This motivates our use of Distribution Matching Distillation (DMD), which matches distributions rather than trajectories to train a 4-step student generator.
Method
Distribution Matching Distillation

To accelerate Wan while keeping its visual quality, we build on Distribution Matching Distillation (DMD), introduced by Yin et al. in “One-step Diffusion with Distribution Matching Distillation” (CVPR 2024) and extended in “Improved Distribution Matching Distillation for Fast Image Synthesis” (NeurIPS 2024).
At a high level, DMD provides a way to distill a slow, multi-step diffusion model into a fast few-step generator by matching distributions rather than individual sampling trajectories.
Setup
- Teacher (real score): a pre-trained, many-step diffusion model (Wan 2.1 in our case) that we treat as the target distribution.
- Student (generator): a generator that produces samples in 4 denoising steps. This is the model we ultimately want to deploy.
- Critic (fake score): a diffusion-style network that estimates the score of the student's distribution at different noise levels.
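To make the student's role concrete, here is a toy sketch of what 4-step sampling can look like. Everything here is an assumption for illustration: the timestep schedule, the noise levels, and the `student_denoise` placeholder are invented names, not Wan's actual interface.

```python
import numpy as np

# Hypothetical 4-step schedule and per-step noise levels (illustrative only).
TIMESTEPS = [1000, 750, 500, 250]
SIGMAS = [1.0, 0.75, 0.5, 0.25]

def student_denoise(x_t, t):
    """Stand-in for the few-step generator: predicts a clean sample from a
    noisy input. The real model is a large video diffusion transformer."""
    return x_t * 0.0  # placeholder prediction

def sample(shape, rng):
    """Draw one sample with 4 student forward passes."""
    x = rng.normal(size=shape)              # start from pure noise
    for i, t in enumerate(TIMESTEPS):
        x0 = student_denoise(x, t)          # one student forward pass
        if i + 1 < len(TIMESTEPS):
            # re-noise the clean prediction down to the next noise level,
            # then denoise again at the next timestep
            x = x0 + SIGMAS[i + 1] * rng.normal(size=shape)
        else:
            x = x0                          # final step: keep the prediction
    return x
```

The point of the sketch is the control flow: four forward passes of a single network replace the teacher's 50-step sampler, which is where the inference savings come from.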
Key idea: match distributions, not trajectories
Traditional distillation for diffusion often tries to make the student mimic the teacher’s denoising trajectory step by step. This tends to:
- Over-constrain the student to specific paths in noise space
- Require large precomputed datasets of teacher trajectories
- Lead to over-smoothed, low-contrast generations when aggressively reducing steps
DMD takes a different approach:
- The student is trained so that the distribution of its outputs matches the teacher’s output distribution.
- There is no requirement that a specific student sample follow the same timestep path as any particular teacher sample; we only care that, over many samples, the student's outputs and the teacher's outputs follow the same distribution.
- Practically, this is implemented by computing gradients from two score functions (one for the teacher / “real” distribution, one for the student / “fake” distribution) and using their difference to nudge the student toward higher realism.
In our implementation, we also avoid precomputing a large teacher dataset of trajectories and the heavy regression losses that come with it, following ideas from improved DMD variants. This makes training more storage-friendly and scalable, which is crucial in the video setting.
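To make the score-difference idea concrete, here is a minimal 1-D toy, not the actual Wan/DMD implementation: the teacher distribution is a unit-variance Gaussian whose score we can write in closed form, and the distribution-matching gradient alone pulls a student parameter onto it. All names and values are assumptions for illustration.

```python
import numpy as np

# Toy setup: teacher distribution N(mu_real, 1); student generator
# g(z) = mu_gen + z, so the student's output distribution is N(mu_gen, 1).
# The score of N(mu, 1) at x is (mu - x).
rng = np.random.default_rng(0)
mu_real = 2.0    # "teacher" mean (real score network in the full method)
mu_gen = -1.0    # student parameter we train
lr = 0.1

for _ in range(200):
    z = rng.normal(size=256)
    x = mu_gen + z                    # samples from the student
    s_real = mu_real - x              # real (teacher) score at x
    s_fake = mu_gen - x               # fake (critic) score at x; in the full
                                      # method the critic is trained online
                                      # to track the student's distribution
    grad = (s_real - s_fake).mean()   # distribution-matching gradient
    mu_gen += lr * grad               # nudge the student toward the teacher

print(round(mu_gen, 3))  # → 2.0: the student's distribution matches the teacher's
```

Note that no individual sample is regressed onto a teacher trajectory; the update only uses the difference of the two scores evaluated at the student's own samples, which is exactly the property that lets us skip precomputed teacher data.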
Our Pipelines
We explore two distillation pipelines for fast video generation.
First, we directly distill the base Wan 2.1 model with Distribution Matching Distillation, turning a slow 50-step teacher into a 4-step student that keeps most of the visual quality while cutting inference time dramatically.
Second, in collaboration with NVIDIA and Pika, we finetune Wan on a custom "Squish" video effect and then distill this effect-specific model into a few-step generator, showing that the same DMD pipeline can power both general-purpose and commercial, effect-specialized video generation.
Results
General-purpose few-step generator samples
Effect-specific few-step generator samples
Inference speed comparison
We also measure the inference speed-up after distillation. These results are measured by running inference for 480p videos on a single A100 GPU.
| Model | Inference steps | Inference time per video | Relative speed |
| --- | --- | --- | --- |
| Wan base model | 50 | 219.7s | 1x |
| Wan DMD student | 4 | 14.2s | 15.4x |
