Introduction

Fig. 1 The overall pipeline of our Multi-modal Video Generation Model. There are two possible applications: 1) generate a multi-modal video given a text prompt; 2) generate a multi-modal video based on an RGB-only image/video input.

Motivation

  • Recent video generation models have emerged as strong world foundation models, serving as virtual environments in which physical AI can be trained.
  • RGB generation alone isn’t enough – a world model should generate additional modalities that mirror the real world, e.g. depth.

Task

  • Given an RGB video input and/or a text prompt, jointly generate:
    • An RGB sequence
    • A depth sequence
  • The RGB and depth sequences should be physically plausible and mutually coherent.

Joint generation is preferred over chaining an RGB video generator with a downstream RGB-conditioned model for depth or other modalities. A single model offers faster inference and avoids information loss between stages, and we hope that joint generation will better capture the intrinsic correlations between appearance and geometry.
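To make the contrast concrete, the sketch below illustrates the two interfaces with placeholder tensors: a single call that emits RGB and depth together versus a cascaded baseline that generates RGB first and then runs a separate depth stage. All function names, shapes, and bodies are illustrative assumptions, not the actual model API.

```python
import numpy as np

def generate_joint(prompt: str, num_frames: int = 16, height: int = 64, width: int = 64):
    """Hypothetical joint interface: one model emits RGB and depth together,
    so both modalities come from a shared generative process."""
    rng = np.random.default_rng(0)
    rgb = rng.random((num_frames, height, width, 3), dtype=np.float32)    # RGB frames in [0, 1]
    depth = rng.random((num_frames, height, width, 1), dtype=np.float32)  # per-pixel depth
    return rgb, depth

def generate_cascaded(prompt: str, num_frames: int = 16, height: int = 64, width: int = 64):
    """Hypothetical two-stage baseline: generate RGB first, then run a
    separate RGB-conditioned depth estimator on the finished frames."""
    rng = np.random.default_rng(0)
    rgb = rng.random((num_frames, height, width, 3), dtype=np.float32)
    depth = rgb.mean(axis=-1, keepdims=True)  # stand-in for a downstream depth model
    return rgb, depth

if __name__ == "__main__":
    rgb, depth = generate_joint("a car driving through a rainy city street")
    print(rgb.shape, depth.shape)  # (16, 64, 64, 3) (16, 64, 64, 1)
```

In the cascaded variant, the depth stage only ever sees the finished RGB frames, so any geometric information discarded during RGB generation cannot be recovered; the joint interface avoids that bottleneck by construction.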