
Motivation
- Recent video generation models have emerged as strong world foundation models, serving as virtual environments in which physical AI can be trained.
- RGB generation alone isn't enough – we want a world model that can generate additional modalities, e.g. depth, to better mimic the real world.
Task
- Given an input RGB video and/or a text prompt, jointly generate:
- An RGB sequence
- A depth sequence
- The RGB and depth sequences should be physically plausible and mutually coherent
Joint generation is preferable to chaining an RGB video generator with a downstream RGB-conditioned model for depth or other modalities. A single model offers faster inference and avoids information loss between stages, and we expect joint generation to better capture the intrinsic correlations between appearance and geometry.
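To make the task's input/output contract concrete, here is a minimal interface sketch. All names (`JointSample`, `generate_joint`) are hypothetical illustrations, not part of any existing model; the body is a random placeholder standing in for the actual joint sampler. The key point it encodes is the contract itself: one call, optionally conditioned on a prompt and/or an RGB video, returns RGB and depth sequences with the same number of temporally aligned frames.

```python
from dataclasses import dataclass
from typing import Optional
import random

@dataclass
class JointSample:
    # Hypothetical container: rgb is num_frames x H x W x 3 in [0, 1],
    # depth is num_frames x H x W (e.g. metric depth per pixel).
    rgb: list
    depth: list

def generate_joint(prompt: Optional[str] = None,
                   rgb_condition: Optional[list] = None,
                   num_frames: int = 4,
                   height: int = 2,
                   width: int = 2,
                   seed: int = 0) -> JointSample:
    """Sketch of a joint RGB+depth sampler (placeholder logic).

    A real world model would run a single generative pass here so that
    appearance and geometry are sampled together; this stub just emits
    random tensors with the right shapes and frame alignment.
    """
    rng = random.Random(seed)
    rgb = [[[[rng.random() for _ in range(3)]
             for _ in range(width)]
            for _ in range(height)]
           for _ in range(num_frames)]
    depth = [[[rng.random() * 10.0  # placeholder depth values
               for _ in range(width)]
              for _ in range(height)]
             for _ in range(num_frames)]
    return JointSample(rgb=rgb, depth=depth)

sample = generate_joint(prompt="a car driving down a street", num_frames=4)
# The modalities come from one call and are frame-aligned by construction.
assert len(sample.rgb) == len(sample.depth) == 4
```

In the chained alternative, the depth model only sees the rendered RGB frames, so any geometric information the generator had internally is lost at the interface; a joint sampler avoids that bottleneck by design.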