Introduction

Fig. 1 The overall pipeline of our Multi-modal Video Generation Model. There are two possible applications: 1) generate a multi-modal video given a text prompt; 2) generate a multi-modal video based on an RGB-only image/video input.

Motivation

  • Recent video generation models have emerged as strong world foundation models, serving as virtual environments in which physical AI can be trained.
  • RGB generation alone isn’t enough – a world model should generate additional modalities that mirror the real world, e.g. depth.

Task

  • Given an RGB video input and/or a text prompt, jointly generate:
    • An RGB sequence
    • A depth sequence
  • The RGB and depth sequences should be physically plausible and mutually coherent.

Joint generation is preferred over chaining an RGB video generator with a downstream RGB-conditioned model for depth or other modalities. A single model offers faster inference and avoids information loss between stages, and we hope that joint generation will better capture the intrinsic correlations between appearance and geometry.
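To make the contrast concrete, the sketch below illustrates the two interfaces with placeholder tensors: a single call that emits RGB and depth together versus a cascaded baseline that generates RGB first and then runs a separate depth stage. All function names, shapes, and bodies are illustrative assumptions, not the actual model API.

```python
import numpy as np

def generate_joint(prompt: str, num_frames: int = 16, height: int = 64, width: int = 64):
    """Hypothetical joint interface: one model emits RGB and depth together,
    so both modalities come from a shared generative process."""
    rng = np.random.default_rng(0)
    rgb = rng.random((num_frames, height, width, 3), dtype=np.float32)    # RGB frames in [0, 1]
    depth = rng.random((num_frames, height, width, 1), dtype=np.float32)  # per-pixel depth
    return rgb, depth

def generate_cascaded(prompt: str, num_frames: int = 16, height: int = 64, width: int = 64):
    """Hypothetical two-stage baseline: generate RGB first, then run a
    separate RGB-conditioned depth estimator on the finished frames."""
    rng = np.random.default_rng(0)
    rgb = rng.random((num_frames, height, width, 3), dtype=np.float32)
    depth = rgb.mean(axis=-1, keepdims=True)  # stand-in for a downstream depth model
    return rgb, depth

if __name__ == "__main__":
    rgb, depth = generate_joint("a car driving through a rainy city street")
    print(rgb.shape, depth.shape)  # (16, 64, 64, 3) (16, 64, 64, 1)
```

In the cascaded variant, the depth stage only ever sees the finished RGB frames, so any geometric information discarded during RGB generation cannot be recovered; the joint interface avoids that bottleneck by construction.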