Overview

Motivation
Recent video generation models have emerged as strong world foundation models, acting as virtual environments in which physical AI can grow and learn. They generate high-quality RGB videos that serve as the visual representation of the world.
However, generating RGB alone is not enough. If we want a world model that truly mimics our real world, it should generate additional modalities, such as depth, that capture underlying geometry and physical structure.
In this project, we consider the task of joint multimodal generation:
Given an RGB video input and/or a text prompt, the model should jointly generate:
- An RGB sequence
- A depth sequence
These sequences should be physically plausible and mutually coherent: the depth should explain the RGB, and the RGB should be consistent with the implied 3D structure.
We specifically aim for joint generation within a single model, rather than first generating RGB video and then running a separate RGB-conditioned depth model. A unified model that generates both appearance and geometry encourages a shared, geometry-aware representation of the world.
Architecture
We build on NVIDIA’s Cosmos Diffusion-based World Foundation Models, with the following architecture:



Method
Our method consists of two stages:
- Training the Cosmos video tokenizer to additionally encode depth information
- Training the diffusion transformer to generate tokens that can be decoded into RGBD videos
Stage 1: Autoencoder Training

We add one input channel and one output channel to the video tokenizer so it can process depth. The tokenizer is then trained on paired RGBD videos from synthetic datasets, using affine-invariant disparity (normalized per sample to a median of 0 and a mean absolute deviation of 0.5). In addition to the standard reconstruction losses used in autoencoder training, we adopt a distillation loss that aligns our RGBD latent space with the original RGB latent space, so that the diffusion priors of the original Cosmos video generation model are preserved.
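The sketch below illustrates, in PyTorch-style pseudocode, the two ingredients described above: the affine-invariant disparity normalization and a combined reconstruction-plus-distillation objective. The tokenizer interfaces (`rgbd_tokenizer.encode`/`decode`, the frozen `rgb_tokenizer`), the choice of L1/MSE losses, and the loss weights are assumptions for illustration, not the exact Cosmos implementation.

```python
# Minimal sketch of Stage 1 (assumed PyTorch-style pseudocode; tokenizer
# interfaces, loss choices, and weights are hypothetical).
import torch
import torch.nn.functional as F

def affine_invariant_disparity(depth: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Convert depth to affine-invariant disparity, normalized per sample to a
    median of 0 and a mean absolute deviation of 0.5."""
    disparity = 1.0 / depth.clamp(min=eps)
    flat = disparity.flatten(1)                               # (B, N) per-sample statistics
    median = flat.median(dim=1, keepdim=True).values
    mad = (flat - median).abs().mean(dim=1, keepdim=True).clamp(min=eps)
    normalized = 0.5 * (flat - median) / mad
    return normalized.reshape(disparity.shape)

def stage1_losses(rgbd_tokenizer, rgb_tokenizer, rgb, disparity,
                  w_depth: float = 1.0, w_distill: float = 1.0):
    """Reconstruction losses on RGB and depth, plus a distillation term that
    pulls the RGBD latents toward the frozen original RGB tokenizer's latents."""
    x = torch.cat([rgb, disparity], dim=1)                    # one extra input channel for depth
    latent = rgbd_tokenizer.encode(x)
    recon = rgbd_tokenizer.decode(latent)                     # 3 RGB channels + 1 depth channel out
    rgb_rec, depth_rec = recon[:, :3], recon[:, 3:]

    loss_rgb = F.l1_loss(rgb_rec, rgb)
    loss_depth = F.l1_loss(depth_rec, disparity)

    with torch.no_grad():                                     # frozen original Cosmos tokenizer
        latent_ref = rgb_tokenizer.encode(rgb)
    loss_distill = F.mse_loss(latent, latent_ref)             # align RGBD latents with RGB latents

    return loss_rgb + w_depth * loss_depth + w_distill * loss_distill
```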
Stage 2: Diffusion Model Training

In Stage 2, we will train the diffusion transformer on paired RGBD videos. Before training, we will curate additional video data, spanning not only synthetic datasets but also pseudo-labelled real-world datasets, and generate text captions for training.
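Below is a hedged sketch of what one Stage 2 training step could look like: a generic denoising objective on RGBD latents with text conditioning. The actual Cosmos diffusion formulation (noise schedule, preconditioning, conditioning interface) is not reproduced here; `dit`, `text_encoder`, and the log-normal noise-level sampling are illustrative assumptions.

```python
# Hedged sketch of a Stage 2 training step on RGBD latents. The real Cosmos
# diffusion formulation is not reproduced; `dit`, `text_encoder`, and the
# log-normal noise-level sampling are illustrative assumptions.
import torch
import torch.nn.functional as F

def stage2_training_step(dit, rgbd_tokenizer, text_encoder, rgb, disparity, captions):
    with torch.no_grad():
        latent = rgbd_tokenizer.encode(torch.cat([rgb, disparity], dim=1))  # RGBD latent (B, C, T, H, W)
        text_emb = text_encoder(captions)                                   # text conditioning

    # Sample one noise level per clip and corrupt the clean latent.
    sigma = torch.exp(torch.randn(latent.shape[0], device=latent.device))
    noise = torch.randn_like(latent)
    noisy = latent + sigma.view(-1, 1, 1, 1, 1) * noise

    # The diffusion transformer predicts the denoised RGBD latent given the
    # noisy latent, the noise level, and the text embedding.
    denoised = dit(noisy, sigma, text_emb)
    return F.mse_loss(denoised, latent)
```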
Results
At this point we only have results from the first stage of our method, i.e., RGBD autoencoder reconstructions. For each sample shown, the left side is the input RGB and depth video, and the right side is the reconstructed RGB and depth video.
Reconstruction sample 1




Reconstruction sample 2




Reconstruction sample 3




Reconstruction sample 4




Conclusion & Future Work
As the results show, the autoencoder yields reasonable reconstructions, demonstrating that Cosmos's latent space can, in principle, support depth. However, fine details and sharp boundaries are consistently degraded, highlighting the difficulty of naïvely packing RGB and depth into a single, capacity-limited latent space.
Future directions include exploring factorized latent representations for appearance vs. geometry, increasing latent capacity, and improving loss design and supervision for depth.
Resources
Video Presentation:
https://drive.google.com/file/d/1jaXofzv7DgGcxXxojCbXH8B5QiGaznIv/view?usp=sharing
