Our method consists of two stages:
- training the Cosmos video tokenizer to additionally encode depth information
- training the diffusion transformer to generate tokens that can be decoded into RGBD videos
Stage 1: Autoencoder Training

We add one input channel and one output channel to the video tokenizer so it can process depth alongside RGB. The tokenizer is then trained on paired RGBD videos from synthetic datasets, with depth represented as affine-invariant disparity (normalized to a median of 0 and a mean deviation of 0.5). In addition to the standard reconstruction losses used in autoencoder training, we adopt a distillation loss that aligns our RGBD latent space with the original RGB latent space, so that the diffusion priors of the original Cosmos video generation model are preserved.
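As a rough illustration of the Stage 1 setup, the sketch below shows (a) zero-initialized channel inflation for the tokenizer's first convolution, (b) the affine-invariant disparity normalization, and (c) a latent distillation loss against the frozen RGB tokenizer. This is a minimal sketch under assumed interfaces: `inflate_input_conv`, the `encode`/`decode` methods, and the `distill_weight` value are illustrative placeholders, not the actual Cosmos implementation.

```python
# Minimal sketch (PyTorch); module names and layout are assumptions,
# not the actual Cosmos tokenizer code.
import torch
import torch.nn.functional as F


def inflate_input_conv(conv: torch.nn.Conv3d) -> torch.nn.Conv3d:
    """Add one input channel (depth) to a conv layer; the new weights start at
    zero so the inflated tokenizer initially reproduces the RGB-only behaviour."""
    new_conv = torch.nn.Conv3d(
        conv.in_channels + 1, conv.out_channels,
        conv.kernel_size, conv.stride, conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, : conv.in_channels] = conv.weight
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv


def normalize_disparity(disp: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Affine-invariant disparity: per-sample median 0, mean deviation 0.5."""
    flat = disp.flatten(1)
    median = flat.median(dim=1, keepdim=True).values
    dev = (flat - median).abs().mean(dim=1, keepdim=True).clamp_min(eps)
    shape = (-1,) + (1,) * (disp.dim() - 1)
    return 0.5 * (disp - median.view(shape)) / dev.view(shape)


def stage1_losses(rgbd_tokenizer, rgb_tokenizer, rgb, disp, distill_weight=0.1):
    """Reconstruction loss plus latent distillation against the frozen RGB tokenizer."""
    rgbd = torch.cat([rgb, normalize_disparity(disp)], dim=1)  # (B, 4, T, H, W)
    latent = rgbd_tokenizer.encode(rgbd)
    recon = rgbd_tokenizer.decode(latent)
    recon_loss = F.l1_loss(recon, rgbd)              # standard reconstruction term
    with torch.no_grad():
        rgb_latent = rgb_tokenizer.encode(rgb)       # frozen RGB teacher latents
    distill_loss = F.mse_loss(latent, rgb_latent)    # align RGBD and RGB latent spaces
    return recon_loss + distill_weight * distill_loss
```

Zero-initializing the new depth weights means the inflated tokenizer starts out numerically identical to the pretrained RGB tokenizer, which works together with the distillation term to keep the RGBD latents close to the original latent space.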
Stage 2: Diffusion Model Training

In Stage 2 we plan to train the diffusion transformer on paired RGBD videos. Before training, we will curate additional video data, covering not only synthetic datasets but also pseudo-labelled real datasets, and generate text captions for each clip.
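For concreteness, the sketch below shows what a single Stage 2 training step could look like: RGBD videos are encoded by the frozen Stage 1 tokenizer, and the diffusion transformer is trained with a generic epsilon-prediction objective conditioned on the caption. The actual Cosmos diffusion formulation, noise schedule, and conditioning interface may differ; `dit`, `rgbd_tokenizer`, and `text_encoder` are placeholder interfaces.

```python
# Minimal sketch of a Stage 2 training step (generic epsilon-prediction diffusion);
# the real objective and conditioning used by Cosmos may differ.
import torch
import torch.nn.functional as F


def stage2_training_step(dit, rgbd_tokenizer, text_encoder, rgbd_video, caption,
                         num_timesteps: int = 1000):
    """One diffusion-transformer update on RGBD latent tokens with text conditioning."""
    with torch.no_grad():
        latent = rgbd_tokenizer.encode(rgbd_video)   # frozen Stage 1 tokenizer
        text_emb = text_encoder(caption)             # caption embedding for conditioning

    # Sample a noise level and corrupt the clean latents.
    t = torch.randint(0, num_timesteps, (latent.shape[0],), device=latent.device)
    noise = torch.randn_like(latent)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_timesteps) ** 2  # cosine schedule
    alpha_bar = alpha_bar.view(-1, *([1] * (latent.dim() - 1)))
    noisy_latent = alpha_bar.sqrt() * latent + (1 - alpha_bar).sqrt() * noise

    # The transformer predicts the added noise, conditioned on timestep and caption.
    pred = dit(noisy_latent, t, text_emb)
    return F.mse_loss(pred, noise)
```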