Architecture - Latent Diffusion Models for 3D-aware Multi-modal Video Generation

The overall architecture of the Cosmos Video Generation Model. It consists of two main parts: 1) an encoder-decoder pair that converts video between the RGB space and the latent space; 2) a latent diffusion model that generates video in the latent space.
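The two-part data flow can be illustrated with a toy sketch. Everything here is a stand-in and not the Cosmos implementation: the compression factors `T_FACTOR` and `S_FACTOR`, the average-pooling `encode`, the nearest-neighbour `decode`, and the placeholder `denoise_step` are all assumptions chosen only to show where each component sits.

```python
import numpy as np

# Hypothetical compression factors; the source does not state them.
T_FACTOR, S_FACTOR = 4, 8


def encode(video):
    """Stand-in encoder: average-pool a (T, H, W, C) video into a smaller latent."""
    t, h, w, c = video.shape
    blocks = video.reshape(
        t // T_FACTOR, T_FACTOR,
        h // S_FACTOR, S_FACTOR,
        w // S_FACTOR, S_FACTOR,
        c,
    )
    return blocks.mean(axis=(1, 3, 5))


def decode(latent):
    """Stand-in decoder: nearest-neighbour upsample back to RGB resolution."""
    out = latent.repeat(T_FACTOR, axis=0)
    out = out.repeat(S_FACTOR, axis=1)
    return out.repeat(S_FACTOR, axis=2)


def denoise_step(latent):
    """Placeholder for one pass of the diffusion network (assumption):
    a real model would predict and remove noise; here we only shrink it."""
    return 0.9 * latent


def generate(video_shape, num_steps=10, seed=0):
    """Sample noise in latent space, denoise iteratively, then decode to RGB."""
    t, h, w, c = video_shape
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal((t // T_FACTOR, h // S_FACTOR, w // S_FACTOR, c))
    for _ in range(num_steps):
        latent = denoise_step(latent)
    return decode(latent)
```

The point of the split is that the iterative denoising loop runs on the small latent tensor, and the decoder is applied only once at the end.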

The Cosmos video encoder model. It consists of a stack of 3D convolutional modules and causal attention modules.
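As a rough illustration of what "causal" means for the attention modules, the sketch below assumes the causality is along the time axis, so that each frame attends only to itself and earlier frames. The function name `causal_self_attention`, the one-vector-per-frame input, and the identity query/key/value projections are simplifications introduced here; the 3D convolutional modules are omitted.

```python
import numpy as np


def causal_self_attention(x):
    """Causal attention over frames.

    x: array of shape (T, D), one feature vector per frame.
    Uses identity projections for queries, keys, and values (assumption).
    """
    t, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    # True above the diagonal marks future frames, which must be hidden.
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x
```

With this mask, changing a later frame leaves the outputs for all earlier frames unchanged, which is the property that lets a causal encoder process frames in order.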

The Cosmos latent diffusion model. Each denoising step passes the latent through a stack of self-attention, cross-attention, and MLP layers.
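A minimal sketch of one such block follows, assuming residual connections, a ReLU MLP, and single-head attention. Normalization layers, timestep conditioning, and positional embeddings are left out, and the names `denoiser_block`, `init_params`, and `cond` (the conditioning tokens used for cross-attention, for example a text embedding) are hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def attention(queries, context, wq, wk, wv):
    """Single-head attention: queries come from one sequence, keys/values from another."""
    q, k, v = queries @ wq, context @ wk, context @ wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


def denoiser_block(tokens, cond, p):
    """Self-attention, then cross-attention to the condition, then an MLP."""
    tokens = tokens + attention(tokens, tokens, *p["self"])
    tokens = tokens + attention(tokens, cond, *p["cross"])
    hidden = np.maximum(tokens @ p["w1"], 0.0)  # ReLU (assumption)
    return tokens + hidden @ p["w2"]


def init_params(dim, cond_dim, rng, scale=0.02):
    """Random weights for one block; shapes are illustrative only."""
    def w(rows, cols):
        return scale * rng.standard_normal((rows, cols))
    return {
        "self": (w(dim, dim), w(dim, dim), w(dim, dim)),
        "cross": (w(dim, dim), w(cond_dim, dim), w(cond_dim, dim)),
        "w1": w(dim, 4 * dim),
        "w2": w(4 * dim, dim),
    }


def denoise_once(tokens, cond, blocks):
    """One denoising step: pass the latent tokens through every stacked block."""
    for p in blocks:
        tokens = denoiser_block(tokens, cond, p)
    return tokens
```

For example, with `rng = np.random.default_rng(0)` and `blocks = [init_params(16, 8, rng) for _ in range(3)]`, calling `denoise_once` on a `(10, 16)` token array with an `(5, 8)` condition returns an array of the same `(10, 16)` shape, so the step can be repeated across the sampling schedule.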