Method

General Framework

The initial video frame is first tokenized into latent representations using a video tokenizer. These initial video latents, along with tokenized language instructions, are then passed into the video generation model, which predicts future video latents. The predicted latents are then detokenized to produce both the future video frames and the corresponding action sequences (e.g., joint angles and gripper states), enabling downstream robotic execution.
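For concreteness, the sketch below traces this data flow in PyTorch using toy placeholder modules. The class names (VideoTokenizer, VideoGenerator, ActionDetokenizer), tensor shapes, GRU rollout, and additive language conditioning are illustrative assumptions, not the actual architecture; the real system uses the Cosmos tokenizer and the models described in the two stages below.

import torch
import torch.nn as nn

class VideoTokenizer(nn.Module):
    """Placeholder tokenizer: maps frames (B, T, 3, 64, 64) to latents (B, T, D) and back."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.enc = nn.Linear(3 * 64 * 64, latent_dim)
        self.dec = nn.Linear(latent_dim, 3 * 64 * 64)

    def encode(self, frames):
        b, t = frames.shape[:2]
        return self.enc(frames.reshape(b, t, -1))

    def decode(self, latents):
        b, t = latents.shape[:2]
        return self.dec(latents).reshape(b, t, 3, 64, 64)

class VideoGenerator(nn.Module):
    """Placeholder generator: rolls out future latents from the initial latents
    and a language embedding."""
    def __init__(self, latent_dim=256, horizon=8):
        super().__init__()
        self.horizon = horizon
        self.rnn = nn.GRU(latent_dim, latent_dim, batch_first=True)

    def forward(self, init_latents, text_emb):
        x = init_latents + text_emb.unsqueeze(1)   # simple additive conditioning
        state, future = None, []
        for _ in range(self.horizon):
            x, state = self.rnn(x, state)
            future.append(x[:, -1:])
        return torch.cat(future, dim=1)            # (B, horizon, D)

class ActionDetokenizer(nn.Module):
    """Placeholder detokenizer: maps video latents to per-step actions
    (e.g., joint angles + gripper state). The diffusion-based version is
    sketched under Stage 1."""
    def __init__(self, latent_dim=256, action_dim=8):
        super().__init__()
        self.head = nn.Linear(latent_dim, action_dim)

    def forward(self, latents):
        return self.head(latents)                  # (B, horizon, action_dim)

tokenizer, generator, detokenizer = VideoTokenizer(), VideoGenerator(), ActionDetokenizer()
init_frame = torch.randn(1, 1, 3, 64, 64)          # initial camera frame
text_emb = torch.randn(1, 256)                     # embedded language instruction
init_latents = tokenizer.encode(init_frame)        # (1, 1, 256)
future_latents = generator(init_latents, text_emb) # (1, 8, 256)
future_frames = tokenizer.decode(future_latents)   # (1, 8, 3, 64, 64)
actions = detokenizer(future_latents)              # (1, 8, 8)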

Stage 1: Training Action Detokenizer

We design a diffusion-based action detokenizer built on a U-Net architecture, which predicts the noise added to an action vector during the forward diffusion process. The model takes as input a noisy action vector and the corresponding noise level (diffusion time step), and is trained to estimate that noise with a mean squared error (MSE) loss in action space.
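Concretely, assuming the standard epsilon-prediction (DDPM) parameterization, which the text above does not pin down explicitly, the objective can be written as

\mathcal{L} = \mathbb{E}_{a_0,\,\epsilon,\,t}\Big[ \big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, a_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t,\ z\big) \big\|^2 \Big]

where a_0 is the clean ground-truth action vector, \epsilon \sim \mathcal{N}(0, I) the injected noise, t the sampled diffusion step, \bar{\alpha}_t the cumulative noise schedule, and z the conditioning video latents described next.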

The U-Net is conditioned on the ground-truth video latents, which are encoded using the Cosmos video tokenizer. This conditioning enables the detokenizer to align its action predictions with the visual context captured in the video sequence.
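A minimal training-step sketch under these assumptions is shown below. NoisePredictionUNet is a small MLP stand-in for the actual U-Net (a real diffusion backbone would typically use sinusoidal time embeddings and cross-attention for conditioning), and the action dimensionality, horizon, and linear noise schedule are illustrative choices rather than values from our implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                                              # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)                 # linear noise schedule (assumed)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)         # cumulative \bar{alpha}_t

class NoisePredictionUNet(nn.Module):
    """Stand-in for the U-Net: predicts the noise added to the action
    sequence, conditioned on the time step and the video latents."""
    def __init__(self, action_dim=8, horizon=8, latent_dim=256, hidden=512):
        super().__init__()
        self.time_emb = nn.Embedding(T, hidden)
        self.cond_proj = nn.Linear(latent_dim, hidden)
        self.net = nn.Sequential(
            nn.Linear(horizon * action_dim + hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, noisy_actions, t, video_latents):
        b = noisy_actions.shape[0]
        cond = self.time_emb(t) + self.cond_proj(video_latents.mean(dim=1))
        x = torch.cat([noisy_actions.reshape(b, -1), cond], dim=-1)
        return self.net(x).reshape(b, self.horizon, self.action_dim)

model = NoisePredictionUNet()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def training_step(actions, video_latents):
    """actions: (B, horizon, action_dim) ground-truth action sequence.
    video_latents: (B, n_frames, latent_dim) from the video tokenizer."""
    b = actions.shape[0]
    t = torch.randint(0, T, (b,))                     # random diffusion step per sample
    eps = torch.randn_like(actions)                   # noise to be predicted
    ab = alpha_bar[t].view(b, 1, 1)
    noisy = ab.sqrt() * actions + (1 - ab).sqrt() * eps   # forward diffusion
    loss = F.mse_loss(model(noisy, t, video_latents), eps)  # epsilon-prediction MSE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch: 4 trajectories of 8 actions, each conditioned on 8 video latents.
loss = training_step(torch.randn(4, 8, 8), torch.randn(4, 8, 256))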

During inference, the model generates an action sequence by iteratively denoising a sample initialized from Gaussian noise, conditioned on the given video latents. This allows the model to produce temporally coherent and context-aware action trajectories suitable for robotic execution.
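The corresponding inference loop, again assuming plain DDPM ancestral sampling rather than a particular accelerated sampler, can be sketched as follows; in practice a faster sampler could replace the full reverse chain.

import torch

@torch.no_grad()
def sample_actions(model, video_latents, betas, horizon=8, action_dim=8):
    """Ancestral DDPM sampling: start from Gaussian noise and iteratively
    denoise, conditioned on the given video latents."""
    T = betas.shape[0]
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    b = video_latents.shape[0]
    a = torch.randn(b, horizon, action_dim)           # start from pure noise
    for t in reversed(range(T)):
        t_batch = torch.full((b,), t, dtype=torch.long)
        eps = model(a, t_batch, video_latents)        # predicted noise
        # Posterior mean of the reverse step (epsilon parameterization).
        coef = betas[t] / (1 - alpha_bar[t]).sqrt()
        a = (a - coef * eps) / alphas[t].sqrt()
        if t > 0:
            a = a + betas[t].sqrt() * torch.randn_like(a)   # add sampling noise
    return a                                          # (B, horizon, action_dim)

# Usage with the model and schedule from the training sketch above:
# actions = sample_actions(model, video_latents, betas)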

Stage 2: Fine-tuning Video Generation Model

Building on pretrained tokenizers, we aim to fine-tune the video generation model. The model will take as input an initial camera frame along with a language-based task description, and will generate subsequent video frames that represent the rollout of the robot’s policy.

This visual prediction captures the evolving scene dynamics and provides implicit supervision for deriving robot actions aligned with the intended manipulation task. To compare modeling capacity and temporal coherence, we plan to experiment with both autoregressive and diffusion-based video generation approaches for next-frame prediction, as sketched below.
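To make the two candidates concrete, the sketch below contrasts the per-step objectives we would train with: next-token cross-entropy over discrete latent tokens for the autoregressive variant, and epsilon-prediction MSE over continuous latents for the diffusion variant. The model signatures and the conditioning interface (initial-frame latents plus language embedding) are assumptions for illustration.

import torch
import torch.nn.functional as F

def autoregressive_loss(model, token_ids, cond):
    """Next-token prediction over discrete video latent tokens.
    token_ids: (B, L) codebook indices from the video tokenizer.
    cond: conditioning (initial-frame latents + language embedding)."""
    logits = model(token_ids[:, :-1], cond)           # (B, L-1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]),
        token_ids[:, 1:].reshape(-1),
    )

def diffusion_loss(model, latents, cond, alpha_bar):
    """Noise-prediction loss over continuous future video latents.
    latents: (B, n_frames, D) clean latents of the future frames."""
    b = latents.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (b,))
    eps = torch.randn_like(latents)
    ab = alpha_bar[t].view(b, 1, 1)
    noisy = ab.sqrt() * latents + (1 - ab).sqrt() * eps
    return F.mse_loss(model(noisy, t, cond), eps)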

Through this setup, we aim to evaluate how effectively such models can be adapted to downstream robotic control, and to compare their ability to capture physically plausible scene transitions conditioned on high-level goals.