Dataset
The primary dataset we use for training is BridgeData V2 [1], a real-world robotic manipulation dataset collected with a single-arm manipulator via teleoperation and scripted policy rollouts.
Input data consists of sequences of the following (a sample layout is sketched after this list):
- Third-person RGB video frames: recorded from a fixed camera viewpoint
- End-effector state of the robot: the end-effector position and orientation, plus the gripper open/close status
- Task description: a natural-language description of the high-level goal of the episode
- Robot action: the relative change in end-effector pose at each timestep
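
A minimal sketch of one timestep's sample, assuming a simple dictionary layout; the field names, shapes, and the example task string are illustrative assumptions, not the actual BridgeData V2 schema:

```python
from typing import TypedDict
import numpy as np

class BridgeSample(TypedDict):
    """One timestep of an episode; keys and shapes are illustrative only."""
    rgb: np.ndarray        # (H, W, 3) third-person camera frame
    eef_state: np.ndarray  # (7,) end-effector position (3), orientation (3), gripper (1)
    task: str              # natural-language goal of the episode
    action: np.ndarray     # (6,) relative change in end-effector pose

sample: BridgeSample = {
    "rgb": np.zeros((256, 256, 3), dtype=np.uint8),
    "eef_state": np.zeros(7, dtype=np.float32),
    "task": "put the carrot in the pot",  # hypothetical example instruction
    "action": np.zeros(6, dtype=np.float32),
}
```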


Training Objective
For Stage 1, we train the action decoder to infer the actions associated with the input video sequence, so that the predicted actions follow the robot's trajectory. Inspired by latent diffusion models [2], we train our diffusion action decoder with an MSE loss on the predicted noise in the denoising process.
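
A minimal sketch of this noise-prediction objective, assuming a DDPM-style formulation; the schedule, tensor shapes, and the `action_decoder` call signature are assumptions for illustration, not the project's actual implementation:

```python
import torch
import torch.nn.functional as F

def make_linear_schedule(num_timesteps: int = 1000) -> torch.Tensor:
    # Cumulative product of (1 - beta_t), used to noise the clean actions.
    betas = torch.linspace(1e-4, 2e-2, num_timesteps)
    return torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(action_decoder, actions, video_features, alphas_cumprod):
    """MSE between the injected noise and the decoder's predicted noise.

    actions:        (B, T, action_dim) ground-truth action sequence
    video_features: conditioning features from the video encoder (assumed)
    """
    B = actions.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=actions.device)
    noise = torch.randn_like(actions)

    # Forward (noising) process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
    a_bar = alphas_cumprod.to(actions.device)[t].view(B, 1, 1)
    noisy_actions = a_bar.sqrt() * actions + (1.0 - a_bar).sqrt() * noise

    # The decoder predicts the injected noise, conditioned on video features and t.
    pred_noise = action_decoder(noisy_actions, t, video_features)
    return F.mse_loss(pred_noise, noise)
```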
Stage 1 Results
The results from the action decoder training show that the actions predicted from the video sequence follow the ground-truth trajectory of the robot. Actions are represented as a 6-DoF end-effector pose, with translation and rotation as continuous values expressed in the frame of the manipulator arm's base. We also plan to incorporate a binary gripper state into the action representation.
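
A minimal sketch of this action representation, assuming a 3-D translation plus a 3-D rotation (e.g., axis-angle) and the planned gripper flag; the field names and rotation parameterization are assumptions for illustration:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class EndEffectorAction:
    """Illustrative 6-DoF action in the manipulator's base frame."""
    translation: np.ndarray   # (3,) delta x, y, z
    rotation: np.ndarray      # (3,) delta rotation, e.g. axis-angle (assumed)
    gripper_open: float = 1.0  # planned binary open/close flag

    def to_vector(self) -> np.ndarray:
        # Flatten to a 7-D vector: 6 continuous DoF + 1 gripper dimension.
        return np.concatenate([self.translation, self.rotation, [self.gripper_open]])
```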

