Human-level Annotations

We include 2D and 3D human pose estimation in our data collection pipeline. Human pose provides trajectory and interaction information, which serves as a good prior for imitation frameworks.

We also include hand-pose estimation and segmentation.

Scene-level Annotations

Our data contains posed RGB-D images that can provide us with point clouds. Therefore, for scene-level annotations, we have focused on collecting 3D segmentation masks for point clouds.

Leveraging 2D Foundation Models For 3D Scene Segmentation

Modern 2D segmentation foundation models, enabled by transfer learning and large-scale datasets, generate high-quality object masks. 3D segmentation models have not yet reached comparable quality. Because most 3D data is collected by sensors that produce point clouds from RGB-D images, we aim to leverage 2D foundation models, which take RGB images as input and yield segmentation masks. We develop an algorithm that combines those masks with depth information and scene geometry to facilitate 3D scene segmentation.

Finding Correspondences

Each view has corresponding color and depth images, from which we obtain a point cloud and 2D-to-3D correspondences. Using the camera matrix of the next view, we project these 3D points to find 2D-to-2D correspondences between the current view and the next view.
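
The two steps above can be sketched with NumPy as follows. This is a minimal illustration, not the project's implementation: it assumes a pinhole camera with a 3x3 intrinsic matrix `K`, depth in metres, and 4x4 camera-to-world poses. The function names `backproject` and `project` are hypothetical.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift every pixel of a depth image to a 3D world point.

    Returns an (H, W, 3) array, so pixel (v, u) maps to points[v, u];
    this is the 2D-to-3D correspondence.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)  # (H, W, 3)
    rays = pix @ np.linalg.inv(K).T        # camera-frame rays with z = 1
    pts_cam = rays * depth[..., None]      # scale each ray by its depth
    return pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]

def project(points_world, K, cam_to_world):
    """Project world points into a view; returns pixel coordinates and depth."""
    world_to_cam = np.linalg.inv(cam_to_world)
    pts_cam = points_world @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    z = pts_cam[..., 2]
    uvw = pts_cam @ K.T
    with np.errstate(divide="ignore", invalid="ignore"):
        uv = uvw[..., :2] / uvw[..., 2:3]  # perspective division
    return uv, z
```

Chaining the two, `project(backproject(depth_a, K, T_a), K, T_b)` gives, for every pixel of view A, its location in view B. A correspondence is only usable where the returned depth is positive and the pixel lands inside the image. Comparing the returned depth against view B's own depth image is one way to reject occluded points.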

Graph-Based Merging

The unsupervised foundation models do not provide semantic IDs for the generated masks. Moreover, due to occlusion, a single object might be split into two masks.

To determine how masks are merged across views, we formulate a graph in which each node is a mask, as shown in the diagram below. We decide the edges between nodes based on the 2D-to-2D correspondences. After updating all the masks in all the views, we reach a final graph, and all connected nodes in it are assigned the same ID.
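
The final step, assigning one ID per connected component, can be sketched with a union-find structure. The edge rule here is an assumption made for illustration, since the text does not specify one: two masks are linked when the number of corresponding pixels between them is at least `min_overlap` times the size of the smaller mask. The names `merge_masks`, `pixel_matches`, and `mask_sizes` are hypothetical.

```python
from collections import Counter

def merge_masks(pixel_matches, mask_sizes, min_overlap=0.5):
    """Assign a shared ID to masks connected through cross-view overlaps.

    pixel_matches: list of (node_a, node_b) pairs, one per corresponding
                   pixel, where a node is a (view, mask_id) tuple.
    mask_sizes:    dict mapping each node to its pixel count.
    """
    parent = {node: node for node in mask_sizes}

    def find(node):
        # Follow parent links to the component root, shortening the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Count how many corresponding pixels link each pair of masks.
    votes = Counter(pixel_matches)
    for (a, b), n in votes.items():
        # Assumed edge rule: enough of the smaller mask is covered.
        if n / min(mask_sizes[a], mask_sizes[b]) >= min_overlap:
            parent[find(a)] = find(b)

    # Connected nodes share a root, so they receive the same ID.
    ids = {}
    return {node: ids.setdefault(find(node), len(ids)) for node in mask_sizes}
```

Because union-find only tracks connectivity, masks from views that never overlap directly are still merged when a chain of intermediate views links them.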


Before merging:

After merging:

Output merged point cloud:

Data Collection Setup

Camera Setup

We have built a miniature synchronized multi-camera capture system in our lab for the scope of this project – this system can be replicated in a real kitchen later.

We also developed a digital twin in Mujoco to test our calibration and scene understanding algorithms.

Mujoco Setup

Camera Calibration

  • Intrinsics are calibrated by the factory for Azure Kinect cameras
  • Extrinsics are estimated by solving a Procrustes problem over corresponding points
  • Anchor points are obtained by AprilTag detectors
AprilTag detectors
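
A rigid Procrustes problem over corresponding 3D points has a closed-form solution via SVD (the Kabsch algorithm). The sketch below assumes the anchor points are AprilTag points measured in the camera frame (`src`) and the same points expressed in the world frame (`dst`). Both the function name `rigid_procrustes` and this choice of inputs are illustrative assumptions, not the project's code.

```python
import numpy as np

def rigid_procrustes(src, dst):
    """Least-squares rotation R and translation t with dst ≈ src @ R.T + t.

    src, dst: (N, 3) arrays of corresponding points, N >= 3, not collinear.
    """
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    # 3x3 cross-covariance of the centred point sets.
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t
```

With `src` in the camera frame and `dst` in the world frame, `R` and `t` form the camera-to-world extrinsic for that camera.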



This project aims to understand kitchens in 3D to enable robots to learn by watching video demonstrations. Humans are very good at watching demonstration videos and imitating instructions such as cooking recipes, but robots cannot yet understand the 3D scene around them just by looking at 2D videos. Robots with such abilities could significantly impact the automation industry, beyond kitchens and restaurants. We have found that the main bottleneck is training robots to learn generalizable manipulation skills, because similar objects vary widely and real-world datasets are lacking. Our goal is to develop a data capture system that yields a variety of human-level and scene-level annotations without relying on an expert for labeling; these annotations can then act as priors for robot learning algorithms.

Research Objectives

  • Develop a multi-camera 3D kitchen capture system
  • Investigate algorithms for human-level annotations like human body pose and hand pose
  • Develop an unsupervised 3D scene segmentation algorithm leveraging foundation models


Among related datasets on the Internet, we found GMU Kitchen and ScanNet the most relevant to our task of scene understanding; however, no existing dataset combines 3D information, human activity, and kitchen environments. Therefore, we are building a pipeline to collect our own dataset in a kitchen environment that includes both 3D information and human activities.

Dataset | 3D Information | Human Activity | Kitchen
Epic Kitchen
GMU Kitchen
MSR Action3D
Comparison of Some Existing Datasets
Sample scene from GMU Kitchen
Sample scene from ScanNet

Project Overview

This project aims to understand kitchens in 3D to enable robots to learn by watching video demonstrations.

The first part of the project, data collection, includes the following components: building a 3D kitchen capture system, reconstructing a real-world kitchen, capturing 3D recordings of kitchen tasks, and developing rich 3D annotations. Here is an illustration from the project pitch:

In the spring semester, we developed a synchronized multi-camera capture system. The extrinsic pose of each camera is estimated by detecting AprilTags. Here is our miniature camera setup, which we are going to move to the actual kitchen:

We can then capture a 3D video jointly with the cameras from different views:

We also use off-the-shelf models including ViTPose for human pose estimation:

Next semester, we will continue the project by introducing semantic understanding of 3D kitchen scenes and building a visual imitation learning pipeline.

Spring Progress

Here is what we have done in the spring of 2023. Our presentation for the paper survey is here.

Survey over Existing Datasets

GMU-Kitchen is a dataset consisting of 9 RGB-D kitchen video sequences with object annotations. It lacks some details, and does not contain any human activity.


ScanNet is a dataset of over 1600 indoor scenes with detailed annotations.

There are also a few other datasets, but they contain either only static scenes or no 3D information. The table below shows the differences:

Dataset | 3D Information | Human Activity | Kitchen
Epic Kitchen
GMU Kitchen
MSR Action3D
Comparison of Some Existing Datasets

Camera Setup

We have built a miniature synchronized multi-camera capture system in our lab and a digital twin in Mujoco to test our calibration and scene understanding algorithms; we plan to replicate the system in a real kitchen later.

We then use AprilTags to define our world coordinates and calibrate camera extrinsics.

Scene Segmentation

  • Use SAM to perform image segmentation
  • Transfer the results from pixels to 3D points using the camera pose
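
One way to transfer per-pixel mask IDs to 3D points is to project each point into the image and read the ID at that pixel. The sketch below assumes a pinhole intrinsic matrix `K`, a 4x4 camera-to-world pose, and an integer ID image `mask` (for example, built from SAM's output). The function name `label_points` is hypothetical, and the sketch performs no occlusion test.

```python
import numpy as np

def label_points(points_world, mask, K, cam_to_world):
    """Give each 3D point the mask ID of the pixel it projects to.

    points_world: (N, 3) array; mask: (H, W) integer ID image.
    Points behind the camera or outside the image receive -1.
    """
    world_to_cam = np.linalg.inv(cam_to_world)
    pts_cam = points_world @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    labels = np.full(len(points_world), -1, dtype=int)

    in_front = pts_cam[:, 2] > 0
    uvw = pts_cam[in_front] @ K.T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)

    h, w = mask.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    # Map back from the filtered subset to indices in the full point list.
    idx = np.flatnonzero(in_front)[inside]
    labels[idx] = mask[v[inside], u[inside]]
    return labels
```

Because there is no occlusion test, a point hidden behind another surface still receives the ID of whatever is visible at that pixel. Comparing the point's camera-frame depth with the depth image would filter such cases.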

Human Pose Estimation

We use SAM and ViTPose to segment humans and estimate poses.

Future Work

Our next steps will be as follows:

  • Semantic understanding of 3D kitchen scenes.
  • Literature survey of visual imitation learning algorithms.
  • Build a pipeline that allows robots to learn generalizable manipulation skills, and thereby to perform tasks by watching demonstration videos.


Tianwen Fu

Tianwen Fu is a master's student in Computer Vision at Carnegie Mellon University. He received his bachelor's degree in science from the Chinese University of Hong Kong, where he worked with Prof. Chi-wing Fu on graphics. From 2020 to 2021, he was a research intern at SenseTime, supervised by Prof. Jifeng Dai, working on AutoML for vision tasks.

Achleshwar Luthra

Achleshwar Luthra is a master's student in Computer Vision at the Robotics Institute, Carnegie Mellon University. He completed his undergraduate studies at BITS Pilani, majoring in Electrical and Electronics Engineering. Prior to CMU, he did research internships at UIUC and UC Berkeley on 3D reconstruction from images and videos.

Ben Eisner (Supervisor)

Ben Eisner is a Machine Learning and Robotics researcher. He builds robotic systems that learn to interact with the unstructured world. He is currently a Ph.D. student in the Robotics Institute at Carnegie Mellon University. He is a member of the Robots Perceiving and Doing Lab, led by Prof. David Held.

David Held (Supervisor)

David Held is an assistant professor at Carnegie Mellon University in the Robotics Institute and is the director of the RPAD lab: Robots Perceiving And Doing. His research focuses on perceptual robot learning, i.e. developing new methods at the intersection of robot perception and planning for robots to learn to interact with novel, perceptually challenging, and deformable objects. David has applied these ideas to robot manipulation and autonomous driving.