Project Overview

This project aims to understand kitchens in 3D so that robots can learn by watching video demonstrations.

The first part of the project, data collection, includes the following components: building a 3D kitchen capture system, reconstructing a real-world kitchen, capturing 3D recordings of kitchen tasks, and developing rich 3D annotations. Here is an illustration from the project pitch:

In the spring semester, we developed a synchronized multi-camera capture system. The extrinsic pose of each camera is estimated by detecting AprilTags. Here is our miniature camera setup, which we are going to move to the actual kitchen:

We can then jointly capture a 3D video with the cameras from different viewpoints:

We also use off-the-shelf models including ViTPose for human pose estimation:

In the next semester, we will continue the project by introducing semantic understanding of 3D kitchen scenes and building a pipeline for visual imitation learning.

Spring Progress

Here is what we have done in the spring of 2023. Our presentation for the paper survey is here.

Survey over Existing Datasets

GMU-Kitchen is a dataset consisting of 9 RGB-D kitchen video sequences with object annotations. It lacks some details and does not contain any human activity.


ScanNet is a dataset of over 1600 indoor scenes with detailed annotations.

There are also a couple of other datasets, but they contain either only static scenes or no 3D information. The table below shows the differences:

Dataset    3D Information    Human Activity    Kitchen
Epic Kitchen
GMU Kitchen
MSR Action3D
Comparison of Some Existing Datasets

Camera Setup

We have built a miniature synchronized multi-camera capture system in our lab, along with a digital twin in MuJoCo, to test our calibration and scene understanding algorithms. We plan to replicate the system in a real kitchen later.

We then use AprilTags to define our world coordinates and calibrate camera extrinsics.
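
As a minimal sketch of this calibration step, assume a tag detector has already returned the pose of one AprilTag in a camera's frame as a rotation R and translation t (so that x_cam = R x_tag + t), and assume the world frame is taken to be the tag frame. The camera's pose in the world is then the inverse of that rigid transform. The function names and example numbers below are illustrative only, and plain Python lists are used to keep the sketch dependency-free.

```python
# Sketch: recover a camera's pose in the world frame from one AprilTag detection.
# Assumption: the world frame coincides with the tag frame, and the detector
# reports (R, t) such that x_cam = R @ x_tag + t.

def transpose(m):
    """Transpose a 3x3 matrix stored as nested lists."""
    return [[m[j][i] for j in range(3)] for i in range(3)]

def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def invert_rigid(R, t):
    """Invert a rigid transform: (R, t) -> (R^T, -R^T t)."""
    R_inv = transpose(R)
    t_inv = [-x for x in mat_vec(R_inv, t)]
    return R_inv, t_inv

def camera_pose_in_world(R_cam_tag, t_cam_tag):
    """Return (R_world_cam, t_world_cam); t_world_cam is the camera center."""
    return invert_rigid(R_cam_tag, t_cam_tag)

# Hypothetical detection: tag 2 m in front of the camera, rotated 90 degrees about z.
R_cam_tag = [[0.0, -1.0, 0.0],
             [1.0,  0.0, 0.0],
             [0.0,  0.0, 1.0]]
t_cam_tag = [0.0, 0.0, 2.0]

R_world_cam, t_world_cam = camera_pose_in_world(R_cam_tag, t_cam_tag)
```

Running the same inversion for every camera that sees the tag places all cameras in one shared world frame. With several tags or noisy detections, the per-tag estimates would need to be combined, which this sketch does not attempt.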

Scene Segmentation

  • Use SAM (Segment Anything Model) to segment each image
  • Transfer the resulting masks from pixels to 3D points using the camera poses
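
The two steps above can be sketched as follows, assuming a pinhole camera with intrinsics fx, fy, cx, cy, a world-to-camera pose (R, t), and a per-pixel label mask such as one derived from SAM output. Each 3D point is projected into the image and takes the label of the pixel it lands on. All names and numbers here are hypothetical, and lens distortion and occlusion handling are omitted.

```python
# Sketch: transfer per-pixel segmentation labels to 3D points by projection.
# Assumptions: pinhole model without distortion; (R, t) maps world -> camera,
# i.e. x_cam = R @ x_world + t; mask[v][u] holds an integer label per pixel.

def project_point(p_world, R, t, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates; None if behind the camera."""
    x, y, z = (sum(R[i][k] * p_world[k] for k in range(3)) + t[i] for i in range(3))
    if z <= 0:
        return None
    return fx * x / z + cx, fy * y / z + cy

def label_points(points, mask, R, t, fx, fy, cx, cy, unlabeled=-1):
    """Assign each 3D point the mask label of the pixel it projects onto."""
    height, width = len(mask), len(mask[0])
    labels = []
    for p in points:
        uv = project_point(p, R, t, fx, fy, cx, cy)
        if uv is None:
            labels.append(unlabeled)
            continue
        u, v = int(round(uv[0])), int(round(uv[1]))
        inside = 0 <= u < width and 0 <= v < height
        labels.append(mask[v][u] if inside else unlabeled)
    return labels

# Toy example: 4x4 mask whose right half carries label 1.
mask = [[0, 0, 1, 1] for _ in range(4)]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # camera at world origin
t = [0.0, 0.0, 0.0]
points = [(-0.5, 0.0, 1.0), (0.5, 0.0, 1.0), (0.0, 0.0, -1.0)]
labels = label_points(points, mask, R, t, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

In this toy example the three points receive the labels 0, 1, and -1 (the last point lies behind the camera). With multiple calibrated views, the per-view labels for a point could be merged, for instance by majority vote, though that choice is an assumption rather than something fixed by the steps above.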

Human Pose Estimation

We use SAM and ViTPose to segment humans and estimate poses.

Future Work

Our next steps will be as follows:

  • Semantic understanding of 3D kitchen scenes.
  • Literature survey of visual imitation learning algorithms.
  • Build a pipeline that allows robots to learn generalizable manipulation skills, and thereby perform tasks, by watching demonstration videos.

Team Members

David Held (Supervisor)

David Held is an assistant professor at Carnegie Mellon University in the Robotics Institute and is the director of the RPAD lab: Robots Perceiving And Doing. His research focuses on perceptual robot learning, i.e., developing new methods at the intersection of robot perception and planning for robots to learn to interact with novel, perceptually challenging, and deformable objects. David has applied these ideas to robot manipulation and autonomous driving. Prior to coming to CMU, David was a post-doctoral researcher at U.C. Berkeley, and he completed his Ph.D. in Computer Science at Stanford University. David also has a B.S. and M.S. in Mechanical Engineering from MIT. David is a recipient of the Google Faculty Research Award in 2017 and the NSF CAREER Award in 2021.

Ben Eisner (Supervisor)

Ben Eisner is a Machine Learning and Robotics researcher. He builds robotic systems that learn to interact with the unstructured world.

He is currently a 3rd-Year Ph.D. student in the Robotics Institute at Carnegie Mellon University. He is a member of the Robots Perceiving and Doing Lab, led by Prof. David Held. Right now, Ben is working on some techniques for policy transfer. His research is supported in part by the NSF Graduate Research Fellowship.

Here is a recent version of his academic cv, his Google Scholar profile, and his Github profile.

Tianwen Fu

Tianwen Fu is a master's student in Computer Vision at Carnegie Mellon University. He received his bachelor's degree in science from the Chinese University of Hong Kong, where he worked with Prof. Chi-wing Fu in graphics. From 2020 to 2021, he worked as a research intern at SenseTime, where he was supervised by Prof. Jifeng Dai on AutoML for vision tasks.

Achleshwar Luthra

Achleshwar Luthra is a graduate student at the Robotics Institute, Carnegie Mellon University, enrolled in the MSCV program. Achleshwar completed his undergraduate studies at BITS Pilani, majoring in Electrical and Electronics Engineering. Previously, he was an Undergraduate Research Intern at the Computer Vision and Robotics Lab, UIUC, under Prof. Narendra Ahuja, where he worked on 3D reconstruction of quadruped animals. Earlier, he had the privilege of working in Prof. Jitendra Malik's group at UC Berkeley on single-view 3D reconstruction of inanimate objects.

During his undergrad, Achleshwar worked with Prof. Kamlesh Tiwari at AI Lab @ BITS Pilani in the domain of Activity Recognition in Videos and with Prof. Pratik Narang on using lightweight deep models for image restoration.