Spring Progress

Here is what we have done in the spring of 2023. Our presentation for the paper survey is here.

Survey over Existing Datasets

GMU-Kitchen is a dataset consisting of 9 RGB-D kitchen video sequences with object annotations. It lacks some details, and does not contain any human activity.


ScanNet is a dataset of over 1600 indoor scenes with detailed annotations.

There are also a couple of other datasets, but in they contain either only static scenes or no 3D information. A table here shows the difference:

Dataset3D InformationHuman ActivityKitchen
Epic Kitchen
GMU Kitchen
MSR Action3D
Comparison of Some Existing Datasets

Camera Setup

We have built a miniature synchronized multi-camera capture system in our lab and a digital twin in Mujoco to test our calibration and scene understanding algorithms; we plan to replicate the system in a real kitchen later.

We then use AprilTags to define our world coordinates and calibrate camera extrinsics.

Scene Segmentation

  • Used SAM to perform image segmentation
  • Transfer the results from pixels to 3D points using camera pose

Human Pose Estimation

We use SAM and ViTPose to segment humans and estimate poses.

Future Work

Our next steps will be as follows:

  • Semantic understanding of 3D kitchen scenes.
  • Literature survey of visual imitation learning algorithms.
  • Build a pipeline that allows a robot to learn generalizable manipulation skills consequently helping them to perform tasks by watching demonstration videos.