Spring Progress - 3D Kitchen Understanding

Here is what we have done in the spring of 2023. Our presentation for the paper survey is here.

Survey over Existing Datasets

GMU-Kitchen is a dataset consisting of 9 RGB-D kitchen video sequences with object annotations. It lacks some details, and does not contain any human activity.

ScanNet is a dataset of over 1600 indoor scenes with detailed annotations.

There are also a couple of other datasets, but in they contain either only static scenes or no 3D information. A table here shows the difference:

Dataset	3D Information	Human Activity	Kitchen
Epic Kitchen	✘	✔	✔
GMU Kitchen	✔	✘	✔
MSR Action3D	✔	✔	✘
ScanNet	✔	✘	✘

Comparison of Some Existing Datasets

Camera Setup

We have built a miniature synchronized multi-camera capture system in our lab and a digital twin in Mujoco to test our calibration and scene understanding algorithms; we plan to replicate the system in a real kitchen later.

We then use AprilTags to define our world coordinates and calibrate camera extrinsics.

Scene Segmentation

Used SAM to perform image segmentation
Transfer the results from pixels to 3D points using camera pose

Human Pose Estimation

We use SAM and ViTPose to segment humans and estimate poses.

Future Work

Our next steps will be as follows:

Semantic understanding of 3D kitchen scenes.
Literature survey of visual imitation learning algorithms.
Build a pipeline that allows a robot to learn generalizable manipulation skills consequently helping them to perform tasks by watching demonstration videos.