Here is a summary of what we did in the spring of 2023. Our presentation for the paper survey is here.
Survey of Existing Datasets
GMU-Kitchen is a dataset of 9 RGB-D kitchen video sequences with object annotations; however, its annotations are relatively sparse, and it contains no human activity.
ScanNet is a dataset of over 1,600 indoor RGB-D scans with detailed 3D semantic annotations.
There are also a few other datasets, but they contain either only static scenes or no 3D information. The table below summarizes the differences:
| Dataset | 3D Information | Human Activity | Kitchen |
| --- | --- | --- | --- |
| Epic Kitchen | ✘ | ✔ | ✔ |
| GMU Kitchen | ✔ | ✘ | ✔ |
| MSR Action3D | ✔ | ✔ | ✘ |
| ScanNet | ✔ | ✘ | ✘ |
Camera Setup
We have built a miniature synchronized multi-camera capture system in our lab, along with a digital twin in MuJoCo, to test our calibration and scene-understanding algorithms; we plan to replicate the system in a real kitchen later.
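To give a flavor of the digital twin, here is a minimal sketch using the official `mujoco` Python bindings: two fixed cameras render the same simple scene, standing in for our synchronized rig. The XML, camera placements, and resolution are illustrative assumptions, not our actual model.

```python
# Minimal sketch of a multi-camera digital twin in MuJoCo.
# The scene XML and camera poses below are placeholders for illustration.
import mujoco

XML = """
<mujoco>
  <worldbody>
    <light pos="0 0 3"/>
    <geom type="plane" size="2 2 0.1"/>
    <body pos="0 0 0.1">
      <geom type="box" size="0.05 0.05 0.05" rgba="0.8 0.3 0.3 1"/>
    </body>
    <camera name="cam0" pos="1 0 0.5" xyaxes="0 1 0 -0.4 0 1"/>
    <camera name="cam1" pos="0 1 0.5" xyaxes="-1 0 0 0 -0.4 1"/>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)
renderer = mujoco.Renderer(model, height=480, width=640)

mujoco.mj_forward(model, data)        # compute poses before rendering
frames = {}
for cam in ("cam0", "cam1"):
    renderer.update_scene(data, camera=cam)
    frames[cam] = renderer.render()   # H x W x 3 uint8 image from each camera
```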
We then use AprilTags to define the world coordinate frame and to calibrate the camera extrinsics.
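As a rough sketch of this step, the snippet below estimates one camera's extrinsics from a single AprilTag that defines the world origin. It assumes the `pupil_apriltags` package, placeholder intrinsics, and a placeholder tag size; our real pipeline may differ in details.

```python
# Sketch: camera extrinsics from one AprilTag that defines the world frame.
# Intrinsics and tag size below are placeholder values, not calibrated ones.
import numpy as np
import cv2
from pupil_apriltags import Detector

TAG_SIZE = 0.10                                  # printed tag edge length in meters (assumed)
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0      # placeholder pinhole intrinsics

detector = Detector(families="tag36h11")

def camera_extrinsics(image_bgr):
    """Return (R_wc, t_wc) mapping world (tag) coordinates into the camera frame."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    detections = detector.detect(
        gray,
        estimate_tag_pose=True,
        camera_params=(FX, FY, CX, CY),
        tag_size=TAG_SIZE,
    )
    if not detections:
        return None
    det = detections[0]                          # use the first detected tag as the world origin
    return det.pose_R, det.pose_t                # tag pose expressed in the camera frame

# The camera pose in world coordinates is the inverse transform:
#   R_cw = R_wc.T,  t_cw = -R_wc.T @ t_wc
```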
Scene Segmentation
- Used SAM to perform image segmentation.
- Transferred the segmentation results from pixels to 3D points using the depth maps and camera poses (see the sketch below).
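The sketch below illustrates the pixel-to-3D lifting step: a segmentation mask is back-projected through the depth image with the camera intrinsics, then transformed into the world frame using the extrinsics from the AprilTag step. Function and variable names are illustrative, not our pipeline code.

```python
# Sketch: lift a 2D segmentation mask into world-frame 3D points.
import numpy as np

def mask_to_world_points(mask, depth, K, R_wc, t_wc):
    """mask: HxW bool, depth: HxW in meters, K: 3x3 intrinsics,
    (R_wc, t_wc): world-to-camera rotation (3x3) and translation (3,)."""
    v, u = np.nonzero(mask & (depth > 0))        # pixel rows/cols inside the mask with valid depth
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]              # back-project to the camera frame
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=1)        # N x 3 points in the camera frame
    # p_world = R_wc.T @ (p_cam - t_wc), written in row-vector form below
    pts_world = (pts_cam - t_wc.reshape(1, 3)) @ R_wc
    return pts_world
```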
Human Pose Estimation
We use SAM to segment humans and ViTPose to estimate their poses.
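As an illustration of the segmentation half, the snippet below prompts SAM with a person bounding box (e.g. from an off-the-shelf person detector) to obtain a human mask. The checkpoint path, image path, and box values are assumptions; the ViTPose keypoint step is run separately on the same box.

```python
# Sketch: prompting SAM with a person bounding box to get a human mask.
# Checkpoint, image path, and box coordinates are placeholders.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

person_box = np.array([100, 50, 400, 600])       # x1, y1, x2, y2 from a person detector
masks, scores, _ = predictor.predict(box=person_box, multimask_output=False)
human_mask = masks[0]                            # H x W boolean mask of the person
```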
Future Work
Our next steps will be as follows:
- Semantic understanding of 3D kitchen scenes.
- Literature survey of visual imitation learning algorithms.
- Build a pipeline that allows a robot to learn generalizable manipulation skills, so that it can perform tasks by watching demonstration videos.