This page summarizes our work during the spring of 2023. Our presentation for the paper survey is here.
Survey of Existing Datasets
GMU-Kitchen is a dataset consisting of 9 RGB-D kitchen video sequences with object annotations. It lacks some details and contains no human activity.
![GMU-Kitchen](https://mscvprojects.ri.cmu.edu/f23team8/wp-content/uploads/sites/85/2023/05/gmukitchen.png)
ScanNet is a dataset of over 1600 indoor scenes with detailed annotations.
![](https://mscvprojects.ri.cmu.edu/f23team8/wp-content/uploads/sites/85/2023/05/scannet.png)
There are also a couple of other datasets, but they contain either only static scenes or no 3D information. The table below shows the differences:
| Dataset | 3D Information | Human Activity | Kitchen |
| --- | --- | --- | --- |
| Epic Kitchen | ✘ | ✔ | ✔ |
| GMU Kitchen | ✔ | ✘ | ✔ |
| MSR Action3D | ✔ | ✔ | ✘ |
| ScanNet | ✔ | ✘ | ✘ |
Camera Setup
We have built a miniature synchronized multi-camera capture system in our lab, along with a digital twin in MuJoCo, to test our calibration and scene understanding algorithms. We plan to replicate the system in a real kitchen later.
![](https://mscvprojects.ri.cmu.edu/f23team8/wp-content/uploads/sites/85/2023/05/camera_setup.png)
We then use AprilTags to define our world coordinates and calibrate camera extrinsics.
![](https://mscvprojects.ri.cmu.edu/f23team8/wp-content/uploads/sites/85/2023/05/image-1024x713.png)
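As a rough illustration of how a tag can define the world frame, the sketch below turns a tag pose observed by a camera into that camera's pose in the world. It assumes a tag detector has already produced the 4x4 transform from tag coordinates to camera coordinates. The names `invert_rigid`, `camera_pose_in_world`, `T_cam_from_tag`, and `T_world_from_tag` are hypothetical and not taken from our code base.

```python
import numpy as np

def invert_rigid(T):
    """Invert a 4x4 rigid transform [R t; 0 1] using R^T instead of a general inverse."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv

def camera_pose_in_world(T_cam_from_tag, T_world_from_tag=None):
    """Return T_world_from_cam given the tag pose seen by the camera.

    T_cam_from_tag maps tag coordinates to camera coordinates (assumed to
    come from a tag detector). If T_world_from_tag is None, the tag frame
    itself is taken as the world frame.
    """
    if T_world_from_tag is None:
        T_world_from_tag = np.eye(4)
    return T_world_from_tag @ invert_rigid(T_cam_from_tag)

# Example: a tag 2 m straight ahead of the camera, with no rotation.
T_cam_from_tag = np.eye(4)
T_cam_from_tag[:3, 3] = [0.0, 0.0, 2.0]
T_world_from_cam = camera_pose_in_world(T_cam_from_tag)
print(T_world_from_cam[:3, 3])  # camera position expressed in the tag/world frame
```

The extrinsic matrix in the usual world-to-camera sense is the inverse of the returned pose. When every camera sees the same tag, this places all cameras in one shared frame.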
Scene Segmentation
- Use SAM to perform image segmentation
- Transfer the results from pixels to 3D points using the camera pose
![](https://mscvprojects.ri.cmu.edu/f23team8/wp-content/uploads/sites/85/2023/05/kitchen_0-1.png)
![](https://mscvprojects.ri.cmu.edu/f23team8/wp-content/uploads/sites/85/2023/05/kitchen_0_seg-1.png)
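The pixel-to-3D transfer described above can be sketched as a back-projection. This is a minimal example under stated assumptions, not our exact implementation: it assumes a per-pixel metric depth map is available, a pinhole camera model with intrinsics `K`, and a camera-to-world pose `T_world_from_cam`. The function name `backproject_labels` and all parameter names are hypothetical.

```python
import numpy as np

def backproject_labels(depth, labels, K, T_world_from_cam):
    """Lift labeled pixels to labeled 3D points in the world frame.

    depth:  (H, W) metric depth along the camera z axis; 0 marks invalid pixels
    labels: (H, W) integer segment ids (for example, one id per mask)
    K:      (3, 3) pinhole intrinsics
    T_world_from_cam: (4, 4) camera pose in the world frame
    Returns (N, 3) world points and (N,) labels for the valid pixels.
    """
    v, u = np.nonzero(depth > 0)             # row (v) and column (u) of valid pixels
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]          # (u - cx) * z / fx
    y = (v - K[1, 2]) * z / K[1, 1]          # (v - cy) * z / fy
    pts_cam = np.stack([x, y, z], axis=1)
    R, t = T_world_from_cam[:3, :3], T_world_from_cam[:3, 3]
    pts_world = pts_cam @ R.T + t            # rotate, then translate into the world
    return pts_world, labels[v, u]
```

Each returned point carries the segment id of the pixel it came from, so points from several calibrated cameras can be merged into one labeled point cloud.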
Human Pose Estimation
We use SAM and ViTPose to segment humans and estimate poses.
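ViTPose is a top-down pose estimator, so it operates on person crops. One plausible way to connect the two models, shown here only as an assumption about the glue step, is to derive a bounding box from each human mask and pass that crop to the pose estimator. The helper `mask_to_bbox` and its `pad` parameter are hypothetical.

```python
import numpy as np

def mask_to_bbox(mask, pad=0):
    """Return (x_min, y_min, x_max, y_max) enclosing the True pixels of a 2D mask.

    The box is expanded by `pad` pixels on each side and clipped to the image.
    Returns None for an empty mask.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    h, w = mask.shape
    x0 = max(int(xs.min()) - pad, 0)
    y0 = max(int(ys.min()) - pad, 0)
    x1 = min(int(xs.max()) + pad, w - 1)
    y1 = min(int(ys.max()) + pad, h - 1)
    return x0, y0, x1, y1

# Example: a small rectangular "person" mask in a 5x6 image.
mask = np.zeros((5, 6), dtype=bool)
mask[1:3, 2:5] = True
print(mask_to_bbox(mask))
```

A small padding margin can help keep limbs near the mask boundary inside the crop.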
Future Work
Our next steps will be as follows:
- Semantic understanding of 3D kitchen scenes.
- Literature survey of visual imitation learning algorithms.
- A pipeline that lets a robot learn generalizable manipulation skills, and thus perform tasks, by watching demonstration videos.