This project aims at understanding kitchen in 3D to enable robot learning by watching video demonstrations.
The first part of the project, the data collection part, includes the following components: building a 3D kitchen capture system, reconstruct a real-world kitchen, capture 3D recordings of kitchen tasks, and developing rich 3D annotations. Here is an illustration from the project pitch:
![](https://mscvprojects.ri.cmu.edu/f23team8/wp-content/uploads/sites/85/2023/05/image-1-1024x457.png)
In the spring semester, we have developed a synchronized multi-camera capture system. The extrinsic pose of each camera is estimated by detecting AprilTags. Here is our miniature camera setup, which we are going to move to the actual kitchen:
![](https://mscvprojects.ri.cmu.edu/f23team8/wp-content/uploads/sites/85/2023/05/camera_setup-1.png)
We can then capture a 3D video jointly with the cameras from different views:
We also use off-the-shelf models including ViTPose for human pose estimation:
In the next semester, we will continue our project by introducing semantic understanding of 3D kitchen scenes and build pipeline of visual imitation learning.