We include 2D and 3D pose estimation as part of our data collection pipeline. Human pose is useful for recovering trajectory and interaction information, which serves as a good prior for imitation frameworks.
We also include hand-pose estimation and segmentation.
Our data contains posed RGB-D images that can provide us with point clouds. Therefore, for scene-level annotations, we have focused on collecting 3D segmentation masks for point clouds.
Leveraging 2D Foundation Models For 3D Scene Segmentation
Modern 2D segmentation foundation models, enabled by transfer learning and large-scale datasets, generate high-quality object masks. 3D segmentation models have not yet reached comparable quality. Because most 3D data are collected by sensors that produce point clouds from RGB-D images, we aim to leverage 2D foundation models, which take RGB images as input and yield segmentation masks, and to develop an algorithm that combines those masks with depth information and scene geometry to facilitate 3D scene segmentation.
Each view has corresponding color and depth images, from which we can obtain a point cloud and 2D-to-3D correspondences. Using the camera matrix of the next view, we can then find 2D-to-2D correspondences between the current view and the next view.
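The correspondence step can be illustrated with a minimal NumPy sketch. It assumes a pinhole camera with intrinsics `K`, a 4x4 camera-to-world pose per view, and depth in metric units with 0 marking missing measurements. The function names (`backproject`, `project`, `correspondences`) are hypothetical and are not taken from the project's code. The sketch also skips the occlusion check against the next view's depth map that a full pipeline would likely need.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift every pixel with valid depth to a 3D world point (2D-to-3D).

    Returns (pixels, points): pixels is (N, 2) integer (u, v) coordinates and
    points is (N, 3) world coordinates; row i of one matches row i of the other.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    valid = depth > 0  # assumption: 0 encodes a missing depth reading
    u, v, z = u[valid], v[valid], depth[valid].astype(float)
    # Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return np.stack([u, v], axis=1), pts_world

def project(points_world, K, world_to_cam, image_shape):
    """Project world points into a view; returns pixels and an in-view mask."""
    h, w = image_shape
    homo = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (world_to_cam @ homo.T).T[:, :3]
    z = pts_cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by zero
    u = K[0, 0] * pts_cam[:, 0] / safe_z + K[0, 2]
    v = K[1, 1] * pts_cam[:, 1] / safe_z + K[1, 2]
    pix = np.round(np.stack([u, v], axis=1)).astype(int)
    inside = (in_front & (pix[:, 0] >= 0) & (pix[:, 0] < w)
              & (pix[:, 1] >= 0) & (pix[:, 1] < h))
    return pix, inside

def correspondences(depth_a, K_a, pose_a, K_b, pose_b, shape_b):
    """2D-to-2D matches: pixels of view A and where they land in view B."""
    pix_a, pts = backproject(depth_a, K_a, pose_a)
    pix_b, ok = project(pts, K_b, np.linalg.inv(pose_b), shape_b)
    return pix_a[ok], pix_b[ok]
```

Row `i` of the two returned arrays forms one correspondence, so a segmentation mask in the current view can be looked up pixel by pixel in the next view.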
The unsupervised foundation models do not provide semantic IDs for the masks they generate. Moreover, due to occlusion, a single object might be split into two separate masks.
To determine how masks should be merged across views, we formulate a graph as shown in the diagram below. We decide the edges between two nodes based on the 2D-to-2D correspondences. After updating all the masks in all the views, we arrive at a final graph, and all connected nodes in that graph are assigned the same ID.
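The graph-based merging can be sketched as a connected-components problem. The sketch below is a hedged illustration rather than the project's implementation. It assumes each node is a `(view, mask_id)` pair and that pixel correspondences between two views are given as lists of `(u, v)` pairs. The edge criterion is also an assumption, since the text does not specify one: an edge is added when at least `min_overlap` of a mask's corresponded pixels land in a single mask of the other view. All function names are hypothetical.

```python
from collections import Counter, defaultdict

class UnionFind:
    """Disjoint-set structure used to track connected components."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def mask_edges(labels_a, labels_b, pix_a, pix_b, view_a, view_b,
               min_overlap=0.5):
    """Propose edges between masks of two views from pixel correspondences.

    labels_a / labels_b are per-pixel mask ids indexed as [v][u];
    negative ids mean "no mask". min_overlap is an assumed threshold.
    """
    votes = defaultdict(Counter)
    for (ua, va), (ub, vb) in zip(pix_a, pix_b):
        ma, mb = labels_a[va][ua], labels_b[vb][ub]
        if ma >= 0 and mb >= 0:
            votes[ma][mb] += 1
    edges = []
    for ma, counter in votes.items():
        total = sum(counter.values())
        for mb, n in counter.items():
            if n / total >= min_overlap:
                edges.append(((view_a, ma), (view_b, mb)))
    return edges

def assign_global_ids(nodes, edges):
    """Give every node in the same connected component the same ID."""
    uf = UnionFind()
    for node in nodes:
        uf.find(node)  # register isolated nodes too
    for a, b in edges:
        uf.union(a, b)
    roots = {}
    return {node: roots.setdefault(uf.find(node), len(roots)) for node in nodes}
```

Because IDs come from connected components, two masks that never overlap directly still share an ID if both are linked to the same mask in another view. This is one way an object split by occlusion could be reunited.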
This project aims to understand kitchens in 3D so that robots can learn by watching video demonstrations. Humans are very good at watching demonstration videos and following instructions such as cooking recipes, but robots haven’t yet reached the level where they can understand the 3D scene around them just by looking at 2D videos. Robots with such abilities could significantly impact the automation industry, well beyond kitchens and restaurants. We have found that the main bottleneck in accomplishing this task is training robots to learn generalizable manipulation skills, because of the variation among similar objects and the lack of real-world datasets. Our goal in this project is to develop a data capture system that yields a variety of human-level and scene-level annotations without relying on an expert for labeling; these annotations can then act as priors for robot learning algorithms.
Develop a multi-camera 3D kitchen capture system
Investigate algorithms for human-level annotations like human body pose and hand pose
Develop an unsupervised 3D scene segmentation algorithm leveraging foundation models
Among related datasets available online, we found GMU Kitchen and ScanNet the most relevant to our task of scene understanding; however, neither offers 3D information, human activity, and a kitchen environment simultaneously. Therefore, we are building a pipeline to collect our own dataset in a kitchen environment that includes both 3D information and human activities.