Related work


Kimera [1] is a real-time solution for metric-semantic segmentation proposed by the MIT SPARK Lab. Metric-semantic understanding is the ability to estimate the 3D geometry of a scene and, at the same time, attach a semantic label to its objects and structures. Geometric information is critical for robots to navigate safely and manipulate objects, while semantic information provides the level of abstraction a robot needs to understand and execute human instructions.

The authors release the code as open source, together with a range of debugging, visualization, and benchmarking tools, which should encourage and accelerate research on real-time 3D metric-semantic segmentation. Kimera is intended as a solid basis for future metric-semantic SLAM and perception research, allowing researchers across multiple areas (e.g., VIO, SLAM, 3D reconstruction, segmentation) to benchmark and prototype their own work without starting from scratch.

Using only stereo RGB images and IMU data as input, Kimera builds a 3D metric-semantic mesh from semantically labeled images and runs in real time on a CPU.

This paper gave us an overview of the steps involved in building a metric-semantic mesh in real time. We are specifically interested in the 2D semantic segmentation and mesh generation stages. Kimera addresses SLAM as well as semantic mapping, whereas we need only the semantic mapping part, so its other modules are outside our scope. In addition, given our payload and compute constraints, we would like to run 2D image-based semantic segmentation on a single RGB frame.

Figure: Kimera architecture for real-time metric-semantic segmentation


SOLOv2 [2] is a simple, direct, and fast framework for instance segmentation in 2D images. It builds on an efficient, holistic instance mask representation that dynamically segments each instance in the image without resorting to bounding box detection. SOLOv2 significantly reduces inference time with a novel matrix non-maximum suppression (Matrix NMS) technique. The authors also propose learning adaptive, dynamic convolutional kernels for mask prediction, which leads to a much more compact and powerful design and yields better results.
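
The idea behind Matrix NMS can be illustrated with a short sketch. The Python below is a minimal, loop-based version of the linear-decay variant described in [2], not the authors' implementation (which computes the same quantities with matrix operations in parallel). Instead of discarding overlapping masks one at a time, it lowers each mask's score according to its overlap with higher-scoring masks. The function names, the flat 0/1 list mask format, and the omission of class labels are assumptions made for illustration.

```python
def mask_iou(a, b):
    """Intersection over union of two binary masks given as flat 0/1 lists."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0


def matrix_nms_linear(scores, masks):
    """Return decayed scores, in the same order as the input scores.

    Sketch of linear-decay Matrix NMS: the score of mask j is multiplied by
    min over higher-scoring masks i of (1 - iou[i][j]) / (1 - cmax[i]),
    where cmax[i] is the largest overlap of mask i with any mask scored
    above it. Class labels are ignored here (an assumption of this sketch).
    """
    # Process masks from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)

    # Upper-triangular IoU matrix in sorted order: iou[i][j] for i < j.
    iou = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            iou[i][j] = mask_iou(masks[order[i]], masks[order[j]])

    # cmax[i]: how strongly mask i overlaps any higher-scoring mask.
    cmax = [max((iou[k][i] for k in range(i)), default=0.0) for i in range(n)]

    decayed = list(scores)
    for j in range(n):
        decay = 1.0
        for i in range(j):
            denom = 1.0 - cmax[i]
            if denom <= 0.0:
                # Mask i duplicates a higher-scoring mask, so it is itself
                # fully suppressed and should not penalise mask j.
                continue
            decay = min(decay, (1.0 - iou[i][j]) / denom)
        decayed[order[j]] = scores[order[j]] * decay
    return decayed
```

In use, the decayed scores would typically be thresholded to pick the final instances. A mask that exactly duplicates a higher-scoring one has its score driven to zero, while a mask that overlaps nothing keeps its original score.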

We propose to use the SOLOv2 architecture to segment the 2D range-image representation of our point clouds. Because inference then runs on a 2D image rather than directly on the 3D points, this should allow the point cloud to be segmented in real time.
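
To make the range-image idea concrete, the following Python sketch shows one common way to build such an image, a spherical projection, and to carry per-pixel labels back to the 3D points. The projection model is an assumption, as are the image size, the vertical field of view, and the function names `project_to_range_image` and `lift_labels`; all are hypothetical placeholders rather than our final design.

```python
import math


def project_to_range_image(points, width=64, height=16,
                           fov_up_deg=15.0, fov_down_deg=-15.0):
    """Spherically project (x, y, z) points into a height x width range image.

    Returns (image, index): image[v][u] holds the range of the closest point
    falling in that pixel (0.0 if empty), and index[v][u] holds that point's
    position in `points` (-1 if empty).
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down

    image = [[0.0] * width for _ in range(height)]
    index = [[-1] * width for _ in range(height)]

    for idx, (x, y, z) in enumerate(points):
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # a point at the sensor origin has no direction
        yaw = math.atan2(y, x)       # horizontal angle in [-pi, pi]
        pitch = math.asin(z / r)     # vertical angle
        if not (fov_down <= pitch <= fov_up):
            continue  # outside the assumed vertical field of view

        # Map the angles to pixel coordinates and clamp to the image bounds.
        u = int(0.5 * (1.0 - yaw / math.pi) * width)
        v = int((1.0 - (pitch - fov_down) / fov) * height)
        u = min(max(u, 0), width - 1)
        v = min(max(v, 0), height - 1)

        # Keep only the closest point when several land in one pixel.
        if index[v][u] == -1 or r < image[v][u]:
            image[v][u] = r
            index[v][u] = idx
    return image, index


def lift_labels(index, pixel_labels, num_points, unlabeled=-1):
    """Copy each pixel's label back to the 3D point that produced the pixel."""
    point_labels = [unlabeled] * num_points
    for v, row in enumerate(index):
        for u, idx in enumerate(row):
            if idx != -1:
                point_labels[idx] = pixel_labels[v][u]
    return point_labels
```

Under this sketch, a 2D network would take `image` as input and its per-pixel output would be passed to `lift_labels`. Points that are occluded by a closer point in the same pixel, or that fall outside the field of view, stay unlabeled and would need a separate propagation step.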

Figure: SOLOv2 architecture for real-time 2D instance segmentation


[1] Rosinol, Antoni, Andrew Violette, Marcus Abate, Nathan Hughes, Yun Chang, Jingnan Shi, Arjun Gupta, and Luca Carlone. “Kimera: From SLAM to spatial perception with 3D dynamic scene graphs.” The International Journal of Robotics Research 40, no. 12-14 (2021): 1510-1546.

[2] Wang, Xinlong, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. “SOLOv2: Dynamic and fast instance segmentation.” Advances in Neural Information Processing Systems 33 (2020): 17721-17732.