Real-time instance-level semantic segmentation of 3D point clouds in indoor environments
Proposed Method
Data Pipeline
The figure below details the steps in our data collection and labelling pipeline. We generate pseudo-labels on 2D RGB images and project them onto the 3D point cloud using the camera pose and intrinsics. In doing so, we leverage the robustness of well-established 2D object detection and instance segmentation architectures (DETR [2], SAM [1]) through knowledge distillation; a minimal sketch of this 2D labelling stage follows the figure.
2D instance-segmentation-based pipeline for 3D point cloud labelling
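As a concrete sketch of this 2D labelling stage, the snippet below runs DETR through the HuggingFace transformers API and prompts SAM with the resulting boxes via the official segment-anything package. The checkpoint names, frame path, and score threshold are placeholder assumptions, not fixed choices of our pipeline.

```python
import numpy as np
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection
from segment_anything import sam_model_registry, SamPredictor

image = Image.open("frame_000.png").convert("RGB")   # placeholder frame path

# Object detection with DETR.
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
detr = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
with torch.no_grad():
    outputs = detr(**processor(images=image, return_tensors="pt"))
det = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=[image.size[::-1]])[0]

# Box-prompted instance masks with SAM.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(image))

instance_masks = np.zeros(np.array(image).shape[:2], dtype=np.int32)
for inst_id, box in enumerate(det["boxes"], start=1):
    masks, _, _ = predictor.predict(box=box.numpy(), multimask_output=False)
    instance_masks[masks[0]] = inst_id   # masks[0] is a boolean (H, W) mask
```

The result is a single integer image per frame in which pixel value k marks the k-th detected instance, which is the form consumed by the projection step below.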
Real-world data collection and labelling pipeline:
Step 1: Collect image sequences with RGB cameras together with the corresponding 360-degree LiDAR scans.
Step 2: Perform object detection on individual RGB frames with the Detection Transformer (DETR) [2].
Step 3: Using each detected object's bounding box as a query, obtain instance-level masks with the Segment Anything Model (SAM) [1].
Step 4: Using the camera extrinsics and pose, map each 2D label to its corresponding point in 3D space, yielding a labelled 3D point cloud for each frame.
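A minimal sketch of the Step 4 projection, assuming the LiDAR points are expressed in the world frame and that the 4×4 world-to-camera extrinsic transform and 3×3 intrinsic matrix K are known; the function name is illustrative, and the instance_masks layout follows the SAM snippet above.

```python
import numpy as np

def transfer_mask_labels(points_world, K, T_world_to_cam, instance_masks):
    """Assign 2D instance ids to 3D points by pinhole projection.

    points_world:   (N, 3) points in the world frame.
    K:              (3, 3) camera intrinsic matrix.
    T_world_to_cam: (4, 4) extrinsic transform (world -> camera).
    instance_masks: (H, W) int image, 0 = background, k = instance k.
    Returns an (N,) label array; unprojected points keep label 0.
    """
    labels = np.zeros(len(points_world), dtype=np.int32)

    # Move the points into the camera frame.
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (T_world_to_cam @ pts_h.T).T[:, :3]

    # Keep only points safely in front of the camera.
    idx = np.where(pts_cam[:, 2] > 0.1)[0]
    uvw = (K @ pts_cam[idx].T).T

    # Perspective divide to pixel coordinates.
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)

    H, W = instance_masks.shape
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels[idx[ok]] = instance_masks[v[ok], u[ok]]
    return labels
```

A full implementation would also need an occlusion check, for example comparing each point's depth against a rendered depth buffer, so that points hidden behind a detected object are not mislabelled.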
Proposed Solution
We propose a 2D range-image-based approach to instance segmentation of 3D point clouds.
A range image is a 2D representation of a 3D point cloud obtained by spherical projection, as depicted in the figure below. We choose this representation because range images are efficient to process and let us leverage existing single-shot architectures for 2D instance segmentation; a minimal sketch of the projection follows the figure.
Point Cloud and corresponding Range Image example
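The sketch below shows the spherical projection, assuming the VLP-16's roughly ±15° vertical field of view and the 16×200 image size used in the videos below; the exact angles and the function name are illustrative.

```python
import numpy as np

def point_cloud_to_range_image(points, n_rows=16, n_cols=200,
                               fov_up_deg=15.0, fov_down_deg=-15.0):
    """Spherically project an (N, 3) point cloud to an (n_rows, n_cols) range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)                    # range per point

    yaw = np.arctan2(y, x)                                # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-8), -1.0, 1.0))

    fov_up = np.deg2rad(fov_up_deg)
    fov = np.deg2rad(fov_up_deg - fov_down_deg)

    # Normalise the angles to pixel coordinates. Real VLP-16 rings are
    # discrete, so this uniform elevation binning is an approximation.
    cols = ((yaw + np.pi) / (2 * np.pi) * n_cols).astype(int) % n_cols
    rows = np.clip(((fov_up - pitch) / fov * n_rows).astype(int), 0, n_rows - 1)

    # Keep the nearest return when several points land in one pixel.
    img = np.zeros((n_rows, n_cols), dtype=np.float32)
    order = np.argsort(-r)                                # far points first,
    img[rows[order], cols[order]] = r[order]              # near points overwrite
    return img
```

Keeping the per-point (rows, cols) indices alongside the image makes the reprojection in Step 4 of the inference pipeline below a simple lookup.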
The figure below shows two videos composed of range images from two different trajectories. Each video frame has a resolution of 16×200, where each row corresponds to one of the 16 laser channels of the VLP-16 LiDAR.
Range image videos for two separate trajectories in indoor environments
Training and inference pipeline for real-time 3D instance segmentation of point clouds:
Step 1: Since a single VLP-16 LiDAR scan yields a very sparse point cloud, we aggregate the 10 previous scans to densify it.
Step 2: We convert this dense point cloud into its corresponding range image.
Step 3: The range image is the input to our instance segmentation model, SOLOv2 [3], which outputs a labelled range image.
Step 4: The labelled 2D range image is reprojected back into 3D to obtain the corresponding labelled point cloud (a sketch of the aggregation and reprojection steps follows the figure).
Proposed Training and Inference Pipeline
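A sketch of Steps 1 and 4, under the assumption that a pose is available for every scan (e.g. from the robot's odometry) and that the (row, col) pixel indices of each point were kept from the spherical projection above; aggregate_scans and labels_from_range_image are illustrative names, not our actual API.

```python
import numpy as np

def aggregate_scans(scans, poses, T_world_to_current):
    """Step 1: merge previous scans into the current LiDAR frame.

    scans: list of (N_i, 3) point arrays (e.g. the 10 previous scans).
    poses: list of 4x4 scan-to-world poses, one per scan.
    T_world_to_current: 4x4 inverse pose of the current scan.
    """
    merged = []
    for pts, T_scan_to_world in zip(scans, poses):
        pts_h = np.hstack([pts, np.ones((len(pts), 1))])
        merged.append((T_world_to_current @ T_scan_to_world @ pts_h.T).T[:, :3])
    return np.vstack(merged)

def labels_from_range_image(label_img, rows, cols):
    """Step 4: look up each point's instance id in the labelled range image.

    rows/cols are the pixel indices every point received during the
    spherical projection (Step 2), so reprojection is a direct lookup.
    """
    return label_img[rows, cols]
```

Because both functions are simple array transforms, the per-frame cost is dominated by the SOLOv2 forward pass, which is what makes the overall pipeline amenable to real-time use.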
References
[1] Kirillov, Alexander, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao et al. “Segment anything.” arXiv preprint arXiv:2304.02643 (2023).
[2] Carion, Nicolas, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. “End-to-end object detection with transformers.” In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pp. 213-229. Springer International Publishing, 2020.
[3] Wang, Xinlong, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. “SOLOv2: Dynamic and fast instance segmentation.” Advances in Neural Information Processing Systems 33 (2020): 17721-17732.