This figure details the steps in our data collection and labelling pipeline. We generate pseudo-labels on 2D RGB images and project them onto the 3D point cloud using the camera pose and intrinsics, thereby leveraging the robustness of well-established 2D object detection and instance segmentation architectures (DETR, SAM) through knowledge distillation.
Real world data collection and labelling pipeline:
Step 1: Collect image sequences from RGB cameras along with the corresponding 360° LiDAR scans.
Step 2: Perform object detection on individual RGB frames with “Detection Transformer (DETR)”.
Step 3: Using each object’s bounding box as a prompt, we apply the “Segment Anything Model (SAM)” to obtain an instance-level mask for every detected object.
Step 4: Using the camera intrinsics and extrinsics (pose), we map each 2D label to its corresponding point in 3D space, yielding a labelled 3D point cloud for each frame.
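Step 4 amounts to projecting every LiDAR point into the image plane and reading off the instance id of the pixel it lands on. The sketch below illustrates this under simplifying assumptions (a single pinhole camera with intrinsics `K` and a LiDAR-to-camera transform `T_cam_lidar`); the function name and interface are illustrative, not taken from our pipeline code.

```python
import numpy as np

def label_points_from_mask(points, mask, K, T_cam_lidar):
    """Transfer a 2D instance mask to 3D LiDAR points.
    points: (N, 3) LiDAR points; mask: (H, W) integer instance-id image.
    Returns an (N,) array of instance ids (0 = unlabelled)."""
    H, W = mask.shape
    # Transform LiDAR points into the camera frame (homogeneous coords).
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    labels = np.zeros(len(points), dtype=mask.dtype)
    in_front = cam[:, 2] > 0  # keep only points in front of the camera
    # Perspective projection with intrinsics K, then divide by depth.
    uv = (K @ cam[in_front].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).round().astype(int)
    # Discard projections that fall outside the image.
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    idx = np.flatnonzero(in_front)[valid]
    labels[idx] = mask[uv[valid, 1], uv[valid, 0]]
    return labels
```

In practice one would also handle occlusion (a point behind an object can still project inside its mask); this sketch keeps only the geometric core of the label transfer.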
We propose a 2D range-image-based approach to instance segmentation of 3D point clouds.
A range image is a 2D representation of a 3D point cloud obtained via spherical projection, as depicted in the figure. We choose this approach for the efficiency of working with range images and to leverage existing single-shot architectures for 2D instance segmentation.
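The spherical projection can be sketched as follows: each point’s azimuth and elevation angles are binned into image columns and rows, and the pixel stores the point’s range. The resolution and ±15° vertical field of view below match the VLP-16 setup described here; the function name and the tie-breaking rule (closest point wins per pixel) are our own illustrative choices. Keeping a pixel-to-point index map also makes the later reprojection of labels back to 3D a simple lookup.

```python
import numpy as np

def point_cloud_to_range_image(points, rows=16, cols=200,
                               fov_up=15.0, fov_down=-15.0):
    """Spherically project an (N, 3) point cloud to a rows x cols range
    image. Returns the range image and an index map (pixel -> original
    point index, -1 where the pixel is empty)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                                  # azimuth, [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-9), -1, 1))  # elevation
    fu, fd = np.radians(fov_up), np.radians(fov_down)
    u = ((yaw + np.pi) / (2 * np.pi) * cols).astype(int) % cols
    v = np.clip(((fu - pitch) / (fu - fd) * rows).astype(int), 0, rows - 1)
    range_img = np.zeros((rows, cols))
    index_map = np.full((rows, cols), -1, dtype=int)
    # Write farther points first so the closest point in a pixel wins.
    order = np.argsort(-r)
    range_img[v[order], u[order]] = r[order]
    index_map[v[order], u[order]] = order
    return range_img, index_map
```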
The figure below depicts two videos composed of range images from two different trajectories. The resolution of each frame is 16×200, where each of the 16 rows corresponds to one of the 16 laser channels of the VLP-16 LiDAR.
Training and inference pipeline for real time 3D instance segmentation of point clouds:
Step 1: Since a single VLP-16 scan yields a very sparse point cloud, we aggregate it with the 10 previous scans to obtain a denser cloud.
Step 2: We then convert this dense point cloud to its corresponding range image.
Step 3: This range image is the input to our instance segmentation model, “SOLOv2”, which outputs a labelled range image.
Step 4: The labelled range image is then reprojected back into 3D space to obtain the corresponding labelled point cloud.
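The aggregation in Step 1 can be sketched as transforming each past scan into the frame of the latest scan using the sensor poses and concatenating the results. The function below is a minimal illustration assuming poses are given as 4×4 world-from-sensor matrices (e.g. from odometry or SLAM); the name and interface are ours, not from the actual pipeline.

```python
import numpy as np

def aggregate_scans(scans, poses):
    """scans: list of (N_i, 3) point clouds, oldest first; poses: list of
    4x4 world-from-sensor transforms, one per scan. Returns all points
    expressed in the frame of the latest scan, stacked into one cloud."""
    T_latest_inv = np.linalg.inv(poses[-1])  # world -> latest sensor frame
    out = []
    for pts, T in zip(scans, poses):
        pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous
        # sensor_i -> world -> latest sensor frame
        out.append((T_latest_inv @ T @ pts_h.T).T[:, :3])
    return np.vstack(out)
```

With a 10 Hz LiDAR, aggregating the current scan with the 10 previous ones covers roughly one second of motion, so pose accuracy over that window directly bounds the blur in the densified cloud.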
Kirillov, Alexander, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, et al. “Segment Anything.” arXiv preprint arXiv:2304.02643 (2023).
Carion, Nicolas, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. “End-to-End Object Detection with Transformers.” In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I, pp. 213–229. Springer International Publishing, 2020.
Wang, Xinlong, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. “SOLOv2: Dynamic and Fast Instance Segmentation.” Advances in Neural Information Processing Systems 33 (2020): 17721–17732.