Initially, we proposed a range image based approach for 3D instance segmentation, as it is highly efficient and runs in real time.
A range image is a 2D representation of a 3D point cloud obtained through spherical projection, as depicted in the figure. We chose this approach for the efficiency of operating on range images and to leverage existing single-shot architectures for 2D instance segmentation.
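The spherical projection can be sketched as follows. This is a minimal illustration, assuming the VLP-16's ±15° vertical field of view and a 16×200 grid; the function name and parameters are our own for illustration:

```python
import numpy as np

def to_range_image(points, h=16, w=200, fov_up=15.0, fov_down=-15.0):
    """Project an (N, 3) point cloud into an H x W range image
    via spherical projection (azimuth -> column, elevation -> row)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)             # range of each point
    yaw = np.arctan2(y, x)                         # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.maximum(r, 1e-8))     # elevation angle

    fov_r = np.deg2rad(fov_up - fov_down)          # total vertical FOV

    u = 0.5 * (1.0 - yaw / np.pi) * w              # column from azimuth
    v = (1.0 - (pitch - np.deg2rad(fov_down)) / fov_r) * h  # row from elevation

    u = np.clip(np.floor(u), 0, w - 1).astype(int)
    v = np.clip(np.floor(v), 0, h - 1).astype(int)

    img = np.zeros((h, w), dtype=np.float32)
    img[v, u] = r                                  # store range at each pixel
    return img
```

Each pixel stores the range of the point that lands there; in practice additional channels (intensity, x, y, z) are stacked the same way.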
In the figure below, we depict two videos composed of range images from two different trajectories. Each frame has a resolution of 16×200, where each row corresponds to one of the 16 laser channels of the VLP-16 LiDAR.
However, after some initial training, we observed that even a small calibration error leads to a mismatch between the 2D and 3D projections: pixel-level errors in 2D space translate into large errors in the 3D point cloud. We tried several clustering techniques in 3D to mitigate the issue, but some shadow artifacts persisted and continued to cause large errors. Thus, although real-time, range image based approaches are better suited for early fusion, providing cues to the model on where an object ‘might’ be present in 3D.
3D Object Detection using Lidar Point Clouds
We propose a 3D object detection framework to obtain a semantic understanding of the 3D scene around the robot from lidar scans. The rough pipeline for 3D object detection is as follows:
More concretely, our architecture for 3D object detection works as follows. We first convert the lidar point cloud into pillars, encode the pillars using a pillar-aware attention mechanism, and scatter the pillar encodings back to their original locations, yielding a BEV (bird's-eye-view) pseudo-image of the entire scene. We then use Mini-BiFPNs to extract a feature encoding from the pseudo-image and pass it to a detection head that predicts 3D bounding boxes.
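The pillarization and scatter-to-BEV steps can be illustrated with a minimal sketch. Here the learned pillar encoder (the pillar-aware attention mechanism above) is replaced by two hand-crafted per-pillar features for clarity; the grid extents and pillar size are illustrative assumptions, not our actual configuration:

```python
import numpy as np

def pillarize(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), pillar=0.25):
    """Group points into vertical pillars on the ground plane and scatter
    per-pillar features into a BEV pseudo-image of shape (C, H, W)."""
    nx = int((x_range[1] - x_range[0]) / pillar)   # grid width
    ny = int((y_range[1] - y_range[0]) / pillar)   # grid height

    ix = ((points[:, 0] - x_range[0]) / pillar).astype(int)
    iy = ((points[:, 1] - y_range[0]) / pillar).astype(int)
    mask = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    ix, iy, z = ix[mask], iy[mask], points[mask, 2]

    # Two toy channels: point count and max height per pillar. In the real
    # pipeline a learned encoding replaces these hand-crafted features.
    bev = np.zeros((2, ny, nx), dtype=np.float32)
    for cx, cy, cz in zip(ix, iy, z):
        bev[0, cy, cx] += 1.0
        bev[1, cy, cx] = max(bev[1, cy, cx], cz)
    return bev
```

The resulting pseudo-image can be fed to any 2D convolutional backbone, which is what makes pillar-based methods fast relative to 3D convolutions.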
Very few datasets provide 16-channel lidar scans with bounding box annotations. We test our approach on the following datasets:
The JRDB dataset contains annotations only for the person class and is predominantly used for pedestrian detection and action recognition in 3D. It was recorded with two VLP-16 lidars mounted at the top and bottom of a robot, combined into a 32-channel lidar scan per frame. As our robot will carry only a single VLP-16, we treat each individual lidar as a separate datapoint for training the model. The highlight of the JRDB dataset is that it contains lidar scans from indoor settings, which suit our use case well. In later sections, we also discuss how to extend the JRDB dataset, as well as data collected on our own robot, with pseudo annotations.
The CODa dataset is a recently released dataset that provides 3D object detection annotations for a large set of classes in both indoor and outdoor settings. It was recorded with a 128-channel lidar; we subsample the 128 channels down to 16/32 channels while maintaining the vertical FOV (-15° to +15°) of our VLP-16 lidar. We mainly train our models on the following indoor classes: person, chair, table, couch, dustbin.
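The channel subsampling can be sketched as follows: keep only points whose elevation angle falls close to one of 16 evenly spaced beams inside the VLP-16's vertical FOV. The 0.5° tolerance and evenly spaced beam model are simplifying assumptions for illustration (a real VLP-16 has fixed 2° beam spacing, which the linspace below reproduces):

```python
import numpy as np

def subsample_channels(points, dst_channels=16, fov_up=15.0, fov_down=-15.0):
    """Reduce a dense lidar scan to `dst_channels` beams by elevation angle,
    emulating a VLP-16's -15 to +15 degree vertical field of view."""
    r = np.linalg.norm(points[:, :3], axis=1)
    pitch = np.rad2deg(np.arcsin(points[:, 2] / np.maximum(r, 1e-8)))

    # Target beam elevations: 16 beams, 2 degrees apart for a VLP-16
    beams = np.linspace(fov_down, fov_up, dst_channels)

    in_fov = (pitch >= fov_down) & (pitch <= fov_up)
    nearest = np.min(np.abs(pitch[:, None] - beams[None, :]), axis=1)
    keep = in_fov & (nearest < 0.5)   # 0.5 deg tolerance (assumption)
    return points[keep]
```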
In this figure, we detail the steps of our data collection and labelling pipeline. We generate pseudo labels on 2D RGB images and project them onto the 3D point cloud using the camera pose and intrinsics, leveraging the robustness of well-established 2D object detection and instance segmentation architectures (DETR, SAM) through knowledge distillation.
Real world data collection and labelling pipeline:
Step 1: Collect image sequences through RGB cameras and the corresponding 360 degree LiDAR scan.
Step 2: Perform object detection on individual RGB frames with “Detection Transformer (DETR)”.
Step 3: Using the bounding box information from each object as a query, we use “Segment Anything Model (SAM)” for obtaining the instance level masks for each of the detected objects.
Step 4: Using the camera intrinsics and pose, we map each 2D label to its corresponding point in 3D space, yielding a labelled dataset of 3D point clouds for each frame.
In this work, we provide a solution for extracting semantic information from lidar point clouds as sparse as 16 channels in indoor settings. We test how well current pillar-based 3D object detection networks learn from VLP-16 lidar data for indoor classes. Due to the lack of annotated VLP-16 lidar data, we also provide a framework to pseudo-annotate 3D point clouds, assuming the availability of corresponding 2D images and calibrations.
 Kirillov, Alexander, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao et al. “Segment anything.” arXiv preprint arXiv:2304.02643 (2023).
 Carion, Nicolas, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. “End-to-end object detection with transformers.” In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pp. 213-229. Springer International Publishing, 2020.
 Wang, Xinlong, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. “SOLOv2: Dynamic and fast instance segmentation.” Advances in Neural Information Processing Systems 33 (2020): 17721-17732.
 Lang, Alex H., Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. “PointPillars: Fast encoders for object detection from point clouds.” In CVPR, 2019.
 Le, Duy Tho, Hengcan Shi, Hamid Rezatofighi, and Jianfei Cai. “Accurate and real-time 3D pedestrian detection using an efficient attentive pillar network.” IEEE Robotics and Automation Letters 8, no. 2 (2022): 1159-1166.