Technical report

Related Work:

2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds

In autonomous driving, cameras provide dense color information and fine-grained texture, but they are quite unreliable in low light conditions. LiDARs could offer accurate and wide-ranging depth information regardless of lighting variances but only capture sparse and textureless data. As camera and LiDAR sensors capture complementary information, it would be essential to conduct semantic segmentation through multi-modality data fusion. Existing fusion-based approaches require paired input data in both training and inference stages, which is not practical in most cases due to the difference of field of views between cameras and LiDars. It’s computational cost is also high as fusion-based models process both images and point clouds at runtime. A general training scheme was introduced as 2DPASS, by leveraging a multi-scale auxiliary modal fusion and knowledge distillation, to acquire richer semantic and structural information from the multimodal data.

The upper part is the general model. It first crops a small patch from the original camera image as the 2D input. Then the cropped image patch and LiDAR point cloud pass through the 2D and 3D encoders independently to generate multi-scale features. For each scale, we go through this MSFSKD which stands for multi-scale fusion-tosingle knowledge distillation process which is shown here. The modality fusion is first adopted to enhance multi-modality feature. And then, the enhanced feature promotes the 3D representation through the uni-directional Modality-Preserving KD to get the 3D Predictions. And after this part, the feature maps are used to generate the final semantic scores using modal-specific decoders, which are supervised by pure 3D labels.

Cross-modal Learning for Domain Adaptation in 3D Semantic Segmentation

The cross-modal learning for domain adaptation paper was the inspiration of the 2DPASS method. The same as the 2DPASS paper, the cross-modal learning framework also aims to take advantage of the domain gap differences between cameras and Lidars. In their model, a 2D and a 3D network take an image and a point cloud as inputs respectively and predict their own 3D segmentation labels, during which process the 2D predictions are uplifted to 3D. And then the crossmodal learning enforces consistency between the 2D and 3D predictions via mutual mimicking, which is beneficial for domain adaptation in both unsupervised and semi-supervised learning. And that leads to the main topic of this paper, is to constrain the network to make correct predictions on labeled data and consistent predictions across modalities on unlabeled target-domain data, which closely aligns our project that aims to give accurate segmentations on the rare conditions that may not exist in the training dataset.

The below part is the architecture of the method. There are two independent network streams: a 2D stream (in red) which takes an image as input and uses a U-Net-style 2D ConvNet, as well as a 3D stream (in blue) which takes a point cloud as input and uses a U-Net-Style 3D SparseConvNet. Then, the 3D points that have labels are projected into the image and the 2D features are sampled at the corresponding pixel locations. The four segmentation outputs consist of the main predictions and the 2D mimicry predictions, which are transferred across modalities using KL divergences to use the 2D mimicry prediction to estimate the main 3D prediction.

Methods and Experiments

To cope with the challenge that online-predicted low-conf points is not matching offline-trained low-metric classes, we use the following model to match the road conditions with the ground truth we have. As the pointcloud points are extremely imbalanced between road our target point and other points, we utilized methods to balance each head with dynamic weight average and leverage semantic scene information to introduce GT objects in a more realistic position.

After applying our baseline model 2DPASS to the road conditions, we computed the current F1 score and mIOU for the three target road conditions shown as:

To improve the model’s performance on rare objects, we adopted the 2D object detection model from paper: Unknown-Aware Object Detection: Learning What You Don’t Know from Videos in the Wild, CVPR, 2022 to discover unknown objects that are inconsistent by temporal and distribution. We are preparing to generate rare object training data by Stable Diffusion and GPT4 in the next steps. The object detection model’s achitecture is shown as:

Our current qualitative results on discovering unknown objects: