December 2021 - Multimodal Fusion for Autonomous Driving

December 8, 2021 | April 29, 2022

Team

Students

Jeet Kanjani

I am a student in Master of Science in Computer Vision (MSCV) at Carnegie Mellon University.

I have experience working on Object Detection, Pose Estimation and Action Recognition. I have collaborated with Torr Vision Group and NVIDIA in the past.

Shubham Gupta

I am a graduate student (MSCV) at CMU (Robotics Institute, School of Computer Science). I interned at Apple in Summer ’21 (AI/ML) on the Siri Visual Intelligence team.

I did my undergrad at IIT Roorkee (India). I have around 3 years of industrial work experience in AI. I like to research and explore the latest techniques pertaining to AI. My goal is to continue developing AI-based solutions for multiple domains.

Project Responsibility

We explored the datasets (KITTI and nuScenes) together. Shubham primarily worked on setting up the distance-based evaluation framework, including dataset preparation and metrics, while Jeet worked on trying various fusion methods such as NMS and AdaNMS.

Advisor

Deva Ramanan

I am currently a principal scientist at Argo AI and the director of the CMU Argo AI Center for Autonomous Vehicle Research.

My research focuses on computer vision, often motivated by the task of understanding people from visual data. My work tends to make heavy use of machine learning techniques, often using the human visual system as inspiration. For example, temporal processing is a key component of human perception, but is still relatively unexploited in current visual recognition systems. Machine learning from big (visual) data allows systems to learn subtle statistical regularities of the visual world. But humans have the ability to learn from very few examples.

December 8, 2021 | April 18, 2022

Results

Qualitative Results

In the following figure, we note that Far nuScenes contains clean annotations at a distance and for scenes with cluttered objects (a). Comparing the predictions of CenterPoint (b) and FCOS3D (c), we observe that the image-based FCOS3D produces higher-quality predictions for Far3Det, which is not reflected in the standard evaluation. Our proposed fusion method, AdaNMS (d), leverages their respective advantages and greatly improves detection of far-field objects. (Zoom in to see details.)

Quantitative Results

Here we compare various evaluation protocols (3D mAP) on nuScenes for the LiDAR-based CenterPoint and the image-based FCOS3D. We calculate the AP metric using the default thresholds of 0.5, 1, 2, and 4m, as well as the proposed adaptive linear and quadratic thresholds. Under the default metric, we find low numbers for the image-based method in the 50-80m range that are not consistent with the visualization shown in Fig. 6. We posit that this evaluation uses too strict a distance tolerance (e.g., 0.5m) in the far field.

Far nuScenes version of Table 2. We observe a similar trend but with higher numbers. Recall that Far nuScenes is manually cleaned. We believe these improved results now realistically reflect the performance of 3D detectors for far-field detection.

Finally, we show quantitative evaluation (3D mAP) on Far nuScenes under our proposed metrics based on linearly adaptive distance thresholds. First, we notice that all LiDAR-based detectors perform well in the near field but suffer greatly in the far field. Among these detectors, the VoxelNet-backbone CP (CenterPoint) significantly outperforms the rest. The image-based detector FCOS3D significantly outperforms CP in the far field. All fusion methods are able to take the “best of both worlds”, resulting in a significant gain in far-field (50-80m) accuracy. While being much simpler, our proposed NMS and AdaNMS fusion methods significantly improve upon more complicated baselines for all classes except Pedestrian, where a lower overlap threshold for NMS in the far field hurts the recall in cluttered scenes. *CP-VoxelNet is the same as CP appearing elsewhere in this paper. **Note that CLOCs3D is an extension of CLOCs. AdaNMS has two versions, one trained with MVP,

December 8, 2021 | April 18, 2022

Implementation

Dataset

The primary reason why Far3Det is not explored is the difficulty in data annotation, i.e., it is hard to label 3D cuboids for far-field objects if they have few or no LiDAR returns. Despite this difficulty, a reliable validation set should be guaranteed for the study of Far3Det. One of our contributions is the derivation of such a validation set, alongside a designed evaluation protocol.

The figure shows ground-truth visualizations for several well-established 3D detection datasets. We can see that existing datasets (the first three columns) have a significant number of missing annotations on far-field objects. To obtain a reliable validation set, we describe an efficient verification process for identifying annotators that consistently produce high-quality far-field annotations. This helps us identify high-quality far-field annotations and derive Far nuScenes (rightmost), which supports far-field detection analysis.

Below is a quantitative analysis of the missing annotations for far-field objects in the various datasets. (a) We randomly sample 50 frames from each dataset and manually inspect missing annotations for objects beyond 50m to analyze the annotation quality of existing 3D detection datasets. This analysis suggests that the derived subset Far nuScenes has higher annotation quality than the existing benchmarks KITTI, Waymo, and nuScenes. (b) We compare the average number of annotations per frame at a given distance between Far nuScenes (yellow) and the standard nuScenes (blue), showing that the former (ours) has higher annotation density.

Evaluation Protocol

We design two metrics, linear and quadratic as shown below.
The quadratic distance-based threshold can be derived from the standard error analysis of stereo triangulation.

For the linear scheme, we have the threshold given by:

thresh(d) = d/12.5

For the quadratic scheme, we can define the threshold as

thresh(d) = 0.25 + 0.0125d + 0.00125(d²)

where d is the distance from the ego-vehicle in meters.

While standard metrics count positive detections using a fixed threshold (e.g., 4m), we design more reasonable metrics with distance-adaptive thresholds. Specifically, we adopt thresholds that grow linearly or quadratically w.r.t. depth. This imposes reasonably relaxed thresholds for far-field objects, since even humans cannot precisely localize objects in the far field, as well as stricter thresholds for near-field objects.
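The two threshold formulas above can be written out as a minimal Python sketch. The function names `linear_threshold` and `quadratic_threshold` are ours for illustration and are not taken from the project code; the coefficients are the ones stated above.

```python
def linear_threshold(d: float) -> float:
    """Linear distance-adaptive threshold in meters: thresh(d) = d / 12.5."""
    return d / 12.5


def quadratic_threshold(d: float) -> float:
    """Quadratic threshold in meters: 0.25 + 0.0125*d + 0.00125*d^2."""
    return 0.25 + 0.0125 * d + 0.00125 * d ** 2


if __name__ == "__main__":
    # Print both thresholds at a few example distances (in meters).
    for d in (10.0, 30.0, 50.0, 80.0):
        lin = linear_threshold(d)
        quad = quadratic_threshold(d)
        print(f"d = {d:5.1f} m  linear = {lin:5.2f} m  quadratic = {quad:5.2f} m")
```

Evaluating the formulas shows that both schemes give a tolerance of 4m at d = 50m, which equals the largest default threshold. Closer than 50m, both schemes are stricter than that 4m default; beyond 50m, both are more relaxed, with the quadratic scheme growing faster.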

Multimodal Sensor Fusion

NMS Fusion

To fuse LiDAR- and image-based detections, we can naively merge them. As expected, this produces multiple detections overlapping the same ground-truth object. To remove overlapping detections, we apply the well-known Non-Maximum Suppression (NMS). NMS first sorts the 3D bounding boxes by confidence score. Then, it repeatedly picks the box with the highest confidence and discards all the boxes overlapping it.
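The merge-then-suppress procedure described above can be sketched as follows. This is an illustrative simplification, not the project's implementation: it assumes axis-aligned bird's-eye-view boxes instead of rotated 3D cuboids, assumes confidence scores are comparable across the two detectors, and uses a hypothetical default overlap threshold. The names `Det`, `bev_iou`, and `nms_fusion` are ours.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Det:
    x: float      # box center x in the ego frame (m)
    y: float      # box center y in the ego frame (m)
    w: float      # box extent along x (m)
    l: float      # box extent along y (m)
    score: float  # detector confidence
    source: str   # "lidar" or "image"


def bev_iou(a: Det, b: Det) -> float:
    """IoU of two axis-aligned BEV boxes (assumption: yaw is ignored)."""
    ix = min(a.x + a.w / 2, b.x + b.w / 2) - max(a.x - a.w / 2, b.x - b.w / 2)
    iy = min(a.y + a.l / 2, b.y + b.l / 2) - max(a.y - a.l / 2, b.y - b.l / 2)
    inter = max(0.0, ix) * max(0.0, iy)
    union = a.w * a.l + b.w * b.l - inter
    return inter / union if union > 0 else 0.0


def nms_fusion(lidar: List[Det], image: List[Det],
               iou_thresh: float = 0.2) -> List[Det]:
    """Merge both detection sets, then apply greedy NMS.

    iou_thresh is a hypothetical default chosen for this sketch.
    """
    # Sort the merged pool by confidence, highest first.
    pool = sorted(lidar + image, key=lambda d: d.score, reverse=True)
    kept: List[Det] = []
    for det in pool:
        # Keep a box only if it does not overlap any higher-scoring kept box.
        if all(bev_iou(det, k) <= iou_thresh for k in kept):
            kept.append(det)
    return kept


if __name__ == "__main__":
    lidar = [Det(10.0, 0.0, 2.0, 4.0, 0.9, "lidar")]
    image = [Det(10.2, 0.0, 2.0, 4.0, 0.6, "image"),
             Det(60.0, 5.0, 2.0, 4.0, 0.7, "image")]
    for det in nms_fusion(lidar, image):
        print(det.source, det.score)
```

In the usage example, the image detection near x = 10m duplicates the higher-scoring LiDAR detection and is suppressed, while the far image detection has no LiDAR counterpart and survives.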

Adaptive NMS (AdaNMS)

We notice that far-field single-modality detections are noisy, producing multiple overlapping detections for the same ground-truth object.

Therefore, to suppress more overlapping detections in the far field, we propose to use a smaller IoU threshold there. To this end, we introduce a distance-adaptive IoU threshold for NMS, AdaNMS for short. To compute the adaptive threshold for an arbitrary distance, we qualitatively select two IoU thresholds that work sufficiently well on close-range and far-field objects.

We pick distances d₁ = 10m and d₂ = 70m and qualitatively select the thresholds c₁ = 0.2 and c₂ = 0.05, respectively.
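One way to turn the two anchor points (d₁, c₁) and (d₂, c₂) into a threshold for any distance is sketched below. Several choices here are assumptions made for illustration rather than details confirmed above: the threshold is linearly interpolated between the anchors and clamped outside [d₁, d₂], the candidate box's own distance selects the threshold, and boxes are axis-aligned in bird's-eye view. All names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    x: float      # center x in the ego frame (m)
    y: float      # center y in the ego frame (m)
    w: float      # extent along x (m)
    l: float      # extent along y (m)
    score: float  # detector confidence


def adaptive_iou_threshold(d: float, d1: float = 10.0, c1: float = 0.2,
                           d2: float = 70.0, c2: float = 0.05) -> float:
    """IoU threshold at distance d (m).

    Assumption: linear interpolation between (d1, c1) and (d2, c2),
    clamped to c1 below d1 and to c2 beyond d2.
    """
    t = min(1.0, max(0.0, (d - d1) / (d2 - d1)))
    return c1 + t * (c2 - c1)


def bev_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned BEV boxes (assumption: yaw is ignored)."""
    ix = min(a.x + a.w / 2, b.x + b.w / 2) - max(a.x - a.w / 2, b.x - b.w / 2)
    iy = min(a.y + a.l / 2, b.y + b.l / 2) - max(a.y - a.l / 2, b.y - b.l / 2)
    inter = max(0.0, ix) * max(0.0, iy)
    union = a.w * a.l + b.w * b.l - inter
    return inter / union if union > 0 else 0.0


def ada_nms(dets: List[Box]) -> List[Box]:
    """Greedy NMS whose overlap threshold shrinks with distance."""
    kept: List[Box] = []
    for det in sorted(dets, key=lambda b: b.score, reverse=True):
        # Assumption: the candidate's range from the ego-vehicle picks the threshold.
        thresh = adaptive_iou_threshold(math.hypot(det.x, det.y))
        if all(bev_iou(det, k) <= thresh for k in kept):
            kept.append(det)
    return kept


if __name__ == "__main__":
    # The same small overlap (IoU of about 0.11) placed near and far.
    merged = [Box(10.0, 0.0, 2.0, 4.0, 0.9), Box(11.6, 0.0, 2.0, 4.0, 0.8),
              Box(70.0, 0.0, 2.0, 4.0, 0.7), Box(71.6, 0.0, 2.0, 4.0, 0.6)]
    for box in ada_nms(merged):
        print(box.x, box.score)
```

In the usage example, the near pair is kept as two separate objects because its overlap is below the near-field threshold, while the same overlap in the far field exceeds the smaller threshold and the lower-scoring box is suppressed.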

December 8, 2021 | April 18, 2022

Motivation

Autonomous vehicles (AVs) must detect objects in advance to act in time and ensure driving safety. In particular, AVs must accurately detect far-field objects while traveling at high speeds. Because a vehicle traveling at 60mph requires a 60-meter stopping distance, AVs must detect far-field obstacles to avoid colliding with them. Interestingly, detecting far-field objects is also relevant for navigation in urban settings at modest speeds during precarious maneuvers, such as unprotected left turns where opposing traffic might be moving at 35mph, resulting in a relative speed of 70mph. These real-world scenarios motivate us to study the problem of far-field 3D object detection (Far3Det).

Status Quo

3D detection has advanced greatly under AV research, largely owing to modern benchmarks that collect data using LiDAR (e.g., nuScenes, Waymo, and KITTI), which faithfully measures the 3D world and allows for precise 3D localization. However, these benchmarks evaluate detections only up to a certain distance (i.e., within 50 meters of the ego-vehicle), presumably because near-field objects are more important, as they have an immediate impact on AVs’ motion plans. Nevertheless, the aforementioned scenarios demonstrate that Far3Det, the 3D detection of objects beyond this distance (>= 50m), is also crucial.

December 8, 2021 | December 8, 2021

Fall 2021

In this semester, we explored far-field 3D detection. We derived an initial dataset, a novel metric, and a late-fusion scheme that improves on the SOTA for this task.

Final Presentation slides: [link]
Final Presentation Video: [link]

December 7, 2021 | December 8, 2021

Summary

Overview

We focus on the task of far-field 3D detection (Far3Det), the 3D detection of objects beyond a certain distance from an observer, e.g., >50m. Far3Det is particularly important for autonomous vehicles (AVs) operating at highway speeds, which require the detection of far-field obstacles to ensure sufficient braking distances. We first point out that existing AV benchmarks (e.g., nuScenes and Waymo) underemphasize this problem since they evaluate performance only up to a certain distance (50m). One reason is that obtaining ground-truth far-field 3D annotations is difficult, particularly for LiDAR sensors that may produce only a few sparse returns for far-away objects. However, contemporary RGB cameras have much higher resolution and capture stronger far-field signatures, motivating our exploration of LiDAR and RGB fusion for Far3Det. To do so, we derive high-quality far-field annotations for standard benchmarks. We demonstrate that simple distance-based late fusion of LiDAR and RGB detections significantly improves the state-of-the-art. Our results challenge the conventional wisdom that active LiDAR outperforms passive RGB for 3D understanding; once one looks far enough “out”, high-resolution passive imaging can be more effective.

Contributions

We derive an initial dataset for the problem of far-field 3D object detection (Far3Det) and perform an empirical evaluation of baseline algorithms on it. One salient conclusion is that RGB cameras can be more effective in the far-field. We revisit various late-fusion strategies for multimodal detection and find that simple fusion (that applies different fusion strategies for detections at different ranges) can produce significant improvements.
We hope our work will encourage further research on Far3Det.