Experimental Results

Multimodal Pedestrian Detection on KAIST

Quantitative Comparison

Quantitative comparison on KAIST, measured by LAMR↓ in percentage on the two KAIST test sets (old and new). Following the literature, we evaluate under the "reasonable" setting [1], i.e., ignoring small or occluded persons. Our Bayesian Fusion approach (with bounding box fusion) is compared against prior methods in Table 1; we take reported numbers from [1] for most compared methods. Clearly, our Bayesian Fusion approach outperforms the prior methods by a large margin. Bold numbers mark the best results.
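For reference, LAMR (log-average miss rate) samples the miss rate at nine FPPI values evenly spaced in log-space over [10^-2, 10^0] and averages them in the log domain, following the standard Caltech protocol. Below is a minimal sketch of this computation; the function name and the assumption that the MR-FPPI curve is sorted by ascending FPPI are ours, not from the paper.

    import numpy as np

    def log_average_miss_rate(miss_rates, fppi):
        # Sample the miss rate at 9 reference FPPI points evenly spaced
        # in log-space over [1e-2, 1e0], then average in the log domain.
        # Assumes fppi is sorted ascending, with matching miss_rates.
        ref_points = np.logspace(-2.0, 0.0, num=9)
        sampled = []
        for ref in ref_points:
            idx = np.where(fppi <= ref)[0]
            mr = miss_rates[idx[-1]] if idx.size > 0 else miss_rates[0]
            sampled.append(max(mr, 1e-10))  # guard against log(0)
        return np.exp(np.mean(np.log(sampled)))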

Ablation Study

Ablation study on the KAIST new test set under the "reasonable" setting, measured by LAMR↓ in percentage. Please see the text for a detailed discussion; overall, we find our proposed BayesFusion approach to outperform all other variants, including end-to-end learned approaches such as Early and MidFusion. Fig. 5 shows the corresponding MR-FPPI curves.
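To make the BayesFusion variant concrete: a standard probabilistic late-fusion rule multiplies per-modality posteriors under a conditional-independence assumption, so agreement between detectors raises the fused confidence. The sketch below is our hedged reading of such score-level fusion and may differ in detail from the exact formulation used here.

    def bayes_fuse_scores(s_rgb, s_thermal, prior=0.5):
        # Fuse two detectors' confidences for the same object, assuming the
        # modalities are conditionally independent given the class label.
        # With a uniform prior this is a product-of-posteriors rule.
        pos = s_rgb * s_thermal / prior
        neg = (1.0 - s_rgb) * (1.0 - s_thermal) / (1.0 - prior)
        return pos / (pos + neg)

For example, scores of 0.8 and 0.7 fuse to about 0.90: mutual agreement increases confidence, which is what separates this rule from simple score averaging.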

Qualitative Comparison

Qualitative results on three random testing examples from KAIST. Top: over the RGB images, we overlay the detection results from our mid-fusion model. Bottom: on the thermal images, we show results from our best-performing Bayesian Fusion model. Green, red and blue boxes indicate true positives, false negatives (missed persons) and false positives, respectively. Visually, our Bayesian Fusion performs much better than the mid-fusion model.

Multimodal Object Detection on FLIR

Quantitative Comparison

Quantitative comparison on FLIR measured by AP↑ in percentage with IoU > 0.5. Following the literature, we evaluate on the three categories annotated by FLIR. Perhaps surprisingly, end-to-end training on thermal images alone already outperforms all the prior methods, presumably because of better augmentations and a better pre-trained model (Faster R-CNN). Moreover, our fusion methods perform even better, and our Bayesian Fusion method performs best. These results are reported in Table 3.
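For completeness, the IoU > 0.5 criterion counts a prediction as a true positive only when its intersection-over-union with a matched ground-truth box exceeds 0.5. A minimal sketch follows; the (x1, y1, x2, y2) box format is our assumption.

    def iou(box_a, box_b):
        # Intersection-over-union of two boxes in (x1, y1, x2, y2) format.
        x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)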

Ablation Study

Breakdown analysis on FLIR day/night scenes (AP↑ in percentage with IoU > 0.5). As FLIR does not tag its images as day or night, we manually annotate them for this analysis. Clearly, incorporating RGB through our learning-based fusion methods notably improves performance on both day and night scenes. We further explore late fusion over the detection outputs of our three models: Thermal, Early and Mid. We find that AvgScore, NMS and BayesFusion all lead to better performance than the learning-based MidFusion model. In particular, BayesFusion performs best, and adding bounding box fusion (bbox) improves results further; the sketch below illustrates these variants.
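The following sketch illustrates the three late-fusion variants on a group of detections already matched to the same object; the helper name and the score-weighted box average used for the bbox variant are our assumptions, not necessarily the paper's exact procedure.

    import numpy as np

    def fuse_matched_detections(boxes, scores, method="bayes"):
        # Late-fuse detections of one object from several models
        # (e.g. Thermal, Early, Mid).
        #   "avg":   average the scores (AvgScore)
        #   "nms":   keep only the highest score (what NMS reduces to here)
        #   "bayes": product-of-posteriors fusion with a uniform prior
        scores = np.asarray(scores, dtype=float)
        if method == "avg":
            fused_score = scores.mean()
        elif method == "nms":
            fused_score = scores.max()
        else:  # "bayes"
            pos = np.prod(scores)
            neg = np.prod(1.0 - scores)
            fused_score = pos / (pos + neg)
        # Bounding box fusion ("bbox"): score-weighted average of the boxes.
        weights = scores / scores.sum()
        fused_box = (weights[:, None] * np.asarray(boxes, dtype=float)).sum(axis=0)
        return fused_box, fused_score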

Qualitative Comparison

Qualitative multimodal detection results on FLIR images. We show three examples (in columns) with RGB (top) and thermal images (middle and bottom). We overlay the ground-truth annotations on the RGB images, highlighting that the RGB and thermal images are strongly misaligned. To avoid clutter, we do not mark class labels on the bounding boxes. On the thermal images, we show qualitative results from our thermal-only model (middle row) and our best-performing BayesFusion (with bounding box fusion) model (bottom row). Green, red and blue boxes indicate true positives, false negatives (missed objects) and false positives, respectively. In the third column in particular, the thermal-only model has many false negatives (missed detections), namely "bicycles". Understandably, thermal images do not deliver strong signatures for bicycles, but RGB images do. This explains why our fusion model performs better at detecting bicycles.