Calibration - Uncertainty in Image classification

Motivation: In order to investigate how model architecture might affect predicted confidence distribution, we propose to measure how different models are calibrated on detection task.

Dataset: COCO 2017, 2D detection.

Metric: mAP, ECE, D-ECE, Brier score

Candidate: GLIP(Vision-language model, zero shot), CenterNet-V1, CenterNet-V2, FCOS-R-50, FCOS-X-101, Faster-Rcnn, RetinaNet

Calibration method: beta calibration and histogram binning

Before calibration:

We first compare the performance of different model . Result shows the following observation:

GLIP are extremely overfidence for P< 0.7, and slightly underconfidence for P>0.7
CenterNet is perfectly calibrated at low P, be more and more underconfidence as P increase

Faster RCNN tend to predict extreme P(0 or 1), might to do with two stage structure(disregard RPN objectness confidence score). And D-ECE is sensitive to box position in image.
All method are overconfidence at low P, under-confidence at high P (except for two stage Faster RCNN!)
CenterNet2 (probabilistic aware) and RetinaNet (train with focal loss) is most well-calibrated in general

Confidence histogram and reliability diagram of faster-Rcnn and RetinaNet before calibration.

Difference between confidence and performance in each region of the image for Faster-Rcnn before calibration.

After calibration:

Surprisingly, after calibration, result shows GLIP receive the best performance in both mAP (detection accuracy) and ECE(uncertainty measurement). This shows that,

1. Simple classification calibration methods could already receive relatively good results on detection.
2. Large-scale pre-train is potentially useful for certainty aware since it knows more about ‘what is certain and what is uncertain’ in a more realistic setting.