Solution

Our Datasets

KAIST Multi-Spectral Dataset

The KAIST dataset consists of multiple sequences of paired RGB and thermal images. Originally built as a pedestrian detection benchmark, it is one of the very few publicly available datasets that provide RGB and thermal imagery together. Moreover, it includes images captured throughout the day and under different weather conditions, making it suitable for our experimentation.

KAIST Dataset. Each frame contains one RGB and one Thermal Image
NREC Dataset. Each frame contains an RGB image, two near-infrared images from a vertical stereo pair, and a thermal image. The streams are all synchronized and rectified.

NREC Collected Data
The NREC dataset contains not just low-light but also off-road environments, for which there aren't any publicly available datasets yet. As such, NREC has collected its own data at various locations that are closer to our actual domain; an example of one such location during daytime is shown. We seek to train our models on daytime data and evaluate them in both day and night conditions. Note, however, that we do not have any ground-truth depth, so we rely on creating pseudo-ground truth from the vertical stereo pair.

Pseudo Ground Truth

Pseudo-Ground Truth. We estimate depth using classical geometric approaches such as Semi-Global Block Matching (SGBM), which take two input views and output a disparity map. The main idea is to search along the horizontal epipolar lines of the rectified images for the patches that minimize a matching cost. We use the OpenCV implementation of SGBM, with parameters tuned for our use case, to generate the pseudo-ground truth; a sketch of this pipeline follows.
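
As a concrete sketch, the pseudo-ground-truth generation could look like the following. The parameter values and the 90-degree rotation used to handle the vertical stereo pair are our assumptions for illustration; the post only states that OpenCV's SGBM was tuned for this use case.

```python
import cv2
import numpy as np

# Illustrative SGBM parameters -- the values actually tuned for the
# NREC data are not given here, so these are reasonable defaults.
sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,   # must be divisible by 16
    blockSize=5,
    P1=8 * 5 ** 2,        # penalty for small disparity changes
    P2=32 * 5 ** 2,       # penalty for large disparity changes
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2,
)

def pseudo_ground_truth(top, bottom):
    """Compute a disparity map from the rectified vertical NIR pair.

    OpenCV's SGBM searches along horizontal epipolar lines, so we first
    rotate both images by 90 degrees (an assumption about how the
    vertical rig is handled), then rotate the result back.
    """
    left = cv2.rotate(top, cv2.ROTATE_90_CLOCKWISE)
    right = cv2.rotate(bottom, cv2.ROTATE_90_CLOCKWISE)

    # compute() returns fixed-point disparities scaled by 16
    disp = sgbm.compute(left, right).astype(np.float32) / 16.0

    # Rotate back so the disparity map aligns with the original frames
    return cv2.rotate(disp, cv2.ROTATE_90_COUNTERCLOCKWISE)
```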

Results Overview

We use the Monodepth2 architecture to train our models. In addition to the re-projection loss, we also investigate adding extra supervision from the pseudo-ground truth for both the RGB and thermal models; a sketch of this combined objective is given below. All of our results follow.
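
A minimal sketch of the combined objective, assuming an L1 supervised term weighted by a hypothetical factor lambda_sup on top of the photometric loss that Monodepth2 already computes; the post does not specify the exact form of the extra supervision.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_disp, pseudo_gt_disp, reprojection_loss, lambda_sup=0.1):
    """Combine Monodepth2's self-supervised re-projection loss with an
    extra supervised term against the stereo pseudo-ground truth.
    `lambda_sup` is a hypothetical weighting factor, not a value from
    the post."""
    # Only supervise pixels where SGBM produced a valid disparity
    valid = pseudo_gt_disp > 0
    supervised = F.l1_loss(pred_disp[valid], pseudo_gt_disp[valid])
    return reprojection_loss + lambda_sup * supervised
```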

RGB-Based Depth Estimation

RGB-based Depth Estimation. We train Monodepth2 on the RGB images in the NREC dataset and showcase the predicted disparity on the right. Our model clearly identifies all objects in the scene and predicts reasonable depths. We notice that even for the tree branches, the model groups branches belonging to the same tree together and assigns them similar depths.

Thermal-Based Depth Estimation

Thermal-based Depth Estimation. We train Monodepth2 on the thermal images in the NREC dataset and showcase the results above. We first preprocess each thermal image by applying min-max scaling and normalization (sketched below), and show the preprocessed image on the left. As we can see, our thermal model clearly detects the person and estimates the layout of the terrain.
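
A minimal sketch of the thermal preprocessing, assuming min-max scaling to [0, 1] followed by zero-mean, unit-variance normalization; the exact constants are not stated in the post.

```python
import numpy as np

def preprocess_thermal(raw, eps=1e-6):
    """Min-max scale a raw (e.g. 16-bit) thermal frame to [0, 1], then
    normalize to zero mean and unit variance before feeding it to the
    depth network. `eps` guards against constant frames."""
    raw = raw.astype(np.float32)
    scaled = (raw - raw.min()) / (raw.max() - raw.min() + eps)
    return (scaled - scaled.mean()) / (scaled.std() + eps)
```
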
Point Cloud Output for MonoThermal. We visualize the disparity map as a point cloud by de-projecting the pixels in the RGB image based on the camera intrinsics, and showcase the full 3-D point cloud. Our model clearly detects the person and their depth relative to the rest of the scene. We do, however, notice some issues with the trees and the sky, which are likely artifacts of the pseudo-ground truth. A sketch of the de-projection follows.
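
The de-projection is the standard pinhole back-projection; a minimal sketch, assuming a 3x3 intrinsics matrix K and a known stereo baseline (the variable names here are illustrative):

```python
import numpy as np

def deproject(disparity, K, baseline):
    """De-project a disparity map into an (N, 3) point cloud using the
    camera intrinsics K (3x3) and the stereo baseline in meters."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    h, w = disparity.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    valid = disparity > 0
    z = fx * baseline / disparity[valid]  # depth from disparity
    x = (u[valid] - cx) * z / fx          # back-project with intrinsics
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```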

Quantitative Analysis

Quantitative Results. We calculate accuracy as the fraction of pixels whose relative error between the predicted and ground-truth disparity falls within several (delta) thresholds, as sketched below.
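
A minimal sketch of this delta metric; the post does not list the exact thresholds, so we assume the conventional 1.25, 1.25^2, 1.25^3 values here.

```python
import numpy as np

def delta_accuracy(pred, gt, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """Fraction of valid pixels whose prediction/ground-truth ratio falls
    under each threshold (the standard delta metric; the threshold values
    here are the conventional ones, not confirmed by the post)."""
    valid = gt > 0
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return [float((ratio < t).mean()) for t in thresholds]
```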

We train Monodepth2 on both RGB and thermal inputs, with self-supervision alone and with the added stereo pseudo-ground truth. In the self-supervised case, the thermal and RGB models differ substantially at the strictest delta threshold. However, once we add the pseudo-ground truth, we close the gap between RGB and thermal almost entirely and achieve strong performance. Since thermal images undergo far less domain shift from day to night, the thermal model should perform similarly in both conditions, meaning its nighttime performance is nearly as good as the RGB model's daytime performance!