Background

Depth Estimation

Many supervised approaches exist for depth estimation with both monocular and stereo camera systems. However, because we lack ground-truth depth, we focus specifically on self-supervised approaches. Our work builds primarily on Monodepth2.

Monodepth2 Architecture. We pass in a sequence of images at consecutive timestamps and (1) predict a disparity map for each image individually with the depth network, and (2) predict the relative camera pose between each image and a fixed reference frame with the pose network. Using the predicted depth and camera poses, the images are warped onto the common reference frame, and a re-projection loss is applied between each warped image and the reference image. Because this loss depends only on the input images, the depth and pose networks can be trained jointly in a self-supervised manner.
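The warping and re-projection step described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the Monodepth2 implementation: it assumes a pinhole camera with intrinsics `K`, a 4x4 relative pose `T_ref_to_src` from the fixed frame to a neighboring frame, single-channel images, and a plain L1 photometric error (Monodepth2 itself combines L1 with SSIM and takes a per-pixel minimum over neighbors). All function names and the default depth range are chosen here for illustration.

```python
import numpy as np

def disparity_to_depth(disp, min_depth=0.1, max_depth=100.0):
    """Map a normalized disparity in [0, 1] to depth in [min_depth, max_depth].
    The depth range is an assumed default, not taken from the text."""
    min_disp = 1.0 / max_depth
    max_disp = 1.0 / min_depth
    return 1.0 / (min_disp + (max_disp - min_disp) * disp)

def reproject(depth, K, T_ref_to_src):
    """For every pixel of the fixed frame, return its (u, v) location
    in the neighboring frame, given depth and relative pose."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)   # back-project to 3D
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])    # homogeneous coords
    src = (T_ref_to_src @ cam_h)[:3]                        # move into neighbor frame
    proj = K @ src                                          # project to pixels
    z = np.clip(proj[2], 1e-6, None)                        # avoid division by zero
    return (proj[0] / z).reshape(h, w), (proj[1] / z).reshape(h, w)

def bilinear_sample(img, u, v):
    """Sample a single-channel image at fractional (u, v), clamping at the border."""
    h, w = img.shape
    u = np.clip(u, 0, w - 1)
    v = np.clip(v, 0, h - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, w - 1)
    v1 = np.minimum(v0 + 1, h - 1)
    du, dv = u - u0, v - v0
    top = img[v0, u0] * (1 - du) + img[v0, u1] * du
    bottom = img[v1, u0] * (1 - du) + img[v1, u1] * du
    return top * (1 - dv) + bottom * dv

def photometric_loss(ref_img, src_img, depth, K, T_ref_to_src):
    """Mean L1 error between the fixed frame and the neighbor warped onto it."""
    u, v = reproject(depth, K, T_ref_to_src)
    warped = bilinear_sample(src_img, u, v)
    return float(np.mean(np.abs(warped - ref_img)))
```

With the correct depth and pose, the warped neighbor should match the fixed frame and the loss approaches zero; errors in either prediction raise the loss, which is what lets both networks learn from image sequences alone.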

Thermal Camera

While conventional color cameras are sensitive to lighting conditions, thermal cameras measure the temperature of the scene and are thus less affected by lighting. Below are some examples of thermal images with their corresponding color images.

Challenges with Thermal

Low Resolution. Compared with RGB images, thermal images generally have much lower resolution and lack the texture that typical color cameras capture. As shown in the image, we cannot easily distinguish the tree branches from the leaves.
Distribution of Thermal Readings. Thermal readings span a large range of values that depends on the time of day and the temperature, as well as on the material properties of the scene. In the example above, the first image shows a large variance in values, while the second shows a lower variance with very low values overall.
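To make the distribution issue concrete, the sketch below summarizes the raw readings of a frame and rescales each frame to [0, 1] using percentiles. Per-frame percentile rescaling is one common way to handle widely varying ranges; it is shown here as an assumption for illustration, not as the preprocessing used in this work. The function names and the synthetic frames in the demo are hypothetical.

```python
import numpy as np

def summarize(frame):
    """Return (min, max, std) of the raw thermal readings in a frame."""
    frame = np.asarray(frame, dtype=np.float64)
    return float(frame.min()), float(frame.max()), float(frame.std())

def rescale_percentile(frame, low=1.0, high=99.0):
    """Rescale a frame to [0, 1] between its low and high percentiles.
    Values outside the percentile window are clipped."""
    frame = np.asarray(frame, dtype=np.float64)
    lo, hi = np.percentile(frame, [low, high])
    if hi <= lo:                      # constant frame: nothing to stretch
        return np.zeros_like(frame)
    return np.clip((frame - lo) / (hi - lo), 0.0, 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for raw sensor counts (made-up values):
    wide = rng.normal(30000.0, 1500.0, size=(64, 80))    # high variance
    narrow = rng.normal(21000.0, 100.0, size=(64, 80))   # low variance, low values
    for name, frame in (("wide", wide), ("narrow", narrow)):
        print(name, summarize(frame), summarize(rescale_percentile(frame)))
```

After rescaling, both frames occupy the same [0, 1] range, at the cost of discarding absolute temperature information and amplifying sensor noise in low-variance frames.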