EndoSTTN
To set up EndoSTTN training, we first organized the RGB images into video sequences. We then ran specular highlight detection to obtain a binary mask for each frame. We finetuned a pretrained EndoSTTN model, originally trained on the HyperKvasir dataset, for an additional 110k iterations on an NVIDIA GeForce RTX 2080 Ti graphics card.
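The report does not specify which detector was used, but as a rough illustration, specular highlights can be picked out by thresholding brightness and saturation in HSV space. The thresholds and file paths in the sketch below are assumptions, not the actual detector:

```python
import cv2
import numpy as np

def specular_mask(bgr: np.ndarray, v_thresh: int = 230, s_thresh: int = 40) -> np.ndarray:
    """Return a binary mask (255 = specular) for one frame.

    Heuristic: specular highlights are very bright (high V) and nearly
    colorless (low S) in HSV space. Thresholds here are illustrative.
    """
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    _, s, v = cv2.split(hsv)
    mask = ((v > v_thresh) & (s < s_thresh)).astype(np.uint8) * 255
    # Dilate slightly so the mask fully covers highlight borders.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.dilate(mask, kernel, iterations=1)

frame = cv2.imread("frame_0001.png")  # hypothetical frame path
cv2.imwrite("mask_0001.png", specular_mask(frame))
```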
DDRM
Before sampling from DDRM, we first trained a DDPM on clean, good-quality EAD images, using 4000 diffusion steps and 300k training iterations. With this pretrained DDPM in hand, DDRM then uses the binary specular highlight mask to determine which pixels of our input images need to be restored.
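For intuition only, here is a heavily simplified, RePaint-style sketch of mask-guided reverse diffusion, assuming a diffusers-style UNet and DDPM scheduler interface. DDRM's actual sampler additionally works in the spectral (SVD) basis of the degradation operator and handles measurement noise, so treat this purely as the conceptual idea of holding known pixels fixed while the model fills the masked ones:

```python
import torch

@torch.no_grad()
def masked_diffusion_inpaint(model, scheduler, y, mask):
    """Illustrative mask-guided reverse diffusion (NOT the exact DDRM sampler).

    y:    degraded image, shape (B, C, H, W), values in [-1, 1]
    mask: 1 where pixels are known (non-specular), 0 where they must be filled
    """
    x = torch.randn_like(y)  # start from pure noise
    for t in scheduler.timesteps:  # descending timesteps
        # One reverse step with the pretrained DDPM.
        eps = model(x, t).sample
        x = scheduler.step(eps, t, x).prev_sample
        # Re-inject the known pixels, noised to the current level, so the
        # model only hallucinates inside the specular highlight mask.
        noise = torch.randn_like(y)
        y_t = scheduler.add_noise(y, noise, t)
        x = mask * y_t + (1 - mask) * x
    return x
```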
Inpainting Results
The results of EndoSTTN and DDRM inpainting are shown below. Since we do not have a ground-truth (i.e., "clean") version of our images, we rely on a no-reference, single-image quality measure: the signal-to-noise ratio (SNR), which represents the strength of the desired signal relative to the noise in the image. A higher SNR indicates a clearer image with less noise.
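One common way to estimate SNR from a single image when no clean reference exists is to treat the mean intensity as signal and the standard deviation as noise; the exact formulation used here is not stated, so the sketch below (including the file name) is an assumption:

```python
import cv2
import numpy as np

def snr_db(image_path: str) -> float:
    """Single-image SNR estimate in decibels: 20 * log10(mean / std).

    Uses mean intensity as the "signal" and standard deviation as the
    "noise" -- a common no-reference proxy, not necessarily the exact
    metric used in this report.
    """
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE).astype(np.float64)
    return 20.0 * np.log10(gray.mean() / gray.std())

print(f"SNR: {snr_db('inpainted_0001.png'):.2f} dB")  # hypothetical file
```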


Failure Cases
A notable failure case for both EndoSTTN and DDRM is inpainting specular highlights that cover a large area. While both models perform well on small specks, they struggle once the region of light saturation becomes large.

YOLOv11 Segmentation Finetuning
We finetune the YOLOv11 segmentation model on EAD data. The images in the EAD dataset have variable sizes, while YOLOv11 expects input dimensions that are powers of two, so we prepared the dataset by tiling each image into several 64×64 crops. We train the model for 200 epochs. The qualitative results are shown below.
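A minimal sketch of this pipeline using the Ultralytics API follows; the directory paths, the `yolo11n-seg.pt` checkpoint variant, and the dataset YAML name are assumptions, and the matching crop of the segmentation labels is omitted for brevity:

```python
from pathlib import Path

import cv2
from ultralytics import YOLO

# 1) Tile each variable-size EAD image into 64x64 crops (paths hypothetical).
src, dst = Path("ead/images"), Path("ead/crops")
dst.mkdir(parents=True, exist_ok=True)
for img_path in src.glob("*.png"):
    img = cv2.imread(str(img_path))
    h, w = img.shape[:2]
    for y in range(0, h - 63, 64):
        for x in range(0, w - 63, 64):
            crop = img[y:y + 64, x:x + 64]
            cv2.imwrite(str(dst / f"{img_path.stem}_{y}_{x}.png"), crop)

# 2) Finetune a pretrained YOLOv11 segmentation checkpoint on the crops.
model = YOLO("yolo11n-seg.pt")  # checkpoint variant is an assumption
model.train(data="ead_seg.yaml", epochs=200, imgsz=64)  # dataset YAML is hypothetical
```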
