Methodology

Our main focus this semester was addressing specular highlights. On this page, we first describe the dataset we used to train and test our models. We then present the two generative computer vision methods we experimented with for inpainting specular highlights. Finally, we explain our process for fine-tuning a YOLO model for artifact segmentation.

Dataset

The dataset we used this semester comes from the Endoscopy Artifact Detection (EAD 2019) challenge [1]. This public dataset includes images from several endoscopic procedures, including gastroscopy, cystoscopy, gastro-oesophageal endoscopy, and colonoscopy. These endoscopic images contain the artifact types relevant to our project, and the dataset also provides ground-truth bounding-box and segmentation labels for them.

Figure 1. Example artifact segmentation labels from EAD

EndoSTTN

The Endoscopic Spatial-Temporal Transformer Network (EndoSTTN) [2] is a transformer-based deep learning approach designed specifically for specular highlight restoration. It captures both spatial detail and temporal dependencies in video sequences, enabling it to restore specular highlight regions using information from neighboring frames.

As shown in Figure 2, EndoSTTN first performs specular highlight detection using classical CV methods. It then preprocesses the resulting binary mask by translating (shifting) it, producing masks that overlay regions where the true pixels are known, and uses both the RGB frames and the binary masks to train the temporal GAN.
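
The detection and mask-translation steps can be sketched as follows. This is a minimal illustration using OpenCV, assuming a simple saturation/brightness threshold as the classical detector and a fixed-offset shift for the mask; the exact thresholds and offsets used by EndoSTTN may differ.

```python
import cv2
import numpy as np

def detect_specular_mask(frame_bgr, sat_max=60, val_min=220):
    """Classical specular-highlight detection: highlights are bright
    (high value) and desaturated (low saturation). The thresholds
    here are illustrative, not EndoSTTN's exact values."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    sat, val = hsv[..., 1], hsv[..., 2]
    mask = ((sat < sat_max) & (val > val_min)).astype(np.uint8)
    # Dilate slightly so the mask covers highlight fringes.
    return cv2.dilate(mask, np.ones((5, 5), np.uint8))

def translate_mask(mask, dx=20, dy=20):
    """Shift the detected mask by a fixed offset so that it lands on
    tissue whose true pixels are known, giving pseudo-ground-truth
    pairs for training."""
    h, w = mask.shape
    M = np.float32([[1, 0, dx], [0, 1, dy]])
    return cv2.warpAffine(mask, M, (w, h))

frame = cv2.imread("frame_0001.png")   # hypothetical input frame
mask = detect_specular_mask(frame)
shifted = translate_mask(mask)         # overlays known pixels
```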

Figure 2. EndoSTTN System Diagram

The overall loss function used in EndoSTTN is

$\mathcal{L} = \lambda_{\text{hole}} \mathcal{L}_{\text{hole}} + \lambda_{\text{valid}} \mathcal{L}_{\text{valid}} + \lambda_{\text{adv}} \mathcal{L}_{\text{adv}}$

where the hole and valid losses are simply L1 regression losses over their respective regions of interest, and the adversarial loss comes from the temporal GAN's discriminator. The hole region represents the area where the pixels are missing, and the valid region is the area where we have known pixels.
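
A minimal PyTorch sketch of the hole and valid terms is shown below, assuming the mask is 1 on missing (hole) pixels; normalizing each term by its region's area is one common convention, not necessarily EndoSTTN's exact one.

```python
import torch

def hole_valid_losses(pred, target, mask):
    """L1 regression losses over the hole (missing) and valid (known)
    regions. `mask` is 1 where pixels are missing, 0 where known."""
    hole = mask
    valid = 1.0 - mask
    l_hole = torch.abs(hole * (pred - target)).sum() / hole.sum().clamp(min=1)
    l_valid = torch.abs(valid * (pred - target)).sum() / valid.sum().clamp(min=1)
    return l_hole, l_valid
```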

DDRM

Denoising Diffusion Restoration Models [3] utilize diffusion probabilistic modeling to restore degraded images by iteratively denoising from a learned prior distribution. Specifically tailored to handle structured noise and degradation, DDRM conditions the diffusion process on observed degraded images, guiding it towards producing high-quality restorations.

Based on the input degradation type, DDRM models the degradation as a linear inverse problem of the form

$y = Hx + z$

where y is the final degraded image, H is a known linear degradation matrix, z is additive Gaussian noise, and x is the image we are trying to recover. DDRM uses this formulation together with a pre-trained DDPM model to iteratively denoise the input image.

In our case, we supplied a binary specular highlight mask to construct the H matrix; this tells DDRM which pixels we want to restore. We then trained a DDPM model using [4]. With this, we were able to leverage DDRM to inpaint specular highlights.
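
For inpainting, H reduces to a pixel-selection operator: it keeps the pixels outside the highlight mask and discards the rest. A small NumPy sketch under that assumption (with our convention that mask = 1 marks pixels to restore):

```python
import numpy as np

def inpainting_degradation(x, mask):
    """Apply y = Hx, where H selects only the known pixels.
    `x` is an image, `mask` is 1 on pixels to restore. Rather than
    materializing H as a dense matrix, we index the kept rows."""
    keep = (mask.ravel() == 0)   # rows of H: the known pixels
    y = x.ravel()[keep]          # equivalent to H @ x.ravel()
    return y, keep

# DDRM needs H's structure, not a dense matrix; a selection matrix's
# singular values are all 1, which makes the inpainting case cheap.
img = np.random.rand(64, 64)     # stand-in image
spec_mask = np.zeros((64, 64))
spec_mask[20:30, 20:30] = 1      # hypothetical highlight region
y, keep = inpainting_degradation(img, spec_mask)
```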

Figure 3. DDRM System Diagram

YOLO Segmentation

Although classical CV methods work well for detecting specular highlights, we want to extend our inpainting approach to other artifact types, such as fragments and surgical equipment. Since these artifacts also need inpainting, we need a more general way to inform our models which pixels to restore.

To address this, we fine-tune a pretrained YOLOv11 model on the EAD segmentation dataset. The results of our fine-tuning are shown in the Results section.
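
A minimal fine-tuning sketch using the Ultralytics API is shown below, assuming the EAD labels have been converted to YOLO segmentation format and described by a hypothetical ead-seg.yaml config; the epoch count and image size are illustrative, not our final settings.

```python
from ultralytics import YOLO

# Start from pretrained YOLOv11 segmentation weights.
model = YOLO("yolo11n-seg.pt")

# Fine-tune on the EAD artifact classes; "ead-seg.yaml" is a
# hypothetical dataset config pointing at the converted labels.
model.train(data="ead-seg.yaml", epochs=100, imgsz=640)

# Predict artifact masks on a held-out frame (hypothetical path).
results = model("val/frame_0001.png")
results[0].show()
```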