Methodology

On this page, we describe the datasets used to train and evaluate our models. To inpaint different endoscopy artifacts, we experimented with multiple computer vision approaches, from classical baselines to deep learning models, and we use a YOLO-based segmentation model to localize the artifact regions to be restored.

Dataset

We use the Endoscopy Artifact Detection (EAD) Challenge 2019 dataset [1] as our primary dataset. This public dataset contains endoscopic images from multiple procedures, including gastroscopy, cystoscopy, gastroesophageal examinations, and colonoscopy. The images include the artifact types relevant to our project, and the dataset provides ground-truth annotations in both bounding-box and pixel-level segmentation formats.

Figure 1. Example artifact segmentation labels from EAD

We also use HyperKvasir and Endo4IE for training and evaluation. Both datasets consist of gastrointestinal endoscopy images (upper GI and colonoscopy), which makes them suitable for assessing model performance and generalization.

EndoSTTN

The Endoscopic Spatial-Temporal Transformer Network (EndoSTTN) [2] is a transformer-based deep learning approach designed specifically for specular highlight restoration. It captures both spatial details and temporal dependencies in video sequences, enabling it to effectively restore specular highlights in endoscopic scenes.

As shown in Figure 2, EndoSTTN first performs specular highlight detection using classical computer vision methods. It then preprocesses the resulting mask by translating all of its pixels, and uses both the RGB image and the binary mask to train the temporal GAN.

Figure 2. EndoSTTN System Diagram
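
To make the detection step concrete, below is a minimal sketch of classical specular highlight detection with OpenCV: highlights are near-white, so we threshold for low-saturation, high-brightness pixels and dilate the result so the mask fully covers highlight borders. The threshold values, dilation radius, and file name are illustrative assumptions, not EndoSTTN's exact parameters.

```python
import cv2
import numpy as np

def detect_specular_highlights(bgr: np.ndarray,
                               sat_thresh: int = 60,
                               val_thresh: int = 220,
                               dilate_px: int = 3) -> np.ndarray:
    """Return a binary mask (255 = highlight) over near-white pixels."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    _, s, v = cv2.split(hsv)
    # Specular highlights are very bright and nearly colorless.
    mask = ((s < sat_thresh) & (v > val_thresh)).astype(np.uint8) * 255
    # Dilate so the mask also covers the soft edges of each highlight.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                       (2 * dilate_px + 1, 2 * dilate_px + 1))
    return cv2.dilate(mask, kernel)

frame = cv2.imread("frame_0001.png")  # hypothetical endoscopy frame
highlight_mask = detect_specular_highlights(frame)
```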

The overall loss function used in EndoSTTN is

$$\mathcal{L} = \lambda_{\text{hole}} \, \mathcal{L}_{\text{hole}} + \lambda_{\text{valid}} \, \mathcal{L}_{\text{valid}} + \lambda_{\text{adv}} \, \mathcal{L}_{\text{adv}},$$

where the hole and valid losses are simply L1 regression losses over their respective regions of interest: the hole region is the area where pixels are missing, and the valid region is the area where pixels are known. The adversarial term comes from the temporal GAN's discriminator.
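
As a sketch, the two regression terms can be written as masked L1 losses; the function and tensor names below are our own, and the λ weights are treated as hyperparameters outside this snippet.

```python
import torch

def hole_valid_losses(pred: torch.Tensor,
                      target: torch.Tensor,
                      mask: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Masked L1 terms; `mask` is 1 on hole (missing) pixels, 0 on valid ones."""
    eps = 1e-8
    abs_err = (pred - target).abs()
    # Average the L1 error separately over the hole and valid regions.
    hole = (mask * abs_err).sum() / (mask.sum() + eps)
    valid = ((1 - mask) * abs_err).sum() / ((1 - mask).sum() + eps)
    return hole, valid
```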

DDRM

Denoising Diffusion Restoration Models [3] utilize diffusion probabilistic modeling to restore degraded images by iteratively denoising from a learned prior distribution. Specifically tailored to handle structured noise and degradation, DDRM conditions the diffusion process on observed degraded images, guiding it towards producing high-quality restorations.

Based on the input degradation type, DDRM assumes a known linear degradation operator of the form

$$y = Hx + z,$$

where $y$ is the final degraded image, $H$ is a known linear degradation matrix, $z$ is additive Gaussian noise, and $x$ is the clean image we are trying to recover. DDRM uses this formulation together with a pre-trained DDPM to iteratively denoise the input image.

In our case, we supply a binary specular highlight mask to construct the H matrix; this essentially tells DDRM which pixels we want to restore. We then trained a DDPM model using [4]. With this, we were able to leverage DDRM to inpaint specular highlights.

Figure 3. DDRM System Diagram
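
For inpainting, the H matrix reduces to a pixel-selection operator derived from the mask: it keeps the known pixels and drops the ones to be restored. Below is a minimal sketch with function names of our own choosing; DDRM's released code instead works with an SVD-style representation of H, which for inpainting amounts to the same selection.

```python
import numpy as np

def inpainting_operator(mask: np.ndarray):
    """Build H from a binary mask (1 = corrupted pixel to restore, 0 = known).

    H(x) keeps only the known pixels, so the highlight region is dropped
    from the measurement y = H(x) + z.
    """
    known = (mask.reshape(-1) == 0)

    def H(x_flat: np.ndarray) -> np.ndarray:
        # Forward operator: select the known pixels.
        return x_flat[known]

    def H_pinv(y: np.ndarray) -> np.ndarray:
        # Pseudo-inverse: scatter measurements back, zeros in the holes.
        x = np.zeros(mask.size, dtype=y.dtype)
        x[known] = y
        return x

    return H, H_pinv
```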

LaMa

Large Mask Inpainting (LaMa) [5] is a CNN-based image inpainting framework designed to recover missing regions under large, irregular masks. Unlike earlier patch-based approaches, LaMa leverages Fast Fourier Convolution (FFC) blocks, which use spectral transforms to model long-range context and improve global structure consistency while preserving local texture.

As shown in Figure 4, LaMa takes an RGB image and a binary mask as input, where the mask indicates the pixels to be inpainted. The network follows an encoder-decoder pipeline with downsampling, multiple FFC residual blocks, and upsampling to predict the missing content. During training, LaMa is optimized with reconstruction losses on the masked region together with perceptual objectives that encourage realistic structure and texture.

Figure 4. LaMa System Diagram


The FFC transforms part of the feature representation into the frequency domain, enabling the model to capture global context and maintain overall structure and texture consistency. Multi-scale context fusion is achieved by combining local and global branches, allowing the model to handle both fine-grained details and high-level semantics. In our setting, we use LaMa as a single-image baseline for artifact removal by providing the predicted artifact mask as the inpainting region.
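
To illustrate the idea, here is a simplified sketch of the FFC's spectral (global) branch in PyTorch; the actual LaMa block also contains a local convolutional branch and cross-branch fusion, which we omit here. Because a 1x1 convolution in the frequency domain mixes every frequency bin, each output pixel effectively has an image-wide receptive field.

```python
import torch
import torch.nn as nn

class SpectralBranch(nn.Module):
    """Simplified FFC global branch: convolve features in the frequency domain."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv over stacked real/imaginary parts mixes information
        # globally, since each frequency bin spans the full spatial extent.
        self.conv = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1),
            nn.BatchNorm2d(2 * channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        freq = torch.fft.rfft2(x, norm="ortho")         # (B, C, H, W//2+1), complex
        f = torch.cat([freq.real, freq.imag], dim=1)    # stack as real channels
        f = self.conv(f)
        real, imag = f.chunk(2, dim=1)
        freq = torch.complex(real, imag)
        return torch.fft.irfft2(freq, s=(h, w), norm="ortho")  # back to spatial
```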

YOLO Segmentation

Although classical CV methods work well for detecting specular highlights, we want to extend our inpainting method to other artifact types such as fragments and surgical equipment. Since these artifacts also need inpainting, we need a more general way to inform our models which pixels to restore.

To address this, we finetune a pretrained YOLOv11 segmentation model on the EAD segmentation dataset. The results of our finetuning are shown in the Results section.
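
Below is a minimal finetuning sketch using the Ultralytics API; the dataset config file name and the hyperparameters are placeholders for our actual setup, not the exact values we used.

```python
from ultralytics import YOLO

# Start from pretrained segmentation weights and finetune on EAD.
# "ead_seg.yaml" is a hypothetical dataset config listing the EAD image
# paths and artifact class names in Ultralytics' dataset format.
model = YOLO("yolo11n-seg.pt")
model.train(data="ead_seg.yaml", epochs=100, imgsz=640)

# Predict artifact masks on a new frame; results[0].masks holds the
# per-instance segmentation masks we pass to the inpainting models.
results = model.predict("frame_0001.png")
```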