EndoSTTN
To set up EndoSTTN training, we first organized the RGB images into video sequences. We then ran specular highlight detection to obtain a binary mask for each frame. We finetuned a pretrained EndoSTTN model, originally trained on the HyperKvasir dataset, for an additional 110k iterations on an NVIDIA GeForce RTX 2080 Ti graphics card.
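The report does not specify which detector was used, but as a rough illustration, specular highlights can be picked out by thresholding brightness and saturation in HSV space. The thresholds and file paths in the sketch below are assumptions, not the actual detector:

```python
import cv2
import numpy as np

def specular_mask(bgr: np.ndarray, v_thresh: int = 230, s_thresh: int = 40) -> np.ndarray:
    """Return a binary mask (255 = specular) for one frame.

    Heuristic: specular highlights are very bright (high V) and nearly
    colorless (low S) in HSV space. Thresholds here are illustrative.
    """
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    _, s, v = cv2.split(hsv)
    mask = ((v > v_thresh) & (s < s_thresh)).astype(np.uint8) * 255
    # Dilate slightly so the mask fully covers highlight borders.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.dilate(mask, kernel, iterations=1)

frame = cv2.imread("frame_0001.png")  # hypothetical frame path
cv2.imwrite("mask_0001.png", specular_mask(frame))
```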
DDRM
Before sampling from DDRM, we first trained a DDPM on clean, good-quality EAD images, using 4000 diffusion steps and 300k training iterations. With this pretrained DDPM in hand, DDRM then uses the binary specular highlight mask to determine which pixels of our input images need to be restored.
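For intuition only, here is a heavily simplified, RePaint-style sketch of mask-guided reverse diffusion, assuming a diffusers-style UNet and DDPM scheduler interface. DDRM's actual sampler additionally works in the spectral (SVD) basis of the degradation operator and handles measurement noise, so treat this purely as the conceptual idea of holding known pixels fixed while the model fills the masked ones:

```python
import torch

@torch.no_grad()
def masked_diffusion_inpaint(model, scheduler, y, mask):
    """Illustrative mask-guided reverse diffusion (NOT the exact DDRM sampler).

    y:    degraded image, shape (B, C, H, W), values in [-1, 1]
    mask: 1 where pixels are known (non-specular), 0 where they must be filled
    """
    x = torch.randn_like(y)  # start from pure noise
    for t in scheduler.timesteps:  # descending timesteps
        # One reverse step with the pretrained DDPM.
        eps = model(x, t).sample
        x = scheduler.step(eps, t, x).prev_sample
        # Re-inject the known pixels, noised to the current level, so the
        # model only hallucinates inside the specular highlight mask.
        noise = torch.randn_like(y)
        y_t = scheduler.add_noise(y, noise, t)
        x = mask * y_t + (1 - mask) * x
    return x
```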
Inpainting Results
The results of EndoSTTN and DDRM inpainting are shown below. Since we do not have a ground-truth (i.e., "clean") version of our images, we rely on a no-reference, single-image quality measure: the signal-to-noise ratio (SNR), which represents the strength of the desired signal relative to the noise in the image. A higher SNR indicates a clearer image with less noise.
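One common way to estimate SNR from a single image when no clean reference exists is to treat the mean intensity as signal and the standard deviation as noise; the exact formulation used here is not stated, so the sketch below (including the file name) is an assumption:

```python
import cv2
import numpy as np

def snr_db(image_path: str) -> float:
    """Single-image SNR estimate in decibels: 20 * log10(mean / std).

    Uses mean intensity as the "signal" and standard deviation as the
    "noise" -- a common no-reference proxy, not necessarily the exact
    metric used in this report.
    """
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE).astype(np.float64)
    return 20.0 * np.log10(gray.mean() / gray.std())

print(f"SNR: {snr_db('inpainted_0001.png'):.2f} dB")  # hypothetical file
```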


Failure Cases
A notable failure case for both EndoSTTN and DDRM is inpainting specular highlights that cover a large area. While both models perform well on small specks, they struggle once the region of light saturation becomes large.

YOLOv11 Segmentation Finetuning
We finetune the YOLOv11 segmentation model on EAD data. The images in the EAD dataset have variable sizes, while YOLOv11 expects input dimensions that are powers of two, so we prepared the dataset by tiling each image into several 64×64 crops. We train the model for 200 epochs. The qualitative results are shown below.
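A minimal sketch of this pipeline using the Ultralytics API follows; the directory paths, the `yolo11n-seg.pt` checkpoint variant, and the dataset YAML name are assumptions, and the matching crop of the segmentation labels is omitted for brevity:

```python
from pathlib import Path

import cv2
from ultralytics import YOLO

# 1) Tile each variable-size EAD image into 64x64 crops (paths hypothetical).
src, dst = Path("ead/images"), Path("ead/crops")
dst.mkdir(parents=True, exist_ok=True)
for img_path in src.glob("*.png"):
    img = cv2.imread(str(img_path))
    h, w = img.shape[:2]
    for y in range(0, h - 63, 64):
        for x in range(0, w - 63, 64):
            crop = img[y:y + 64, x:x + 64]
            cv2.imwrite(str(dst / f"{img_path.stem}_{y}_{x}.png"), crop)

# 2) Finetune a pretrained YOLOv11 segmentation checkpoint on the crops.
model = YOLO("yolo11n-seg.pt")  # checkpoint variant is an assumption
model.train(data="ead_seg.yaml", epochs=200, imgsz=64)  # dataset YAML is hypothetical
```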
