Overall Pipeline
In this work, we propose a training-free pipeline for text-guided medical image segmentation. Instead of relying on manually annotated pixel-level labels or additional model fine-tuning, we require only minimal manual effort: the user provides a natural language prompt like ‘segment the optic disc’ and receives the desired medical segmentation mask. To build this pipeline, we leverage existing segmentation models, such as SAM, and pair them with vision-language models (VLMs). This fusion enables our system to interpret and act on flexible text queries. We also introduce a visual prompt augmentation strategy that helps these models work more effectively on medical data. For example, SAM was trained primarily on natural images rather than medical images, which makes it less reliable on images with weak boundaries or low contrast.
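As a minimal sketch of how the final prompting step can be wired together, the snippet below passes a VLM-derived box and the selected point prompts to a frozen SAM via the public segment-anything API. The checkpoint filename is an assumption, and the box and points are produced by the steps described in the sections that follow.

import numpy as np
from segment_anything import SamPredictor, sam_model_registry

def segment_with_prompts(image_rgb, box, point_coords, checkpoint="sam_vit_h.pth"):
    """Prompt a frozen SAM with a box (from the VLM) and point prompts.

    image_rgb:    (H, W, 3) uint8 image, already contrast-enhanced.
    box:          (x0, y0, x1, y1) from the text-conditioned VLM.
    point_coords: (N, 2) foreground points from the selection step below.
    """
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)  # assumed path
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)  # SamPredictor expects an RGB uint8 array
    masks, scores, _ = predictor.predict(
        point_coords=np.asarray(point_coords, dtype=np.float32),
        point_labels=np.ones(len(point_coords)),  # label 1 marks foreground points
        box=np.asarray(box, dtype=np.float32),
        multimask_output=False,  # one mask per text query
    )
    return masks[0]  # binary mask of shape (H, W)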
Contrast Augmentation (CLAHE)
We evaluated the images under several augmentation methods and found that CLAHE (Contrast Limited Adaptive Histogram Equalization) best matches our goals. CLAHE divides an image into small tiles and equalizes the contrast of each tile individually, clipping each tile's histogram to limit noise amplification. This mitigates a common problem in medical imaging, where low lighting, low contrast, and weak boundaries degrade SAM's performance.
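As a concrete illustration, the snippet below applies CLAHE with OpenCV. The clip limit and tile size are illustrative defaults rather than tuned values from our experiments; equalization is done on the lightness channel so that color (e.g., in fundus images) is preserved.

import cv2

def apply_clahe(image_rgb, clip_limit=2.0, tile_grid=(8, 8)):
    # Work in LAB space so that only the lightness channel is equalized,
    # leaving color information untouched.
    lab = cv2.cvtColor(image_rgb, cv2.COLOR_RGB2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    l_eq = clahe.apply(l)  # per-tile equalization with clipped histograms
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2RGB)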

Box and Point Generation
One way we currently generate bounding box and point prompts is illustrated here. First, we use CogVLM to obtain a bounding box from the textual input. Then, we take the center of the bounding box as the anchor point and use DINOv2 to encode the image. Next, we choose the 10 points whose features have the highest cosine similarity [1] to the anchor point, i.e., its nearest neighbors (KNN [1]) in feature space. Finally, we cluster these 10 points into 3 groups to obtain our 3 points for the point prompt.
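A minimal sketch of this point-selection step is shown below, assuming the DINOv2 patch embeddings and their pixel coordinates have already been extracted (e.g., from the torch.hub DINOv2 models). k-means is used here as the grouping step, and taking the cluster centers as the final prompts is an assumption about how the 3 points are derived; parameter values mirror the numbers above.

import numpy as np
from sklearn.cluster import KMeans

def select_point_prompts(patch_feats, patch_coords, box, k=10, n_points=3):
    """Pick point prompts from encoder patch features.

    patch_feats:  (N, D) DINOv2 patch embeddings.
    patch_coords: (N, 2) pixel-space (x, y) centers of the patches.
    box:          (x0, y0, x1, y1) bounding box from CogVLM.
    """
    # Anchor = the patch whose center is closest to the box center.
    center = np.array([(box[0] + box[2]) / 2, (box[1] + box[3]) / 2])
    anchor = np.argmin(np.linalg.norm(patch_coords - center, axis=1))

    # Cosine similarity between the anchor and every patch feature.
    unit = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    sims = unit @ unit[anchor]

    # Keep the k most similar patches: the anchor's nearest neighbors.
    top = np.argsort(sims)[-k:]

    # Group the k points into n_points clusters; the cluster centers
    # become the point prompts passed to SAM.
    km = KMeans(n_clusters=n_points, n_init=10, random_state=0).fit(patch_coords[top])
    return km.cluster_centers_  # (n_points, 2)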

Citations
[1] Carleton College Computer Science Department. (2010). K-Nearest Neighbor (KNN). https://cs.carleton.edu/cs_comps/0910/netflixprize/final_results/knn/index.html

