Motivation
Medical images, such as CT scans, MRIs, and X-rays, are commonly used in hospitals to help medical workers visualize internal structures of the human body, and they are crucial for understanding a patient’s condition. Medical image segmentation identifies organ boundaries, highlights important parts of an image, and isolates regions of interest such as the optic disc. This supports accurate diagnosis, treatment planning, and disease monitoring over time. Traditional approaches rely on manual labeling, which is expensive and time-consuming. Most recently introduced segmentation models require large training datasets with pixel-level annotations [1], [2], which demands expert annotators and is prone to human error. In the general computer vision field, researchers have bridged language models with segmentation tasks [3]. This motivates us to explore a text-based approach to segmentation in the medical field.

Image from MedSAM: “Segment Anything in Medical Images” by Ma et al., 2024
Problem Statement
Traditionally, medical image segmentation relies on pixel-level annotations and training on massive collections of medical images. Recent approaches in the general computer vision field have explored reasoning-based segmentation driven by a user-provided text prompt [3]. Building on this, we identify the need for a training-free, annotation-free medical image segmentation approach that minimizes manual effort while maintaining strong performance. The problem is how to accurately segment medical images based solely on flexible text prompts, without any task-specific training. We therefore incorporate existing vision-language models (VLMs) to generate pixel-level prompts from text prompts. Additionally, general-purpose models like SAM [4] struggle with low-contrast medical images because they were trained on broad natural-image data rather than medical scans, while medical segmentation models like MedSAM [1] require specific input prompts. To address this, we apply a contrast enhancement method that lets us use flexible prompts with SAM and achieve performance comparable to segmentation models specialized for medical images.
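To make the pipeline concrete, the sketch below shows how the contrast-enhancement and SAM stages could fit together. It is a minimal illustration, not our final implementation: it assumes CLAHE (via OpenCV) as the enhancement method and Meta’s segment-anything package for SAM, and the image path, checkpoint path, and point coordinates are placeholders, with the hard-coded point standing in for the pixel-level prompt that the VLM stage would produce from the user’s text prompt.

```python
# Minimal sketch: CLAHE contrast enhancement + SAM point-prompt segmentation.
# Assumptions: CLAHE is the enhancement step, and the point prompt below is a
# placeholder for coordinates produced by the VLM stage of the pipeline.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def enhance_contrast(gray: np.ndarray) -> np.ndarray:
    """Apply CLAHE to a single-channel scan and return a 3-channel image for SAM."""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)
    return cv2.cvtColor(enhanced, cv2.COLOR_GRAY2RGB)

# Load a grayscale medical image (path is a placeholder).
gray = cv2.imread("ct_slice.png", cv2.IMREAD_GRAYSCALE)
image = enhance_contrast(gray)

# Load SAM (checkpoint path is a placeholder) and set the enhanced image.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

# A single foreground point standing in for the VLM-derived pixel prompt.
point_coords = np.array([[256, 256]])  # (x, y); hypothetical VLM output
point_labels = np.array([1])           # 1 = foreground

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # keep the highest-scoring candidate mask
```

CLAHE is a natural candidate for the enhancement step because it boosts local contrast in low-contrast regions without saturating the whole image, which is precisely where SAM tends to fail on medical scans; other enhancement methods could be swapped in at the same point in the pipeline.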
Citations
[1] J. Ma, Y. He, F. Li, L. Han, C. You, and B. Wang, “Segment anything in medical images,” Nature Communications, vol. 15, no. 1, Jan. 2024. [Online]. Available: http://dx.doi.org/10.1038/s41467-024-44824-z
[2] J. Wu, J. Zhu, Y. Jin, and M. Xu, “One-Prompt to Segment All Medical Images,” arXiv preprint arXiv:2305.10300, 2024. [Online]. Available: https://arxiv.org/abs/2305.10300
[3] J. Wang and L. Ke, “LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning,” arXiv preprint arXiv:2404.08767, 2024.
[4] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment Anything,” arXiv preprint arXiv:2304.02643, 2023. [Online]. Available: https://arxiv.org/abs/2304.02643
