Overall Pipeline
In this work, we propose a training-free pipeline for text-guided medical image segmentation. Instead of relying on manually annotated pixel-level labels or additional model fine-tuning, our approach requires minimal manual effort: users provide a natural language prompt such as ‘segment the optic disc’ to obtain the desired segmentation mask. The pipeline combines tunable test-time parameters, a grounding model with reasoning capabilities (CogVLM [3]), a segmentation model (SAM [1]), and a validation model, and uses Bayesian optimization to iteratively refine the segmentation results without requiring any ground-truth labels.
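The iterative refinement loop described above can be sketched as follows. All three components are hypothetical stubs (the real pipeline uses CogVLM for grounding, SAM for segmentation, and a VLM-based validator), and uniform random proposals stand in here for the Bayesian-optimized parameter search; the stub names and the single `threshold` parameter are illustrative assumptions, not the paper's actual interface.

```python
import random

# Hypothetical stand-ins for the pipeline components; the real system uses
# CogVLM (grounding), SAM (segmentation), and a VLM-based proxy validator.
def ground(prompt, image, params):
    """Return a bounding box for the text prompt (stub)."""
    return (10, 10, 50, 50)

def segment(image, box, params):
    """Return a candidate mask for the box (stub)."""
    return {"box": box, "threshold": params["threshold"]}

def validate(image, mask, prompt):
    """Return a proxy validation score in [0, 1] (stub: peaks at 0.6)."""
    return max(0.0, 1.0 - abs(mask["threshold"] - 0.6))

def run_pipeline(prompt, image, n_trials=20, seed=0):
    """Propose test-time parameters, segment, score, and keep the best mask.
    Random proposals stand in for the Bayesian optimization search."""
    rng = random.Random(seed)
    best_mask, best_score = None, -1.0
    for _ in range(n_trials):
        params = {"threshold": rng.uniform(0.0, 1.0)}  # tunable test-time parameter
        box = ground(prompt, image, params)
        mask = segment(image, box, params)
        score = validate(image, mask, prompt)
        if score > best_score:
            best_mask, best_score = mask, score
    return best_mask, best_score
```

No ground truth enters the loop: the validator's proxy score alone decides which candidate mask is kept.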
Bayesian Optimization


We apply Bayesian optimization [2] to search for the configuration of LTAs that maximizes the validator’s score. The table above lists the operations we use to achieve the best results.
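As a minimal, self-contained sketch of this search, the following runs Bayesian optimization over a single toy test-time parameter, using a NumPy Gaussian-process surrogate with an expected-improvement acquisition; `validator_score` is a hypothetical stand-in for the validator's proxy score, and the kernel, length scale, and discretized search range are illustrative choices, not the paper's configuration.

```python
import math
import numpy as np

def rbf_kernel(a, b, length=0.2):
    """Squared-exponential kernel between 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_tr, y_tr, x_q, noise=1e-6):
    """Gaussian-process posterior mean and std at the query points."""
    K = rbf_kernel(x_tr, x_tr) + noise * np.eye(len(x_tr))
    Ks = rbf_kernel(x_tr, x_q)
    K_inv = np.linalg.inv(K)
    mu = Ks.T @ K_inv @ y_tr
    # diag(Kss - Ks^T K^-1 Ks); the RBF prior variance on the diagonal is 1
    var = 1.0 - np.sum((Ks.T @ K_inv) * Ks.T, axis=1)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, best):
    """EI acquisition: expected gain over the best score seen so far."""
    z = (mu - best) / sigma
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2)))
    return (mu - best) * cdf + sigma * pdf

def validator_score(x):
    """Toy stand-in for the validator: peaks when the parameter x is 0.7."""
    return -((x - 0.7) ** 2)

candidates = np.linspace(0.0, 1.0, 201)   # discretized parameter range
x_obs = np.array([0.0, 0.5, 1.0])         # initial design points
y_obs = validator_score(x_obs)
for _ in range(10):                        # BO iterations
    mu, sigma = gp_posterior(x_obs, y_obs, candidates)
    x_next = candidates[np.argmax(expected_improvement(mu, sigma, y_obs.max()))]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, validator_score(x_next))
best_x = x_obs[np.argmax(y_obs)]
```

Each iteration evaluates the (expensive) validator only once, at the point where the surrogate predicts the largest expected improvement, which is what makes the approach practical when every evaluation requires a full grounding-and-segmentation pass.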
Proxy Validation
We perform proxy validation using two complementary evaluations: zero-shot classification and image–text matching.
For evaluation with zero-shot classification, we first create a test image that keeps only the region indicated by the predicted segmentation mask. To evaluate whether this region matches the intended anatomical structure, we prompt a general-purpose LLM with a template that includes a description of the target region and a description of the full image context; the LLM outputs several contrastive text labels. These labels, together with the test image, are fed into a vision–language model such as BiomedCLIP [4], which performs zero-shot classification; the probability assigned to the target description is used as the zero-shot score.
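The two steps above, masking out everything but the predicted region and then scoring contrastive labels, can be sketched as follows. The label set and similarity values are illustrative placeholders; in the actual pipeline the similarities come from BiomedCLIP's image and text encoders, and the temperature is a CLIP-style assumption.

```python
import numpy as np

def masked_test_image(image, mask):
    """Keep only the masked region of an H x W x C image (H x W binary mask),
    zeroing out everything outside the predicted segmentation."""
    return image * mask[:, :, None]

def zero_shot_score(sims, labels, target, temperature=0.07):
    """Softmax over image-text similarities across contrastive labels;
    the probability assigned to the target label is the zero-shot score."""
    logits = np.asarray(sims, dtype=float) / temperature
    probs = np.exp(logits - logits.max())   # subtract max for stability
    probs /= probs.sum()
    return float(probs[labels.index(target)])

# Illustrative usage: labels from the LLM, similarities from a VLM encoder.
labels = ["optic disc", "optic cup", "background retina"]
sims = [0.31, 0.22, 0.18]                   # placeholder cosine similarities
score = zero_shot_score(sims, labels, "optic disc")
```

A mask that isolates the correct structure should push the image embedding toward the target description and away from the contrastive distractors, raising the target's softmax probability.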
Zero-shot classification focuses on medical terminology, but it does not check whether the segmented region actually looks correct. To capture visual characteristics such as color, shape, or texture, we add a second evaluation method based on image–text matching. We prompt an LLM to generate descriptive sentences about how the target region is expected to appear. The same vision-language model is then used to compute the similarity between the test image and each of these descriptions. The average similarity becomes the image–text matching score.
The final validation score for a predicted mask is obtained by combining the zero-shot classification score and the image–text matching score.
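Assuming unit-normalized embeddings, the image–text matching score and its combination with the zero-shot score can be sketched as below; the equal weighting is an assumption for illustration, since the text only states that the two scores are combined.

```python
import numpy as np

def itm_score(image_emb, description_embs):
    """Average cosine similarity between the test-image embedding and the
    embeddings of the LLM-generated appearance descriptions (all assumed
    unit-normalized, so the dot product is the cosine similarity)."""
    sims = [float(image_emb @ d) for d in description_embs]
    return sum(sims) / len(sims)

def validation_score(zero_shot, itm, weight=0.5):
    """Combine the two proxy scores. The equal weighting here is an
    illustrative assumption, not the paper's stated formula."""
    return weight * zero_shot + (1.0 - weight) * itm
```

This combined score is the single objective the Bayesian optimization loop maximizes, so both terminology (zero-shot) and appearance (image–text matching) influence which mask configuration is selected.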
Citations
[1] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment anything. In ICCV, 2023.
[2] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In NeurIPS, 2012.
[3] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, et al. Cogvlm: Visual expert for pretrained language models. NeurIPS, 2024.
[4] Sheng Zhang, Yanbo Xu, Naoto Usuyama, et al. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv:2303.00915, 2023.

