Overview
Our method reduces hallucination in Vision-Language Models (VLMs) on multi-image visual question answering (VQA) tasks. We propose Attribute Visual Prompting, a technique that enhances cross-image reasoning by linking consistent visual cues through attribute-level analysis.
Model Architecture Overview

Technical Components
- Dynamic attribute pool: a CLIP-based textual prompt learner that retrieves attributes embedded in the images, providing a pool of candidate attribute tokens originally optimized for image classification (see the attribute-retrieval sketch after this list).
- Attribute-aware text prompt: the retrieved attribute tokens are inserted into the question to steer the LLM toward verifiable details.
- Attribute-aware visual prompt: attention-based visual prompting directs the VLM to the regions of each image that are relevant to the text query (see the heatmap sketch after this list).
- Token-level fusion: visual, attribute, and question tokens are fed to the model together, producing grounded answers with substantially reduced hallucination.
- Zero-shot and training-free: the framework requires no retraining on specific datasets and adapts to a range of multi-image VQA benchmarks.
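
The sketch below illustrates the dynamic attribute pool and the attribute-aware text prompt using the Hugging Face CLIP interface. The attribute vocabulary, the top-k value, and the prompt template are illustrative placeholders, not the exact configuration of our pipeline.

```python
# Minimal sketch: score a candidate attribute vocabulary against each image with CLIP,
# then splice the top-scoring attributes into the question.
# ATTRIBUTE_POOL, top_k, and the prompt template are illustrative choices.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical candidate pool (color / material / shape / count cues).
ATTRIBUTE_POOL = [
    "red", "blue", "green", "metallic", "wooden",
    "round", "rectangular", "small", "large", "two objects",
]

def retrieve_attributes(image: Image.Image, top_k: int = 3) -> list[str]:
    """Return the top-k attribute phrases most similar to the image under CLIP."""
    inputs = processor(text=ATTRIBUTE_POOL, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]      # (len(ATTRIBUTE_POOL),)
    top = logits.topk(top_k).indices.tolist()
    return [ATTRIBUTE_POOL[i] for i in top]

def attribute_aware_question(question: str, images: list[Image.Image]) -> str:
    """Prefix the question with per-image attribute hints, one line per image."""
    lines = [f"Image {i} attributes: {', '.join(retrieve_attributes(img))}."
             for i, img in enumerate(images, start=1)]
    return "\n".join(lines) + "\nQuestion: " + question
```

Because the same vocabulary is scored independently for every image, the hints give the LLM a consistent attribute-level vocabulary for cross-image comparison.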
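
The next sketch covers the attribute-aware visual prompt and the token-level fusion step. Patch-level CLIP similarity to the query is upsampled into a heatmap and blended onto each image, in the spirit of API's text-query-guided overlays. Projecting patch tokens through CLIP's visual projection is a common approximation rather than CLIP's trained behavior, and `vlm_answer` at the end is a hypothetical stand-in for whichever multi-image VLM interface is used.

```python
# Minimal sketch of an attention-guided visual prompt: highlight query-relevant
# regions of each image by dimming everything else. The patch-token projection and
# the blending rule are illustrative approximations.
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def query_heatmap(image: Image.Image, query: str) -> np.ndarray:
    """Patch-level CLIP similarity to the query, upsampled to image size, scaled to [0, 1]."""
    img_inputs = processor(images=image, return_tensors="pt")
    txt_inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        tokens = model.vision_model(**img_inputs).last_hidden_state     # (1, 1+N, 768)
        patches = model.visual_projection(
            model.vision_model.post_layernorm(tokens[:, 1:, :]))        # (1, N, 512)
        text = model.get_text_features(**txt_inputs)                    # (1, 512)
    sim = F.normalize(patches, dim=-1) @ F.normalize(text, dim=-1).T    # (1, N, 1)
    side = int(sim.shape[1] ** 0.5)                                     # 7x7 grid for ViT-B/32
    heat = F.interpolate(sim.reshape(1, 1, side, side), size=image.size[::-1],
                         mode="bilinear", align_corners=False)[0, 0]
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
    return heat.numpy()

def apply_visual_prompt(image: Image.Image, query: str, alpha: float = 0.5) -> Image.Image:
    """Dim low-relevance regions so the VLM focuses on the highlighted areas."""
    image = image.convert("RGB")
    heat = query_heatmap(image, query)[..., None]                       # (H, W, 1)
    arr = np.asarray(image).astype(np.float32)
    prompted = arr * (alpha + (1.0 - alpha) * heat)                     # relevant regions stay bright
    return Image.fromarray(prompted.clip(0, 255).astype(np.uint8))

# Token-level fusion: the prompted images and the attribute-augmented question go to
# the VLM together. `vlm_answer` is a hypothetical multi-image VLM call, not a real API.
# prompted = [apply_visual_prompt(img, question) for img in images]
# answer = vlm_answer(images=prompted, prompt=attribute_aware_question(question, images))
```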
Related Work
ATPrompt introduces an attribute-embedded textual prompt learning method for vision-language models (VLMs). By embedding multiple fixed universal attribute tokens into learnable soft prompts, ATPrompt expands the learning space from a one-dimensional category level to a multi-dimensional attribute level. This approach enhances the alignment between images and unknown categories, improving the model’s generalization capabilities.
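
A minimal PyTorch schematic of that idea (not ATPrompt's released code; the block sizes and initialization are illustrative): frozen word embeddings for universal attributes are interleaved with learnable soft context vectors, and the resulting sequence replaces the hand-crafted prompt in front of a CLIP-style text encoder. Only the context vectors are optimized.

```python
import torch
import torch.nn as nn

class AttributeSoftPrompt(nn.Module):
    """Learnable context vectors interleaved with frozen attribute-word embeddings."""
    def __init__(self, embed_dim: int, attribute_embeds: torch.Tensor, n_ctx: int = 4):
        super().__init__()
        n_blocks = attribute_embeds.shape[0] + 1           # one context block per attribute + class
        self.ctx = nn.Parameter(torch.randn(n_blocks, n_ctx, embed_dim) * 0.02)
        self.register_buffer("attr", attribute_embeds)     # (n_attr, embed_dim), frozen

    def forward(self, class_embed: torch.Tensor) -> torch.Tensor:
        """Return [ctx_1, attr_1, ..., ctx_k, attr_k, ctx_cls, class] as a (seq_len, dim) tensor."""
        pieces = []
        for i in range(self.attr.shape[0]):
            pieces.append(self.ctx[i])                      # learnable context block
            pieces.append(self.attr[i].unsqueeze(0))        # fixed universal attribute token
        pieces.append(self.ctx[-1])                         # context block before the class name
        pieces.append(class_embed.unsqueeze(0))             # class-name embedding
        return torch.cat(pieces, dim=0)
```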

API (Attention Prompting on Image) proposes a novel prompting technique that overlays a text-query-guided attention heatmap onto the original input image. This method enhances the performance of large vision-language models by directing attention to regions of the image relevant to the question, effectively improving visual grounding and spatial attribute reasoning.
Comparison with Existing Methods
| Method | LLM-Based | Attribute-Centric | Multi-Image | Training-Free |
|---|---|---|---|---|
| ATPrompt | ❌ | ✅ | ❌ | ❌ |
| API | ✅ | ❌ | ❌ | ✅ |
| Ours | ✅ | ✅ | ✅ | ✅ |
Our method is the only one of the three that combines multi-image reasoning, attribute-centric prompting, and training-free operation.
Citation
- Li, Z., Song, Y., Zhao, P., Cheng, M.-M., Li, X., & Yang, J. (2024). ATPrompt: Textual Prompt Learning with Embedded Attributes. arXiv preprint arXiv:2412.09442.
- Yu, R., Yu, W., & Wang, X. (2024). Attention Prompting on Image for Large Vision-Language Models. In European Conference on Computer Vision (pp. 251–268). Cham: Springer Nature Switzerland.
