Methods

Overview

Our method focuses on reducing hallucination in Vision-Language Models (VLMs) when processing multi-image visual question answering (VQA) tasks. We propose a novel technique called Attribute Visual Prompting, which enhances cross-image reasoning by linking consistent visual cues through attribute-level analysis.

Model Architecture Overview

Technical Components

  • Dynamic attribute pool: A CLIP-based textual prompt learner that retrieves attribute descriptors for each image, providing a pool of candidate attribute tokens originally optimized for image classification.
  • Attribute-aware text prompt: The retrieved attribute tokens are inserted into the question to steer the LLM toward verifiable details (see the sketch after this list).
  • Attribute-aware visual prompt: Attention-based visual prompting highlights the regions of each image that are relevant to the text query, helping large VLMs ground their answers across multiple images.
  • Token-level fusion: Visual, attribute, and question tokens are fed to the model together, grounding the answer and reducing hallucination.
  • Zero-shot and training-free: The framework requires no retraining on specific datasets and adapts to a range of multi-image VQA benchmarks.
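
As an illustration of the first two components, the sketch below scores a small, hand-written pool of candidate attribute phrases against each image with an off-the-shelf CLIP checkpoint, keeps the top-k matches, and splices them into the question. The attribute pool and helper names (`retrieve_attributes`, `build_attribute_prompt`) are illustrative placeholders, not our released implementation.

```python
# Minimal sketch: score a pool of candidate attribute phrases against each image
# with CLIP, keep the top-k per image, and splice them into the VQA question.
# The attribute pool and helper names here are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

ATTRIBUTE_POOL = [
    "red color", "metallic surface", "striped pattern",
    "wooden texture", "round shape", "transparent material",
]

def retrieve_attributes(image: Image.Image, top_k: int = 3) -> list[str]:
    """Return the top-k attribute phrases most similar to the image."""
    inputs = processor(text=ATTRIBUTE_POOL, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # (num_attributes,)
    top = logits.topk(top_k).indices.tolist()
    return [ATTRIBUTE_POOL[i] for i in top]

def build_attribute_prompt(question: str, per_image_attrs: list[list[str]]) -> str:
    """Prepend retrieved attributes so the VLM is steered toward verifiable cues."""
    lines = [f"Image {i + 1} attributes: {', '.join(a)}"
             for i, a in enumerate(per_image_attrs)]
    return "\n".join(lines) + f"\nQuestion: {question}"

images = [Image.open(p) for p in ["img_a.jpg", "img_b.jpg"]]
attrs = [retrieve_attributes(img) for img in images]
prompt = build_attribute_prompt("Which object appears in both images?", attrs)
print(prompt)
```

In the zero-shot setting, the resulting attribute-augmented question, together with the visually prompted images, is what gets passed to the downstream VLM.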

Related Work

ATPrompt introduces an attribute-embedded textual prompt learning method for vision-language models (VLMs). By embedding multiple fixed universal attribute tokens into learnable soft prompts, ATPrompt expands the learning space from a one-dimensional category level to a multi-dimensional attribute level. This approach enhances the alignment between images and unknown categories, improving the model’s generalization capabilities.
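
For intuition, the toy module below mirrors this idea in simplified form (it is not ATPrompt's code): fixed attribute word embeddings are interleaved with learnable soft tokens and the class-name embedding before being passed to a CLIP-style text encoder, so the learned prompt lives at the attribute level rather than the category level alone.

```python
# Conceptual sketch of an attribute-embedded soft prompt (not ATPrompt's code):
# learnable soft tokens are interleaved with fixed, universal attribute tokens
# and the class-name embedding before being fed to a CLIP-style text encoder.
import torch
import torch.nn as nn

class AttributeEmbeddedPrompt(nn.Module):
    def __init__(self, embed_dim: int, n_soft: int, attribute_embeds: torch.Tensor):
        super().__init__()
        # Learnable context tokens (optimized during prompt learning).
        self.soft_tokens = nn.Parameter(torch.randn(n_soft, embed_dim) * 0.02)
        # Fixed attribute word embeddings (e.g., "color", "shape"); not trained.
        self.register_buffer("attribute_tokens", attribute_embeds)

    def forward(self, class_embed: torch.Tensor) -> torch.Tensor:
        # Sequence: [soft tokens] [attribute tokens] [class-name token]
        return torch.cat([self.soft_tokens, self.attribute_tokens,
                          class_embed.unsqueeze(0)], dim=0)

# Example with toy embeddings standing in for real token embeddings.
dim = 512
attr = torch.randn(2, dim)          # two fixed attribute tokens
prompt = AttributeEmbeddedPrompt(dim, n_soft=4, attribute_embeds=attr)
seq = prompt(torch.randn(dim))      # (4 + 2 + 1, 512) prompt embedding sequence
print(seq.shape)
```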

API (Attention Prompting on Image) proposes a novel prompting technique that overlays a text-query-guided attention heatmap onto the original input image. This method enhances the performance of large vision-language models by directing attention to regions of the image relevant to the question, effectively improving visual grounding and spatial attribute reasoning.
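
A simplified version of this overlay step is sketched below. The heatmap here is a random placeholder standing in for a query-conditioned attention map (which API derives from an auxiliary model); only the blending logic is shown.

```python
# Simplified sketch of overlaying a text-query-guided attention heatmap on an
# image. The heatmap is a random placeholder; the actual API method derives it
# from a model's attention to the question.
import numpy as np
from PIL import Image

def overlay_heatmap(image: Image.Image, heatmap: np.ndarray, alpha: float = 0.5) -> Image.Image:
    """Blend a [0, 1] heatmap onto the image so salient regions stay bright."""
    heat = Image.fromarray((heatmap * 255).astype(np.uint8)).resize(image.size)
    heat = np.asarray(heat, dtype=np.float32)[..., None] / 255.0  # (H, W, 1)
    img = np.asarray(image.convert("RGB"), dtype=np.float32)
    # Darken low-attention regions; keep high-attention regions close to original.
    blended = img * (alpha + (1.0 - alpha) * heat)
    return Image.fromarray(blended.astype(np.uint8))

image = Image.open("img_a.jpg")
fake_attention = np.random.rand(16, 16)      # placeholder query-guided map
prompted = overlay_heatmap(image, fake_attention)
prompted.save("img_a_prompted.jpg")
```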

Comparison with Existing Methods

| Method   | LLM-Based | Attribute-Centric | Multi-Image | Training-Free |
|----------|-----------|-------------------|-------------|---------------|
| ATPrompt | ✗         | ✓                 | ✗           | ✗             |
| API      | ✓         | ✗                 | ✗           | ✓             |
| Ours     | ✓         | ✓                 | ✓           | ✓             |

Among these methods, ours is the only one that combines multi-image reasoning, attribute-centric prompting, and a training-free pipeline.


Citation
Li, Z., Song, Y., Zhao, P., Cheng, M. M., Li, X., & Yang, J. (2024). ATPrompt: Textual Prompt Learning with Embedded Attributes. arXiv preprint arXiv:2412.09442.
Yu, R., Yu, W., & Wang, X. (2024, September). Attention prompting on image for large vision-language models. In European Conference on Computer Vision (pp. 251-268). Cham: Springer Nature Switzerland.