Methods

Overview

Our method focuses on reducing hallucination in Vision-Language Models (VLMs) when processing multi-image visual question answering (VQA) tasks. We propose two novel techniques: Attribute Visual Prompting and Contrastive-learning-based Fine-tuning.

Attribute Visual Prompting

Pipeline Overview

Our method adopts a divide-and-conquer fusion strategy to guide VLMs toward more reliable and discriminative reasoning. The pipeline consists of three key stages:

  • Attribute Discovery
    • Identify important visual attributes that are:
      • shared across images (commonality), and
      • highly discriminative for distinguishing similar objects.
    • These attributes serve as semantic anchors for comparison.
    • Example: For shoes, attributes such as logo, sole, and laces are critical for differentiation.
  • Attribute-Level Prompting
    • Construct attribute-specific prompts to explicitly highlight the relevant image regions.
      • Each prompt directs the model’s attention to a single attribute, reducing interference from irrelevant visual cues.
      • This step decomposes a complex comparison into multiple focused sub-tasks.
  • Prompt Aggregation & VQA Formatting
    • Aggregate the attribute-level prompts into a unified representation.
    • Convert the aggregated prompts into a standardized VQA-style input.
    • The final formatted input is fed into the VLM for consistent and comparable reasoning across samples.
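
The three stages can be summarized as a small driver loop. The sketch below is illustrative only: the three callables (discover_attributes, build_attribute_prompt, aggregate_prompts) are hypothetical placeholders for the components described above, not our actual implementation.

```python
# Illustrative sketch of the divide-and-conquer pipeline; the three callables are
# hypothetical placeholders for the stages listed above.
def answer_multi_image_question(images, question, vlm,
                                discover_attributes, build_attribute_prompt, aggregate_prompts):
    # Stage 1: Attribute Discovery -- shared, discriminative attributes from a reference image.
    attributes = discover_attributes(images[0])               # e.g. ["logo", "sole", "laces"]

    # Stage 2: Attribute-Level Prompting -- one focused sub-prompt per attribute.
    attribute_prompts = [build_attribute_prompt(images, attr) for attr in attributes]

    # Stage 3: Prompt Aggregation & VQA Formatting -- merge sub-prompts into one VQA-style input.
    vqa_input = aggregate_prompts(attribute_prompts, question)
    return vlm.generate(vqa_input)
```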

Detailed Walkthrough

Our prompt pipeline combines a visual prompting module and a textual prompting module, which can be used independently or together.

Given a sequence of images of a known object, we first extract a compact attribute pool from a reference image, capturing the most important and discriminative visual cues.
For each attribute, we generate an attribute-aware visual prompt using attention maps from a pre-trained LLaVA-1.5 model, guided by simple attribute-specific questions (e.g., “What is the logo of this shoe?”). These attention maps are aggregated to highlight the key regions in each image.
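
A minimal sketch of this step, assuming a hypothetical get_image_attention hook that returns a per-patch attention map from the VLM (e.g., LLaVA-1.5) already resized to the image resolution:

```python
import numpy as np

def attribute_visual_prompt(image, attributes, get_image_attention):
    """Aggregate attention maps from attribute-specific questions into one highlight mask.

    `get_image_attention(image, question)` is a hypothetical hook returning a 2D
    attention map over the image, resized to the image resolution (H, W).
    `image` is an (H, W, 3) array.
    """
    maps = []
    for attr in attributes:
        question = f"What is the {attr} of this object?"
        attn = get_image_attention(image, question)                      # (H, W)
        attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)    # normalize to [0, 1]
        maps.append(attn)

    mask = np.mean(maps, axis=0)                                         # aggregated highlight map
    # Visual prompt: keep attended regions at full brightness, dim the rest.
    return (0.5 + 0.5 * mask[..., None]) * image
```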

In parallel, we construct an attribute-aware text prompt that explicitly injects the discovered attributes into the question.
Finally, visual tokens, attribute tokens, and question tokens are fused together as input to the vision-language model, encouraging grounded reasoning and reducing hallucination.
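
One simple way to realize the fusion is concatenation along the sequence dimension; the sketch below assumes all three token streams have already been projected into the language model's embedding space and is not a description of the exact implementation.

```python
import torch

def fuse_inputs(visual_tokens, attribute_tokens, question_tokens):
    """Concatenate visual, attribute, and question tokens into one input sequence.

    Each tensor is assumed to be (batch, seq_len_i, hidden_dim) and already
    projected into the language model's embedding space.
    """
    fused = torch.cat([visual_tokens, attribute_tokens, question_tokens], dim=1)
    attention_mask = torch.ones(fused.shape[:2], dtype=torch.long, device=fused.device)
    return fused, attention_mask
```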

Contrastive-learning-based Fine-tuning

Motivation

In essence, the LVLM compares the entities in each pair of images, classifying them as the same object or person when their features are similar and as different otherwise. This task naturally fits a contrastive learning paradigm, which pulls similar pairs closer together and pushes dissimilar pairs apart.

Dataset

We use MVImgNet, a large-scale dataset of multi-view images in which each video, essentially an image sequence, shows the same object from different camera angles. Any image pair sampled from one sequence can therefore be treated as a positive pair, meaning the two images depict the same object. The variation in camera position should benefit the robustness of the LVLM, since it encourages the model to focus on object features that remain invariant across viewpoint changes, and the dataset's rich annotations, such as object masks, help us prepare the training data.
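
As a concrete illustration, positive pairs can be drawn by sampling two frames from the same sequence; the directory layout assumed below (one folder of frames per sequence) is hypothetical and should be adapted to the released dataset structure. How negative pairs are built is described in the following subsections.

```python
import random
from pathlib import Path

def sample_positive_pairs(root, pairs_per_sequence=1, seed=0):
    """Sample positive pairs: two frames of the same MVImgNet sequence, i.e. the
    same object seen from two different camera angles.

    Assumes an illustrative layout of root/<class_id>/<sequence_id>/images/*.jpg.
    """
    rng = random.Random(seed)
    pairs = []
    for seq_dir in sorted(Path(root).glob("*/*")):
        frames = sorted(seq_dir.glob("images/*.jpg"))
        if len(frames) < 2:
            continue
        for _ in range(pairs_per_sequence):
            view_a, view_b = rng.sample(frames, 2)   # two different viewpoints of the same object
            pairs.append((view_a, view_b, 1))        # label 1 = same instance (positive pair)
    return pairs
```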

Statistics of MVImgNet:

  • ~42k positive/negative image pairs (~80GB)
  • 238 object classes with rich object-mask annotations

Training Data Preparation

Attribute-based Editing Setup:

  • Global & local attribute pools generated via GPT-4o
  • Nano-Banana performs image editing using randomly sampled global or local attributes
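
To make the editing step concrete, the sketch below samples one attribute from either pool and turns it into an edit instruction. The pool contents and the commented-out edit_image call are placeholders for the GPT-4o-generated pools and the Nano-Banana editor, respectively.

```python
import random

# Illustrative attribute pools; in practice these are generated per object class via GPT-4o.
GLOBAL_ATTRIBUTES = ["color", "material", "overall shape"]
LOCAL_ATTRIBUTES = ["logo", "sole", "laces"]

def make_edit_instruction(rng=random):
    """Sample a global or local attribute and build an edit prompt for the image editor."""
    if rng.random() < 0.5:
        attr = rng.choice(GLOBAL_ATTRIBUTES)
    else:
        attr = rng.choice(LOCAL_ATTRIBUTES)
    prompt = (f"Edit the {attr} of the main object so it looks clearly different, "
              "keeping everything else unchanged.")
    return attr, prompt

# Usage (the editor call is a placeholder for Nano-Banana):
# attr, prompt = make_edit_instruction()
# edited_image = edit_image(source_image, prompt)
```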

Grounding Ground Truth Setup:

  • Use Qwen3-VL-32B to localize edited regions
  • Prompt: “The <attribute> was edited by this prompt: <prompt>. Compare the images before and after editing and output the bounding box of the edited area in JSON format.”
  • Model outputs precise bounding boxes enclosing edited objects
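
A sketch of how the grounding prompt can be assembled and the model's JSON answer parsed; query_qwen_vl is a hypothetical wrapper around Qwen3-VL-32B inference, and the exact JSON schema of the response may differ.

```python
import json

GROUNDING_TEMPLATE = (
    "The {attribute} was edited by this prompt: {prompt}. "
    "Compare the images before and after editing and output the bounding box "
    "of the edited area in JSON format."
)

def get_edited_bbox(image_before, image_after, attribute, edit_prompt, query_qwen_vl):
    """Ask the grounding model for the edited region and parse its JSON answer.

    `query_qwen_vl(images, text)` is a hypothetical wrapper around Qwen3-VL-32B
    inference that returns the model's raw text response.
    """
    text = GROUNDING_TEMPLATE.format(attribute=attribute, prompt=edit_prompt)
    response = query_qwen_vl([image_before, image_after], text)
    bbox = json.loads(response)      # expected to contain the box, e.g. [x1, y1, x2, y2]
    return bbox
```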

Training Paradigm

Given a positive image pair and a negative pair, where a positive pair consists of two images from the same image sequence taken from different camera angles, we ask the question “Compare image 1 and image 2 and determine whether the main objects in the two images represent the same instance.” We expect the LVLM to output three pieces of information: a chain-of-thought reasoning process, the bounding box of the edited area, and a binary classification of positive vs. negative. Each output is supervised with the corresponding signal constructed during data preparation.
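
Each pair can then be serialized into a supervised sample holding the question and the three target outputs; the schema below is an illustrative sketch rather than the exact training format.

```python
import json

QUESTION = (
    "Compare image 1 and image 2 and determine whether the main objects in the "
    "two images represent the same instance."
)

def build_training_sample(image_1, image_2, reasoning, edited_bbox, is_same):
    """Pack one image pair into a supervised sample: chain-of-thought reasoning,
    the edited-area bounding box, and a binary same/different label.
    (Schema is illustrative, not the exact training format.)"""
    target = {
        "reasoning": reasoning,              # chain-of-thought supervision
        "edited_bbox": edited_bbox,          # None for unedited positive pairs
        "same_instance": bool(is_same),      # binary classification label
    }
    return {
        "images": [image_1, image_2],
        "question": QUESTION,
        "answer": json.dumps(target),
    }
```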


Citation
Li, Z., Song, Y., Zhao, P., Cheng, M.-M., Li, X., & Yang, J. (2024). ATPrompt: Textual Prompt Learning with Embedded Attributes. arXiv preprint arXiv:2412.09442.

Yu, R., Yu, W., & Wang, X. (2024). Attention Prompting on Image for Large Vision-Language Models. In European Conference on Computer Vision (ECCV) (pp. 251-268). Springer Nature Switzerland.

Yu, X., et al. (2023). MVImgNet: A Large-Scale Dataset of Multi-View Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Bai, S., et al. (2025). Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631.