Experiment and Results

We evaluate two complementary strategies for improving multi-image visual linking and reducing hallucinations in large vision–language models (VLMs): Attribute Visual Prompting (inference-time grounding) and Contrastive-Learning-based Fine-tuning (representation-level alignment). Experiments are conducted on VLM2-Bench, using both coarse task accuracy and fine-grained attribute-level metrics designed to explicitly capture hallucination behaviors.

Attribute Visual Prompting

Experimental Setup

Attribute Visual Prompting aims to improve visual linking by explicitly decomposing objects into verifiable visual attributes (e.g., color, text, number, logo, sole). For each multi-image sequence, we:

  1. Identify salient attributes relevant for distinguishing instances.
  2. Sub-group images at the attribute level, prompting the model to reason about consistency across subsets.
  3. Aggregate attribute-level judgments to produce the final answer.
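A minimal sketch of this three-step pipeline is given below. The `query_vlm` callable stands in for any multi-image VLM interface (images plus prompt in, text out) and is an assumed placeholder rather than a specific API; the prompt wording is illustrative only.

```python
# Minimal sketch of the three-step attribute-prompting pipeline.
# `query_vlm` is an assumed interface (image paths, prompt) -> model answer,
# not a specific library call.
from collections import Counter
from typing import Callable, List

QueryFn = Callable[[List[str], str], str]


def attribute_visual_prompting(images: List[str], question: str, query_vlm: QueryFn) -> str:
    # Step 1: identify salient, instance-distinguishing attributes.
    attr_prompt = (
        "List the visual attributes (e.g., color, text, number, logo) that help answer: "
        + question
    )
    attributes = [a.strip() for a in query_vlm(images, attr_prompt).split(",") if a.strip()]

    # Step 2: prompt for attribute-level consistency across the image subset.
    judgments = []
    for attr in attributes:
        consistency_prompt = (
            f"Focusing only on the '{attr}' attribute, do these images show the same instance? "
            "Answer yes or no."
        )
        judgments.append(query_vlm(images, consistency_prompt).strip().lower())

    # Step 3: aggregate attribute-level judgments by majority vote.
    if not judgments:
        return "unknown"
    return Counter(judgments).most_common(1)[0][0]
```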

We evaluate three prompting variants:

  • Textual Attribute Prompting only
  • Visual Attribute Prompting only
  • Joint Attribute-aware Text + Visual Prompting

All variants are tested under identical model backbones and decoding settings.
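To make the distinction between the variants concrete, the sketch below shows one way the textual, visual, and joint prompts could be constructed. The attribute regions, box-style highlighting, and drawing parameters are illustrative assumptions, not the exact prompts used in our experiments.

```python
# Illustrative construction of the three prompting variants. Region coordinates
# and the box-style overlay are assumptions for this sketch; the experiments may
# use other visual prompt forms (e.g., heatmaps).
from typing import Dict, List, Tuple

from PIL import Image, ImageDraw

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def textual_prompt(question: str, attributes: List[str]) -> str:
    """Textual variant: inject attribute names into the language prompt only."""
    return f"{question} Pay attention to these attributes: {', '.join(attributes)}."


def visual_prompt(image_path: str, regions: Dict[str, Box]) -> Image.Image:
    """Visual variant: highlight attribute regions directly on the image."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for attr, box in regions.items():
        draw.rectangle(box, outline="red", width=4)
        draw.text((box[0], max(box[1] - 14, 0)), attr, fill="red")
    return img


def joint_prompt(question: str, image_path: str, regions: Dict[str, Box]):
    """Joint variant: combine the attribute-aware text with the marked image."""
    return textual_prompt(question, list(regions)), visual_prompt(image_path, regions)
```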

Quantitative Results

To evaluate the effectiveness of Attribute Visual Prompting, we conducted comprehensive experiments on the object-centric tasks of VLM2-Bench, which include comparison, counting, and grouping subtasks. These tasks explicitly require linking visual cues across multiple images and are therefore well suited for assessing grounding and hallucination behaviors.

We report performance using two complementary metrics. The first is the overall accuracy on the original VLM2-Bench VQA samples, reflecting end-task correctness. The second is a fine-grained attribute-level accuracy, computed using our proposed metric, which evaluates whether the model correctly reasons about specific visual attributes (e.g., color, text, number) across image subsets. This fine-grained metric is designed to capture hallucinations that coarse task accuracy alone may not expose.
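As a rough illustration, the attribute-level accuracy can be computed as the fraction of queried attributes judged correctly per sample, averaged over samples; the exact weighting in our metric may differ from this sketch.

```python
# Hedged sketch of the fine-grained attribute-level accuracy: per-sample fraction
# of correctly judged attributes, averaged over samples. The precise formulation
# of our metric may differ.
from typing import Dict, List


def attribute_level_accuracy(
    predictions: List[Dict[str, str]],  # per-sample: attribute -> predicted judgment
    references: List[Dict[str, str]],   # per-sample: attribute -> gold judgment
) -> float:
    per_sample_scores = []
    for pred, ref in zip(predictions, references):
        if not ref:
            continue
        correct = sum(pred.get(attr, "").lower() == gold.lower() for attr, gold in ref.items())
        per_sample_scores.append(correct / len(ref))
    return sum(per_sample_scores) / max(len(per_sample_scores), 1)
```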

For each evaluated model, we separately introduce textual attribute prompting and visual attribute prompting, allowing us to compare their relative effectiveness under identical settings. This controlled setup isolates the contribution of each prompting modality.

Overall, the results reveal that attribute-based prompting does not consistently improve performance on VLM2-Bench. Only approximately 25% of model–task combinations show improvements, while the remaining cases exhibit noticeable performance degradation. In many settings, the drop in accuracy is substantial. Moreover, textual prompting generally outperforms visual prompting, suggesting that injecting attribute information through language is less disruptive than modifying visual attention.

These findings indicate that naively adding attribute prompts does not reliably enhance multi-image reasoning and, in some cases, may interfere with the model’s existing visual representations.

Qualitative Analysis

To better understand the mixed quantitative results, we examine representative qualitative examples that illustrate both the strengths and limitations of Attribute Visual Prompting.

In a successful case, visual attribute prompting helps guide the model’s attention toward discriminative details. For a comparison question asking which images depict the same object, most models fail under textual prompting alone, with only one model producing the correct answer. When visual prompts explicitly highlight key attributes—such as title text and cover design—the model is better able to attend to relevant regions and correctly match images. This example demonstrates that visual attribute prompting can be effective when the highlighted cues are both accurate and discriminative.

However, we also observe failure cases where visual prompting degrades performance. In one such example involving cups with different appearances, textual prompting leads most models to focus on the artwork printed on the cup, incorrectly pairing images based on superficial similarity. Visual attribute prompting further amplifies this bias by obscuring other critical cues such as color and shape. As a result, even models that were previously correct under textual prompting fail once visual prompts are introduced. This suggests that poorly targeted visual prompts can dominate attention and suppress alternative evidence needed for correct reasoning.

Beyond individual examples, our analysis reveals three broader limitations of the current prompting approach:

  • Prompt-versus-noise trade-off. Visual prompts, such as attention heatmaps, can introduce substantial noise into image features (see the sketch after this list). Without fine-tuning, it remains unclear whether multimodal large language models can effectively reason when such noisy signals are injected at inference time.
  • Inaccurate attention heatmaps. Models such as LLaVA-1.5 often fail to consistently attend to the correct regions of interest when attribute-level prompts are introduced. This issue is particularly severe for fine-grained local attributes, suggesting that these models are not explicitly trained for stable attribute-level visual grounding across images.
  • Inaccurate or weakly discriminative attribute pools. Automatically selected attributes are sometimes insufficient for distinguishing instances. In addition, global attributes (e.g., overall shape or layout) and local attributes (e.g., logos or text) exhibit different roles in visual linking, yet current prompting strategies do not explicitly model this distinction.
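The first limitation can be made concrete with a small sketch of heatmap-style visual prompting: blending an attention map into the pixels necessarily perturbs the original image, so a stronger prompt also injects more noise. The overlay color and blending weight below are illustrative choices, not the exact settings used in our experiments.

```python
# Sketch of blending a heatmap-style visual prompt into an image, illustrating
# the prompt-versus-noise trade-off: larger `alpha` emphasizes the attribute
# region more strongly but also perturbs the original pixels more.
import numpy as np
from PIL import Image


def overlay_heatmap(image: Image.Image, heatmap: np.ndarray, alpha: float = 0.4) -> Image.Image:
    """Blend a [0, 1] attention heatmap of shape (H, W) into an RGB image."""
    img = np.asarray(image.convert("RGB"), dtype=np.float32)
    h = np.clip(heatmap, 0.0, 1.0)[..., None]   # (H, W, 1) blending weights
    red = np.zeros_like(img)
    red[..., 0] = 255.0                         # pure red overlay channel
    blended = (1.0 - alpha * h) * img + (alpha * h) * red
    return Image.fromarray(blended.astype(np.uint8))
```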

Contrastive-Learning-based Fine-tuning

Quantitative Results

We evaluate the effectiveness of our contrastive-learning-based fine-tuning paradigm using Qwen3-VL-8B and Qwen3-VL-8B-thinking as base models. Both models are fine-tuned on our constructed multi-view, attribute-aware dataset and evaluated on VLM2-Bench, covering general, object-centric, and person-centric visual linking tasks.
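For reference, the sketch below shows a symmetric InfoNCE-style objective of the kind our contrastive fine-tuning builds on: embeddings of different views of the same instance are treated as positives, and the other instances in the batch serve as negatives. The temperature and pairing scheme are simplifying assumptions rather than the exact training recipe.

```python
# Minimal sketch of a symmetric InfoNCE loss for cross-image (multi-view)
# alignment. Temperature and batch-level negative sampling are simplifying
# assumptions, not the exact fine-tuning configuration.
import torch
import torch.nn.functional as F


def multiview_info_nce(view_a: torch.Tensor, view_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """view_a, view_b: (batch, dim) embeddings of two views of the same instances."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature                     # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```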

For the non-thinking Qwen3-VL-8B, fine-tuning yields only marginal improvements across most subtasks, with overall accuracy remaining largely unchanged. This suggests that representation-level alignment alone is insufficient to substantially improve performance when the model lacks an explicit mechanism for structured reasoning over visual cues.

In contrast, the thinking-enabled variant exhibits a clear and consistent performance gain after fine-tuning. Qwen3-VL-8B-thinking-SFT achieves higher accuracy across multiple subtasks, particularly in comparison and grouping, leading to a notable increase in overall performance. This improvement indicates that contrastive fine-tuning is most effective when coupled with an explicit reasoning process that can leverage the learned visual correspondences.

These results highlight two key insights. First, the proposed contrastive training paradigm successfully strengthens cross-image visual alignment, but its benefits are realized primarily when the model is capable of structured, multi-step reasoning. Second, the results underscore the importance of reasoning-aware training and inference mechanisms in multi-image understanding tasks. Motivated by these findings, future work will explore reinforcement learning–based approaches to further guide and stabilize the reasoning process during both training and inference.