
Driven by this goal, we build upon VLM²-Bench (https://vlm2-bench.github.io/), a benchmark developed in prior work by one of our authors, Dongyu Yao. VLM²-Bench is specifically designed to evaluate whether Vision-Language Models can visually link matching cues across multiple images and videos.
To comprehensively assess this ability, the benchmark covers three major categories of visual cues:
- General Cues (GC)
- Object-centric Cues (OC)
- Person-centric Cues (PC)
It includes 9 sub-tasks, featuring both multi-image sequences and video-based scenarios, and comprises a total of 3,060 VQA test cases, providing a thorough examination of VLMs’ core visual linking capabilities.

To contextualize model performance, the benchmark introduces two baselines:
- Chance-Level, representing random guessing
- Human-Level, reflecting natural visual linking ability
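As a rough illustration of what a chance-level baseline means, the sketch below computes the expected accuracy of uniform random guessing and checks whether a model score falls below it. It assumes closed-ended questions with a known number of answer options. The function names, the option counts, and the sample accuracies are hypothetical and are not taken from the benchmark, whose exact baseline computation is not described here.

```python
from fractions import Fraction


def chance_level_accuracy(option_counts):
    """Expected accuracy of uniform random guessing.

    option_counts: the number of answer options for each question
    (e.g. 2 for True/False, 4 for a four-way multiple choice).
    """
    if not option_counts:
        raise ValueError("need at least one question")
    # A uniform guess on a k-option question is correct with probability 1/k;
    # the baseline is the mean of these probabilities over all questions.
    total = sum(Fraction(1, k) for k in option_counts)
    return float(total / len(option_counts))


def is_below_chance(model_accuracy, chance_accuracy):
    """True if the model scores strictly worse than random guessing."""
    return model_accuracy < chance_accuracy


# Hypothetical mix: two True/False questions and two four-way questions.
option_counts = [2, 2, 4, 4]
chance = chance_level_accuracy(option_counts)

# Hypothetical model accuracies, for illustration only.
print(chance, is_below_chance(0.30, chance), is_below_chance(0.50, chance))
```

Comparing each score against a per-task chance level, rather than a single global number, matters when tasks mix question formats with different numbers of options.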
As shown in Table 1, humans find VLM²-Bench tasks relatively easy. However, most state-of-the-art models not only fall far short of human performance but in many cases perform worse than random guessing. The gap is especially pronounced in the VID task, which requires tracking and describing people across video frames. Models frequently mistake different individuals for the same person or fail to recognize individuals who reappear.
Interestingly, models perform better on Person-centric Cues (PC) than on Object-centric Cues (OC). We hypothesize that this is due to the textual anchors available in PC tasks, such as proper names, which offer strong and consistent visual associations. In contrast, OC tasks often involve generic category labels (e.g., “bag”, “bottle”), which offer weaker anchoring and make fine-grained object linking much harder.
🔍 Key Findings from VLM²-Bench
- Language aids vision—but not always enough
 While language-based reasoning can help models make logical connections, it is not sufficient for fine-grained visual matching without strong visual grounding.
- Visual prompting needs a stronger vision-side ability
 The effectiveness of visual prompts hinges on whether models can truly understand both the visual content and the prompt, rather than relying only on language cues.
- Person vs. Object: Not all cues are equal
 Visual prompting performs better on object-centric cues than on person-centric ones. This suggests that, for person-centric cues, current models may rely more on textual anchors (e.g., names) than on purely visual identity.
