
Before we dive into the project, take a look at the figure above and answer this question: Who are these people? Some faces may feel familiar, but naming them all is hard. Now forget about names: Are they even the same individual? Most people would confidently say: No.

Now let’s move on to the second quiz in this series, this time featuring shoes: How many distinct pairs of shoes do you see?
You might not recognize the exact Nike models, but by comparing subtle cues, such as the number of logos, you’ll notice:
Image 2 stands out.
It’s a different pair.
This is what humans do effortlessly: visually linking matching cues across images. This ability requires no prior knowledge or labels, just pure visual reasoning.
As shown in the quizzes above, we can often identify whether two faces belong to the same person without knowing their names. Similarly, we can compare a photo of our favorite shoes to those on the shelf in a Footlocker store and quickly tell if they’re the same model.
What enables this is a simple yet powerful skill:
linking and matching visual cues across different images to judge identity and similarity.
Recently, Vision-Language Models (VLMs) built on Large Language Models have demonstrated impressive world knowledge and expanded their capabilities from understanding single images to processing multiple images and even videos.
However, increasing the number of images or context length alone does not ensure that these models can perform visual linking—the ability to identify matching visual cues across different inputs.
This remains an essential yet underexplored skill in multimodal reasoning, and one that current foundation models often struggle to master.
