Authors: Dongyu Yao, Keling Yao, Harry Chi; Advisors: Prof. Katia Sycara and Dr. Yaqi Xie
Multi-image Hallucination
Hallucination occurs when an LLM or LVLM generates content that is inconsistent with the facts, where the facts come from the provided images, the language prompt, or existing knowledge bases. For instance, when we ask an LVLM such as InternVL-2.5-8B to describe this jersey in detail, it gives an accurate response that covers the color, number, and text.
Figure 1: Single-image VQA of InternVL-2.5-8B. Text prompt: “Describe this image in detail”. Response: The shirt is a purple basketball jersey with the number 23 prominently displayed in yellow with a purple outline on the front. The word “LAKESHOW” is printed above the number in yellow capital letters.
However, suppose we place this image in an image sequence and ask the model which of the input images show the same shirt. Although InternVL-2.5 correctly describes the first and third jerseys, it misperceives the second jersey as white with the text “Golden State”, and therefore incorrectly concludes that the second and third images show the same shirt. Multi-image tasks emphasize perceiving and linking visual cues across images, so this kind of misperceived detail during cross-image comparison is a form of hallucination.
Figure 2: Text prompt: Which images show the same shirt in the input images? Choices: A) 1 and 2; B) 2 and 3; C) None; D) 1 and 3. Response: To determine which images show the same shirt, I need to compare the jerseys in the images. The first image shows a purple “LAKE SHOW” jersey with the number 23. The second image shows a white “Golden State” jersey with the number 23. The third image shows a white “Golden State” jersey with the number 23. Since the second and third images both show the “Golden State” jersey, the correct answer is B) 2 and 3. MY ANSWER: B) 2 and 3.
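To score multiple-choice responses like the one in Figure 2 automatically, the selected option has to be pulled out of the free-form reasoning. Below is a minimal sketch of one way to do that. It assumes the response contains a final marker of the form “MY ANSWER: B) ...”, as in the transcript above. The names `extract_choice` and `is_correct` are hypothetical helpers for illustration, not part of InternVL or any benchmark toolkit, and the gold label must come from the dataset annotations.

```python
import re
from typing import Optional

# Assumption: the model marks its final choice as "MY ANSWER: <letter>) ...",
# as in the Figure 2 transcript. Other response formats are not handled.
ANSWER_PATTERN = re.compile(r"MY ANSWER:\s*\(?([A-D])\)", re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the last marked option letter (A-D), or None if no marker is found."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].upper() if matches else None


def is_correct(response: str, gold_choice: str) -> bool:
    """Compare the extracted option letter with the annotated gold letter."""
    return extract_choice(response) == gold_choice.upper()


if __name__ == "__main__":
    response = (
        "Since the second and third images both show the \"Golden State\" jersey, "
        "the correct answer is B) 2 and 3. MY ANSWER: B) 2 and 3."
    )
    print(extract_choice(response))
```

Matching only the explicit marker, rather than any “B)” in the text, avoids picking up option letters that the model mentions while reasoning about the choices.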