Evaluation Metrics

The original VLM2-Bench, however, evaluates LVLMs only with high-level questions such as multiple-choice, yes/no, and counting. These questions are too coarse to assess whether the model truly perceives the visual cues and links them properly, so we need a more fine-grained inspection of the images to evaluate hallucination.

A widely used method for hallucination evaluation is POPE, Polling-based Object Probing Evaluation (Li et al., EMNLP 2023), which was proposed to evaluate object-level hallucination on single images. It obtains the ground-truth objects in an image via an off-the-shelf segmentation network plus human annotation, and samples non-existent objects via negative sampling from a pre-defined object pool. From these existent and non-existent objects, a set of yes/no questions is generated for each object. This makes it straightforward to check whether an LVLM truly understands the image content, because the binary answers remove the need for complicated semantic parsing of open-ended responses.

Figure 1: The POPE pipeline for constructing hallucination probes, targeting single-image object-level hallucination of LVLMs.
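To make the probing procedure concrete, here is a minimal sketch of a POPE-style probe builder. It assumes the ground-truth objects and the negative-sampling pool are already extracted; the function name and question template are illustrative, not taken from the POPE codebase.

```python
import random

def build_pope_probes(gt_objects, object_pool, n_negatives=3, seed=0):
    """Build POPE-style yes/no probes for one image.

    gt_objects:  objects present in the image (segmentation + human annotation)
    object_pool: pre-defined pool used for negative sampling
    """
    rng = random.Random(seed)
    # Positive probes: one yes-question per ground-truth object.
    probes = [{"question": f"Is there a {obj} in the image?", "answer": "yes"}
              for obj in gt_objects]
    # Negative sampling: draw pool objects that are absent from the image.
    absent = [obj for obj in object_pool if obj not in set(gt_objects)]
    for obj in rng.sample(absent, min(n_negatives, len(absent))):
        probes.append({"question": f"Is there a {obj} in the image?", "answer": "no"})
    return probes

# Example
print(build_pope_probes(["dog", "frisbee"], ["dog", "frisbee", "car", "cat", "boat"]))
```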

Inspired by POPE, we propose our metric on the VLM2-Bench dataset. Given an image sequence, we first manually extract the key attributes that distinguish the images, such as color, text, and number, and then form yes/no questions following a template; the ground-truth answers are annotated by humans. After this step, each image sequence has a fine-grained question set Q.

Figure 2: Pipeline for constructing fine-grained yes/no questions on VLM2-Bench.
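A minimal sketch of this question-construction step is below. The attribute templates and annotation fields are illustrative assumptions, not the exact ones used for VLM2-Bench.

```python
# Hypothetical templates, one per key attribute type.
TEMPLATES = {
    "color":  "Is the {object} in image {idx} {value}?",
    "text":   "Does image {idx} contain the text '{value}'?",
    "number": "Are there {value} {object}s in image {idx}?",
}

def build_fine_grained_questions(annotations):
    """annotations: human-annotated dicts (attribute/object/idx/value/answer)
    for one image sequence; returns its fine-grained question set Q."""
    questions = []
    for ann in annotations:
        q = TEMPLATES[ann["attribute"]].format(**ann)
        questions.append({"question": q, "answer": ann["answer"]})  # gold yes/no label
    return questions

# Example: a two-question set Q for one sequence
anns = [
    {"attribute": "color", "object": "shirt", "idx": 1, "value": "red",  "answer": "yes"},
    {"attribute": "text",  "object": None,    "idx": 2, "value": "SALE", "answer": "no"},
]
print(build_fine_grained_questions(anns))
```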

When evaluating an LVLM, we compute a hallucination score s for each image sequence; the overall score of a whole task, such as counting or multiple-choice, is the sum of these per-sequence scores. For each sequence, we first ask the coarse-grained question, e.g., “How many distinct shirts are there in these images?”. If the answer is incorrect, the hallucination score of this sequence is 0, because the model cannot even answer the high-level question correctly. Otherwise, we pose each fine-grained question in turn and take the overall accuracy as the hallucination score.

Figure 3: Pipeline for calculating the hallucination score of each image sequence.
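The scoring pipeline in Figure 3 can be sketched as follows; `model(images, question)` is a hypothetical stand-in for any LVLM inference interface, and the answer normalization is simplified.

```python
def sequence_score(model, seq, coarse_q, coarse_gold, fine_questions):
    """Hallucination score s for one image sequence.

    Gate on the coarse-grained question first; if the model fails it, s = 0.
    Otherwise s is the accuracy over the fine-grained yes/no set Q.
    """
    if model(seq, coarse_q).strip().lower() != coarse_gold.lower():
        return 0.0
    correct = sum(
        model(seq, q["question"]).strip().lower() == q["answer"]
        for q in fine_questions
    )
    return correct / len(fine_questions)

def task_score(model, task_items):
    # Overall task score: the sum of per-sequence scores s, as described above.
    return sum(sequence_score(model, *item) for item in task_items)

# Example with a dummy "model" that always answers "yes":
dummy = lambda images, question: "yes"
item = (["img1.jpg", "img2.jpg"],
        "How many distinct shirts are there in these images?", "2",
        [{"question": "Is the shirt in image 1 red?", "answer": "yes"}])
print(sequence_score(dummy, *item))  # 0.0: fails the coarse gate ("yes" != "2")
```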