Evaluating the correlation of automated metrics with human ratings on GenAI-Bench

We report pairwise accuracy, Pearson, and Kendall scores; higher values indicate better performance. VQAScore, using the CLIP-FlanT5 VQA model, achieves the strongest agreement with human ratings on images and videos, significantly surpassing metrics such as CLIPScore, PickScore, and Davidsonian.
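
As a rough illustration of how such agreement scores can be computed, here is a minimal sketch in plain Python. It is not the benchmark's evaluation code: the function names are hypothetical, the Kendall variant shown is tau-a, the pairwise accuracy is a simplified version that skips pairs tied in the human ratings, and the numbers are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def _sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def pearson(xs, ys):
    """Pearson linear correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(xs)), 2))
    total = sum(_sign(xs[i] - xs[j]) * _sign(ys[i] - ys[j]) for i, j in pairs)
    return total / len(pairs)


def pairwise_accuracy(metric, human):
    """Fraction of item pairs that the metric orders the same way as humans.

    Simplifying assumption: pairs tied in the human ratings are skipped.
    """
    agree = considered = 0
    for i, j in combinations(range(len(human)), 2):
        h = _sign(human[i] - human[j])
        if h == 0:
            continue
        considered += 1
        agree += _sign(metric[i] - metric[j]) == h
    return agree / considered


# Made-up numbers for illustration only, not results from GenAI-Bench.
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_scores = [0.10, 0.30, 0.20, 0.70, 0.90]

print("pairwise accuracy:", pairwise_accuracy(metric_scores, human_ratings))
print("Pearson:", pearson(metric_scores, human_ratings))
print("Kendall tau-a:", kendall_tau_a(metric_scores, human_ratings))
```

In this toy example the metric swaps one pair relative to the human ordering, so all three scores fall slightly below their maximum of 1.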

Boosting Text-to-Visual Generation with VQAScore: A Comparative Analysis

We improve text-to-visual generation by scoring nine candidate images and selecting the highest-scoring one; performance improvements are marked in green and declines in red. Selecting the image with the highest VQAScore significantly increases human alignment ratings. In contrast, ranking by CLIPScore may yield the same or reduced performance. VQAScore is 2x to 3x more effective than methods such as PickScore, which requires expensive human feedback, or Davidsonian, which decomposes texts using ChatGPT. Table 3 shows performance improvements for various scoring methods across basic, advanced, and all prompts.
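
The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `rank_candidates`, `select_best`, and `toy_score` are hypothetical names, and the scorer is a stand-in lookup table where a real setup would call an alignment metric such as VQAScore or CLIPScore on each (prompt, image) pair.

```python
from typing import Callable, List, Sequence, TypeVar

Image = TypeVar("Image")
ScoreFn = Callable[[str, Image], float]


def rank_candidates(prompt: str, candidates: Sequence[Image],
                    score_fn: ScoreFn) -> List[Image]:
    """Return the candidates sorted from highest to lowest score."""
    return sorted(candidates, key=lambda img: score_fn(prompt, img), reverse=True)


def select_best(prompt: str, candidates: Sequence[Image],
                score_fn: ScoreFn) -> Image:
    """Return the single candidate with the highest score for the prompt."""
    if not candidates:
        raise ValueError("at least one candidate image is required")
    return max(candidates, key=lambda img: score_fn(prompt, img))


# Hypothetical stand-in scorer: image IDs mapped to made-up scores.
toy_scores = {"img_a": 0.41, "img_b": 0.87, "img_c": 0.63}


def toy_score(prompt: str, image: str) -> float:
    """Placeholder for a real text-image alignment metric."""
    return toy_scores[image]


prompt = "a red cube on top of a blue sphere"
best = select_best(prompt, list(toy_scores), toy_score)
ranking = rank_candidates(prompt, list(toy_scores), toy_score)
print(best, ranking)
```

Because the scorer is passed in as a function, the same ranking code can be reused to compare different metrics on the same set of candidates.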

Enhancing DALL-E 3 Image Generation with VQAScore

Ranking DALL-E 3 generated images with VQAScore and CLIPScore reveals that VQAScore surpasses CLIPScore, especially for prompts involving attributes, relationships, and higher-order reasoning. This demonstrates VQAScore’s potential to enhance text-to-image generation solely with an image generation API. We provide detailed performance gains for VQAScore and other metrics.

Improving DALL-E 3 and SDXL by image ranking

We present the average human ratings achieved by 7 popular scoring methods across basic, advanced, and all prompts on GenAI-Bench. Performance gains over the Random baseline (no ranking) are highlighted in green, while decreases are marked in red.

Alignment on Text-to-Video Models with Cinematic T2V Benchmark

We present an evaluation using our cinematic T2V benchmark, which incorporates camera comments that are often overlooked in current video captions. By supplying rich captions, we help T2V models generate visually consistent and realistic videos.
