About

Text-to-visual models, which now create realistic images and videos, face challenges with complex prompts involving attributes, relationships, and advanced reasoning. Our study on GenAI-Bench evaluates top image and video generators for these tasks. We compare automated evaluation metrics against our collected human ratings and find that VQAScore significantly outperforms previous metrics such as CLIPScore. In addition, VQAScore can improve generation in a black-box manner (without fine-tuning) by simply ranking a few (3 to 9) candidate images. It is 2-3 times more effective than methods such as PickScore and ImageReward in improving human ratings for models like DALL-E 3 and Stable Diffusion.
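To make the black-box reranking idea concrete, the sketch below picks the best of N candidate images by VQAScore. This is a minimal illustration, not the official implementation: `yes_prob_fn`, the question template, and the function names are our own assumptions, with `yes_prob_fn` standing in for a real VQA model's probability of answering "Yes".

```python
def vqascore(image, prompt, yes_prob_fn):
    """VQAScore sketch: probability that a VQA model answers 'Yes' when
    asked whether the image shows the prompt. `yes_prob_fn` is a
    placeholder for the real model's P('Yes' | image, question)."""
    question = f'Does this figure show "{prompt}"? Please answer yes or no.'
    return yes_prob_fn(image, question)

def rerank(candidates, prompt, yes_prob_fn):
    """Black-box best-of-N: return the candidate with the highest
    VQAScore; no fine-tuning of the generator is required."""
    return max(candidates, key=lambda img: vqascore(img, prompt, yes_prob_fn))
```

With a few (3 to 9) candidates per prompt, the highest-scoring image is the one shown to the user.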

Motivation

  • Text-to-visual models struggle to generate images for compositional text prompts.
  • Evaluating alignment robustly is challenging for both automatic metrics and human evaluation.
  • The automated evaluation metrics for alignment often function as a bag of words.
  • Text-to-video models struggle to understand camera aspects and 3D motions.

Goal

  • Providing a reliable metric, VQAScore, to evaluate complex prompt alignment without relying on expensive human feedback.
  • Building a benchmark, GenAI-Bench, to assess essential visio-linguistic compositional reasoning skills for text-to-visual generative models and vision-language alignment metrics.
  • Creating a high-quality text-video paired dataset, incorporating aspects like shot composition, camera movements, and lighting effects.

Overview

In brief, our work contributes:

  1. A Reliable Alignment Metric for Text-to-Visual Models
    • We introduced a yes/no question-based metric built on FlanT5 for robust text-to-visual alignment evaluation.
  2. A Visio-linguistic Compositional Benchmark for Text-to-Image Models
    • We constructed 1,600 compositional prompts and used VQAScore to rank generated outputs, boosting alignment as validated by 43,200 human ratings.
  3. A Cinematic Benchmark for Text-to-Video Models
    • We established a benchmark to evaluate visio-linguistic reasoning in text-to-video models, integrating shot composition, camera movements, and lighting effects for realistic outputs.
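To illustrate point 1: VQAScore is the probability mass the VQA model places on the "Yes" answer. The minimal sketch below computes that probability from raw answer logits with a softmax; the two-token vocabulary and the logit values are illustrative stand-ins, not the actual FlanT5 interface.

```python
import math

def yes_probability(logits, vocab):
    """Softmax over answer logits; VQAScore is the probability of 'Yes'.
    `logits` and `vocab` are illustrative stand-ins for a real VQA
    model's output distribution over answer tokens."""
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(vocab, exps)}["Yes"]
```

For example, with answer logits of 2.0 for "Yes" and 0.0 for "No", the score is about 0.88; equal logits give 0.5.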