METHOD

VQAScore

Given an image and text, we calculate the probability of a “Yes” answer to a simple question like “Does this figure show ‘{text}’? Please answer yes or no.”
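A minimal sketch of this computation, assuming the VQA model exposes next-token logits for the first answer token. The names `build_question` and `yes_probability`, and the `yes_index` parameter, are hypothetical and only illustrate the idea. The model call itself is left out, and straight quotes stand in for the typographic quotes in the template.

```python
import math
from typing import Sequence

# Question template from the text above (straight quotes used for portability).
QUESTION_TEMPLATE = "Does this figure show '{text}'? Please answer yes or no."


def build_question(text: str) -> str:
    """Fill the text prompt into the yes/no question template."""
    return QUESTION_TEMPLATE.format(text=text)


def yes_probability(logits: Sequence[float], yes_index: int) -> float:
    """Softmax probability of the token at `yes_index`.

    `logits` are assumed to be the model's scores for the first answer
    token, and `yes_index` the vocabulary position of "Yes".
    """
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[yes_index] / sum(exps)
```

In use, the question from `build_question` would be passed to the VQA model together with the image, and the resulting logits to `yes_probability`.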

VQA Model for Text-to-Visual Evaluation

Evaluate Text-to-Visual Generation with VQAScore

Enhance VQAScore on video by considering spatial and temporal relations.

GenAI-Benchmark

For robust text-to-visual generation evaluation, we collected 1,600 challenging real-world text prompts sourced from professional designers and a total of 38,400 human alignment ratings.

Towards a Better T2V Evaluation Metric with VQAScore and GenAI-Bench

  • Ranking a few candidate images with VQAScore and selecting the highest-scoring one.
  • Setting a benchmark for ranking by collecting 43,200 human ratings.
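The ranking step in the first bullet can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `score_fn` is a hypothetical callable standing in for a VQAScore model, and `rank_by_score` and `select_best` are names introduced here.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # any image handle: a file path, an array, etc.


def rank_by_score(
    candidates: Sequence[T],
    text: str,
    score_fn: Callable[[T, str], float],
) -> List[Tuple[float, T]]:
    """Return (score, candidate) pairs sorted from highest to lowest score."""
    scored = [(score_fn(candidate, text), candidate) for candidate in candidates]
    # Sort on the score only; the sort is stable, so ties keep input order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def select_best(
    candidates: Sequence[T],
    text: str,
    score_fn: Callable[[T, str], float],
) -> T:
    """Pick the candidate with the highest alignment score for `text`."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return rank_by_score(candidates, text, score_fn)[0][1]
```

With a real scorer, `candidates` would be the images generated for one prompt and `score_fn` would return the "Yes" probability for each image and prompt pair.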

Cinematic T2V Benchmark

Improved Alignment with Camera Components in Text-to-Video Models

  • Developing a benchmark to enhance the understanding of camera components.
  • Incorporating shot composition, camera movements, and lighting effects, enabling generative models to produce visually consistent and realistic videos.