VQAScore
Given an image and a text prompt, we compute the probability that a VQA model answers “Yes” to a simple question such as “Does this figure show ‘{text}’? Please answer yes or no.”
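The scoring step above can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes the VQA model's logits for the first answer token are already available, and it normalizes over only the candidate tokens passed in (a real model would normalize over its full vocabulary). The helper names `build_question` and `yes_probability` and the logits mapping are hypothetical.

```python
import math
from typing import Mapping

# Question template taken from the description above.
QUESTION_TEMPLATE = "Does this figure show '{text}'? Please answer yes or no."


def build_question(text: str) -> str:
    """Fill the text prompt into the yes/no question template."""
    return QUESTION_TEMPLATE.format(text=text)


def yes_probability(answer_logits: Mapping[str, float]) -> float:
    """Return P("Yes") via a softmax over the given answer-token logits.

    `answer_logits` is a hypothetical stand-in for what a VQA model would
    output for the first answer token; it must contain a "Yes" entry.
    """
    if "Yes" not in answer_logits:
        raise KeyError('answer_logits must contain a "Yes" entry')
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(answer_logits.values())
    exps = {tok: math.exp(logit - max_logit) for tok, logit in answer_logits.items()}
    return exps["Yes"] / sum(exps.values())


if __name__ == "__main__":
    print(build_question("a dog chasing a ball"))
    # Made-up logits, for illustration only.
    print(yes_probability({"Yes": 1.2, "No": -0.3}))
```

The returned probability serves as the alignment score: the higher it is, the more confident the model is that the image shows the text.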
VQA Model for Text-to-Visual Evaluation

Evaluate Text-to-Visual Generation with VQAScore

Enhance VQAScore on video by considering spatial and temporal relations.
GenAI-Benchmark
For robust text-to-visual generation evaluation, we collected 1,600 challenging real-world text prompts sourced from professional designers and a total of 38,400 human alignment ratings.
Towards a Better T2V Evaluation Metric with VQAScore and GenAIBench
- Ranking a few candidate images with VQAScore and selecting the highest-scoring one.
- Setting a benchmark for ranking by collecting 43,200 human ratings.
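The ranking step in the first bullet can be sketched as below. This is a hedged illustration under stated assumptions: `score_fn` is a hypothetical callable standing in for any alignment scorer such as VQAScore, and `rank_by_score` and `select_best` are illustrative names, not part of a released API.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # Candidate type, e.g. an image path or an image object.


def rank_by_score(
    candidates: Sequence[T],
    text: str,
    score_fn: Callable[[T, str], float],
) -> List[Tuple[T, float]]:
    """Score every candidate against `text` and sort from best to worst.

    `score_fn` is an assumed scorer that returns a higher value for
    better image-text alignment.
    """
    scored = [(candidate, score_fn(candidate, text)) for candidate in candidates]
    # sorted() is stable, so tied candidates keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def select_best(
    candidates: Sequence[T],
    text: str,
    score_fn: Callable[[T, str], float],
) -> T:
    """Return the highest-scoring candidate for the given text."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return rank_by_score(candidates, text, score_fn)[0][0]


if __name__ == "__main__":
    # Made-up scores keyed by hypothetical file names, for illustration only.
    fake_scores = {"a.png": 0.41, "b.png": 0.87, "c.png": 0.63}
    best = select_best(list(fake_scores), "a red cube on a table",
                       lambda image, _text: fake_scores[image])
    print(best)
```

Keeping the scorer as a parameter means the same selection logic works for images or videos, as long as the scorer returns comparable values.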
Cinematic T2V Benchmark
Improved Alignment with Camera Components in Text-to-Video Models
- Developing a benchmark to enhance the understanding of camera components.
- Incorporating shot composition, camera movements, and lighting effects, enabling generative models to produce visually consistent and realistic videos.
