1. Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation
· A pioneering text-to-image benchmark involving human preferences
· Limitations: excessive noise and lack of compositional information
2. T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
· Rich compositional information considered for text-to-image generation.
· VQA formulation employed for text-to-visual evaluation, but it generalizes poorly.
3. When and Why Vision-Language Models Behave Like Bags-Of-Words, and What to Do About It?
· Critical compositional knowledge, including “Order,” introduced for VLM evaluation
· Limitations: not informative enough, questions easily solved without visual information

4. TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
· Advanced faithfulness evaluation for text-to-visual generation with VQA models
· Question filtering makes it non-end-to-end and hard to use as a reward signal for fine-tuning generative models.
5. Towards a Better Metric for Text-to-Video Generation
· Transfer alignment evaluation from text-to-image to text-to-video generation.
· Accuracy calculation remains at post-decoder levels, neglecting the knowledge contained in logits.
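
The last point, scoring after decoding versus scoring from logits, can be illustrated with a small sketch. It is not taken from any of the works listed above. It assumes a VQA model that answers a yes/no question about an image and exposes the raw logits of its "yes" and "no" answer tokens. The function names and the logit values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_score(yes_logit, no_logit):
    """Post-decoder score: 1 if the decoded answer is "yes", else 0."""
    return 1.0 if yes_logit > no_logit else 0.0


def soft_score(yes_logit, no_logit):
    """Logit-level score: probability mass assigned to "yes"."""
    return softmax([yes_logit, no_logit])[0]


# Hypothetical (yes, no) logits for two generated images, same question.
confident = (4.0, -1.0)   # the model is nearly certain the answer is "yes"
borderline = (0.2, 0.1)   # the model barely prefers "yes"

for name, (y, n) in [("confident", confident), ("borderline", borderline)]:
    print(name, hard_score(y, n), round(soft_score(y, n), 3))
```

Both images receive the same post-decoder score, because the decoded answer is "yes" in each case, while the logit-level score separates the confident case from the borderline one. That difference is the information a decoded accuracy discards.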
