RELATED WORK - Alignment for Vision-Language Foundation Models

1. Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation

· A pioneering text-to-image benchmark involving human preferences

· Limitations: excessive noise and lack of compositional information

2. T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation

· Rich compositional information considered for text-to-image generation.

· VQA formulation employed for text-to-visual evaluation, yet it lacks generalization.

3. When and Why Vision-Language Models Behave Like Bags-Of-Words, and What to Do About It?

· Critical compositional knowledge, including “Order,” introduced for VLM evaluation

· Limitations: not informative enough; questions can easily be solved without visual information

4. TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering

· Advanced faithfulness evaluation for text-to-visual generation with VQA models

· Question filtering makes it non-end-to-end and hard to use as a reward signal for fine-tuning generative models.

5. Towards a Better Metric for Text-to-Video Generation

· Transfers alignment evaluation from text-to-image to text-to-video generation.

· Accuracy calculation remains at post-decoder levels, neglecting the knowledge contained in logits.
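
The gap between post-decoding accuracy and logit-level scoring noted in the last point can be illustrated with a minimal sketch. It is not taken from any of the works listed above: the function names and the logit values are hypothetical, and a real VQA model would supply logits over its full vocabulary rather than over two answers.

```python
import math

def softmax(logits):
    """Turn a dict of answer -> logit into answer -> probability."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def hard_score(logits, correct="yes"):
    """Post-decoding score: 1 if the argmax answer is correct, else 0."""
    decoded = max(logits, key=logits.get)
    return 1.0 if decoded == correct else 0.0

def soft_score(logits, correct="yes"):
    """Logit-level score: probability mass assigned to the correct answer."""
    return softmax(logits)[correct]

# Hypothetical logits for the question "Is the described object present?"
confident = {"yes": 4.0, "no": -1.0}   # model is clearly sure
borderline = {"yes": 0.1, "no": 0.0}   # model is nearly undecided

for name, logits in [("confident", confident), ("borderline", borderline)]:
    print(name, hard_score(logits), round(soft_score(logits), 3))
```

Both cases receive the same hard score, while the soft score separates the confident answer from the borderline one. This is the information a purely post-decoding accuracy discards.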