Alignment for Vision-Language Foundation Models

Students: Yixin Fei, Kewen Wu, Pengliang Ji | Advisors: Zhiqiu Lin, Deva Ramanan (Carnegie Mellon University, Robotics Institute)

RESOURCE

Fall’24

Physical Presentation Slides

Video Presentation Slides

Video Presentation

Poster

Spring’24

Physical Presentation Slides

Video Presentation Slides

Video Presentation

Poster

References

Hu, Yushi, et al. “TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

Kirstain, Yuval, et al. “Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation.” Advances in Neural Information Processing Systems 36 (2023).

Huang, Kaiyi, et al. “T2I-CompBench: A Comprehensive Benchmark for Open-World Compositional Text-to-Image Generation.” Advances in Neural Information Processing Systems 36 (2023): 78723-78747.

Yuksekgonul, Mert, et al. “When and Why Vision-Language Models Behave like Bags-of-Words, and What to Do About It?” The Eleventh International Conference on Learning Representations. 2023.

Wu, Jay Zhangjie, et al. “Towards a Better Metric for Text-to-Video Generation.” arXiv preprint arXiv:2401.07781 (2024).
