Dataset

EgoSchema

The EgoSchema benchmark [1] contains over 5,000 very long-form video language understanding questions spanning over 250 hours of real, diverse, and high-quality egocentric video data. Many videos feature complex scenes with cluttered rooms, where we can take full advantage of scene graphs.

EgoSchema dataset QA example
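
To make the benchmark format concrete, here is a minimal sketch of reading the EgoSchema questions in Python. The file name `questions.json` and the field names (`q_uid`, `question`, `option 0` through `option 4`) are assumptions based on the public release, so verify them against the official repository.

```python
import json

# Minimal sketch: load EgoSchema questions (file and field names are
# assumptions based on the public release; verify against the repo).
with open("questions.json") as f:
    questions = json.load(f)  # list of dicts, one entry per question

sample = questions[0]
print(sample["q_uid"])     # unique question id (assumed field name)
print(sample["question"])  # question about a ~3-minute egocentric clip
for i in range(5):         # five answer options per question
    print(f"  ({i}) {sample[f'option {i}']}")
```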

NExT-QA

NExT-QA [2] is a VideoQA benchmark targeting the explanation of video content. It challenges QA models to reason about causal and temporal actions and to understand the rich object interactions in daily activities. NExT-QA contains 5,440 videos and about 52K manually annotated question-answer pairs, grouped into causal, temporal, and descriptive questions. The videos are 44 seconds long on average and focus on interactions between people and, occasionally, pets.

NExT-QA examples and question category statistics
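
As a rough sketch of how the multiple-choice split is typically loaded, the snippet below reads one of the released CSV files and tallies the question categories. The column names (`video`, `question`, `answer`, `type`, `a0` through `a4`) and the type coding (a leading C/T/D letter for causal/temporal/descriptive) are assumptions based on the public release.

```python
import pandas as pd

# Sketch: load the NExT-QA multiple-choice split (column names are
# assumptions based on the public release; verify against the repo).
df = pd.read_csv("nextqa/val.csv")

# Question types are coded with a leading letter: C = causal,
# T = temporal, D = descriptive (assumed coding).
df["category"] = df["type"].str[0].map(
    {"C": "causal", "T": "temporal", "D": "descriptive"}
)
print(df["category"].value_counts())

row = df.iloc[0]
print(row["question"])
for i in range(5):  # five candidate answers per question
    marker = "*" if i == row["answer"] else " "  # `answer` is the correct index
    print(f" {marker} ({i}) {row[f'a{i}']}")
```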

References

[1] Mangalam, K., Akshulakov, R., & Malik, J. (2023). EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding. arXiv:2308.09126. https://arxiv.org/abs/2308.09126

[2] Xiao, J., Shang, X., Yao, A., & Chua, T.-S. (2021). NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).