Dataset

HD-EPIC

The HD-EPIC dataset [1] is an egocentric long-video dataset of activities performed in the kitchen. It features long, in-the-wild recordings paired with dense annotations, including manually corrected narrations with precise action boundaries. Because our project focuses on video question answering, we use only the video modality.

HD-EPIC also provides a VQA benchmark built from its dense labels, covering seven question types: recipe, ingredient, nutrition, fine-grained action, 3D perception, object motion, and gaze. The benchmark is structured as 5-way multiple-choice QA, generated from 30 question prototypes for a total of 26,650 questions, with hard negatives sampled from the dataset to increase difficulty.
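To make the 5-way multiple-choice format concrete, the following is a minimal sketch of how one benchmark item and its accuracy metric could be represented. The class name, field names, and the `accuracy` helper are hypothetical and chosen for illustration only; they do not reflect the actual HD-EPIC annotation schema or its official evaluation code. With five choices per question, uniform random guessing yields an expected accuracy of 20%, which is the natural baseline to compare against.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

NUM_CHOICES = 5  # the benchmark is 5-way multiple choice


@dataclass(frozen=True)
class MultipleChoiceQuestion:
    """One multiple-choice item (hypothetical schema, not the official one)."""
    question_type: str        # e.g. "recipe", "gaze"
    question: str
    choices: Tuple[str, ...]  # one correct answer plus four hard negatives
    correct_index: int        # position of the correct answer in `choices`

    def __post_init__(self) -> None:
        if len(self.choices) != NUM_CHOICES:
            raise ValueError(f"expected {NUM_CHOICES} choices, got {len(self.choices)}")
        if not 0 <= self.correct_index < NUM_CHOICES:
            raise ValueError("correct_index is out of range")


def accuracy(questions: Sequence[MultipleChoiceQuestion],
             predictions: Sequence[int]) -> float:
    """Fraction of questions whose predicted choice index is correct."""
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    if not questions:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(q.correct_index == p for q, p in zip(questions, predictions))
    return correct / len(questions)


# Expected accuracy of uniform random guessing on a 5-way task.
CHANCE_ACCURACY = 1 / NUM_CHOICES
```

Storing the answer as an index rather than a string keeps scoring independent of answer wording, which matters when hard negatives are textually close to the correct option.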

References

[1] Perrett, T., et al. (2025). HD-EPIC: A Highly-Detailed Egocentric Video Dataset. arXiv:2502.04144. https://arxiv.org/abs/2502.04144