Our project relies on techniques from three fields: object detection, motion calibration, and vanishing point detection. We introduce the related work in each of these fields separately.
Object Detection
- Detecting Twenty-thousand Classes using Image-level Supervision: This paper introduces an object detection model (Detic) that leverages image-level annotations to recognize an extensive array of object classes. Instead of relying solely on bounding-box annotations, the approach uses weak supervision to recognize over 20,000 classes from readily available image-level labels. Because it handles a vast vocabulary while minimizing the need for precise annotations, the model is flexible, scalable, and efficient to deploy in our project.
- Segment Anything: This paper presents the Segment Anything Model (SAM), a segmentation model capable of segmenting any object in an image with minimal human input. The model works across a wide range of object categories and scenarios, offering a flexible and adaptable tool for object segmentation, and it lets users interactively refine segments or target specific objects through point, box, or text prompts. These capabilities are particularly useful for extracting detailed information about the baseball bat and its spatial relationship to the batter and catcher, which feeds into our keyframe detection (see the sketch after this list).
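To make the prompting workflow concrete, below is a minimal sketch of passing a detector box to SAM as a prompt. It assumes the publicly released `segment_anything` package and ViT-H checkpoint; the file names and box coordinates are placeholders rather than values from our pipeline.

```python
# Sketch: prompting SAM with a detector box to isolate the bat region.
# Assumes the released `segment_anything` package and ViT-H checkpoint;
# the box coordinates below are placeholders standing in for a detector output.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

frame = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(frame)  # computes the image embedding once per frame

bat_box = np.array([412, 180, 530, 460])  # XYXY box from the object detector
masks, scores, _ = predictor.predict(
    box=bat_box,
    multimask_output=True,  # return candidate masks at several granularities
)
bat_mask = masks[np.argmax(scores)]  # keep the highest-scoring mask
```

Computing the image embedding once with `set_image` keeps repeated prompting cheap, which matters when several objects in the same frame are probed.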
Motion Calibration
- Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs: This paper proposes an approach for identifying specific actions within lengthy, unedited video streams. It introduces a multi-stage CNN architecture that processes videos in phases (proposal, classification, and localization) to improve the accuracy of action localization. This approach exemplifies effective temporal action localization in videos and inspired the design of our multi-stage keyframe detection module.
- Rethinking the Faster R-CNN Architecture for Temporal Action Localization: This paper adapts the Faster R-CNN framework, traditionally used for spatial object detection, to the challenges of temporal action localization in video. The adaptation detects actions within a temporal context rather than static objects in an image, offering a robust solution for analyzing dynamic events in videos. Inspired by this approach, our keyframe detection module incorporates the temporal context of each keyframe candidate, improving the accuracy of our final keyframe predictions (see the sketch after this list).
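The sketch below illustrates the proposal-then-refine pattern these papers motivate, applied to keyframe candidates. It is not the architecture of either paper: the function names, threshold, and window size are placeholders of our own, and the per-frame scores would come from a first-stage classifier.

```python
# Sketch: two-stage keyframe selection with temporal context.
# Illustrative only; `frame_scores` stands in for first-stage classifier
# outputs, and the threshold and window size are placeholder values.
import numpy as np

def propose_candidates(frame_scores: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Stage 1: keep frames whose standalone score clears a threshold."""
    return np.flatnonzero(frame_scores > threshold)

def refine_with_context(frame_scores: np.ndarray, candidates: np.ndarray,
                        window: int = 8) -> int:
    """Stage 2: re-score each candidate by averaging scores over a
    temporal window, so isolated spikes lose to sustained motion."""
    best_frame, best_score = -1, -np.inf
    for t in candidates:
        lo, hi = max(0, t - window), min(len(frame_scores), t + window + 1)
        context_score = frame_scores[lo:hi].mean()
        if context_score > best_score:
            best_frame, best_score = t, context_score
    return best_frame

scores = np.random.rand(300)  # stand-in for per-frame model outputs
keyframe = refine_with_context(scores, propose_candidates(scores))
```

Averaging scores over a temporal window lets sustained motion outrank isolated spikes, which is the intuition we borrow from temporal action localization.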
Vanishing Point Detection
- Deep Learning for Vanishing Point Detection Using an Inverse Gnomonic Projection: This paper explores an advanced approach to vanishing point detection that integrates deep learning with geometric transformations. This method enhances the accuracy and robustness of vanishing point detection across a variety of scenes, including complex urban environments where multiple vanishing points and extensive perspective lines are present. It allows for precise detection of vanishing points in broadcast videos.
- Deep vanishing point detection: Geometric priors make dataset variations vanish: This paper introduces a deep learning framework that improves vanishing point detection by incorporating geometric priors into the model. These priors stabilize and guide the learning process, significantly improving generalization across different datasets. The resulting detections remain robust even in highly varied visual data, making it possible to compute the correct hitting angle in amateur videos (see the sketch after this list).
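As a concrete reference point, the sketch below shows a classical least-squares vanishing point estimate from line segments. It is not the learned method of either paper; it only illustrates the geometry those methods predict, and the segments here are placeholders for Hough-detected perspective lines.

```python
# Sketch: least-squares vanishing point from line segments.
# A classical baseline, not the learned method of the cited papers;
# the segments below are placeholders for Hough-detected perspective lines.
import numpy as np

def vanishing_point(segments: np.ndarray) -> np.ndarray:
    """segments: (N, 4) array of (x1, y1, x2, y2) endpoints.
    Returns the point minimizing the squared distance to all lines."""
    p1 = np.column_stack([segments[:, 0:2], np.ones(len(segments))])
    p2 = np.column_stack([segments[:, 2:4], np.ones(len(segments))])
    lines = np.cross(p1, p2)  # homogeneous line through each segment
    lines /= np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    _, _, vt = np.linalg.svd(lines)  # v minimizing ||L v||^2 with ||v|| = 1
    v = vt[-1]
    return v[:2] / v[2]  # back to inhomogeneous pixel coordinates

segs = np.array([[0, 400, 300, 250], [640, 420, 340, 255]], dtype=float)
print(vanishing_point(segs))  # approximate intersection of the two lines
```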