Overview: How to Achieve Object Permanence?

Our contributions are twofold: we curated the largest real-world amodal detection benchmark, and we developed a new amodal detection algorithm that can reason about occlusions.

Annotated data is critical for achieving object permanence in detection and tracking algorithms. Although partially occluded objects are labeled in some existing datasets with crowded scenes, annotations for fully occluded or out-of-frame objects are typically omitted. To address this gap, we propose a new real-world benchmark that includes ground-truth boxes for fully invisible and partially out-of-frame objects. In addition, achieving accurate and robust amodal detection and tracking requires a new amodal detection algorithm that can reason about heavy occlusions.

Our work has made two key contributions:

  1. We built TAO-Amodal, the largest real-world amodal detection benchmark.
  2. We created a new amodal detection/tracking algorithm that can reason about heavily or even fully occluded objects.

TAO-Amodal: A Large-Scale Real-World Amodal Detection Benchmark 

We have curated TAO-Amodal, an amodal version of the TAO dataset. This means that for every partially or completely occluded object, bounding box annotations have been modified or added so that the object is labeled to its full extent, even when it is completely occluded. This covers both in-frame and out-of-frame occlusions.

Our newly introduced dataset, TAO-Amodal, is a comprehensive resource consisting of over 2,900 videos, 800+ object categories, and 700k+ amodal bounding boxes. The dataset includes helpful attributes such as object visibility, which can be used to estimate the level of occlusion. Notably, it provides annotations for fully occluded, partially occluded, and out-of-frame objects. With its diversity of categories and complete annotations of occluded objects (including more than 36% partially out-of-frame boxes), we believe the TAO-Amodal dataset will significantly benefit the research community in object detection and tracking.
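To illustrate how a per-box visibility attribute supports occlusion analysis, here is a minimal sketch that buckets annotations by occlusion level. The field names (`visibility`, `amodal_bbox`) are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical sketch: grouping TAO-Amodal-style annotations by occlusion level.
# Field names ("visibility", "amodal_bbox") are assumptions for illustration;
# the real annotations follow a COCO/TAO-style JSON format.

def bucket_by_occlusion(annotations):
    """Split amodal boxes into fully visible / partially / fully occluded."""
    buckets = {"visible": [], "partial": [], "fully_occluded": []}
    for ann in annotations:
        v = ann["visibility"]  # fraction of the amodal box that is visible
        if v >= 0.95:
            buckets["visible"].append(ann)
        elif v > 0.0:
            buckets["partial"].append(ann)
        else:
            buckets["fully_occluded"].append(ann)
    return buckets

anns = [
    {"amodal_bbox": [10, 10, 50, 80], "visibility": 1.0},
    {"amodal_bbox": [120, 30, 40, 60], "visibility": 0.4},
    {"amodal_bbox": [300, 50, 40, 60], "visibility": 0.0},  # fully occluded
]
buckets = bucket_by_occlusion(anns)
print({k: len(v) for k, v in buckets.items()})
```

This kind of bucketing is what enables evaluating detectors separately on visible, partially occluded, and fully occluded objects.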

Proposed Amodal Detection Baseline

Temporal-Aware Detection

Given detection results and images from the previous frame, the detector outputs detection results for the current frame by incorporating temporal information.

Temporal information is essential for inferring occluded objects. Inspired by CenterTrack, we aim to develop a temporal-aware detector that utilizes detection results from previous frames. The detector takes the current frame, the previous frame, and the previous detection results as input, and outputs detection results for the current frame. By leveraging information from the previous frame, the detector can detect objects even when they are heavily occluded.
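The interface described above can be sketched as follows. This is a hedged stub, not the actual model: all class and method names are illustrative, and `_run_network` stands in for a learned network that would fuse the two frames and the prior detections:

```python
# Minimal interface sketch of a CenterTrack-style temporal-aware detector.
# All names here are illustrative assumptions; the real detector is a
# learned network, not this pass-through stub.

from dataclasses import dataclass, field

@dataclass
class Detection:
    box: tuple      # amodal (x, y, w, h)
    score: float
    track_id: int

@dataclass
class TemporalAwareDetector:
    prev_frame: object = None
    prev_dets: list = field(default_factory=list)

    def detect(self, frame):
        # A real model would fuse (frame, prev_frame, prev_dets) here.
        # Because evidence from frame t-1 is part of the input at t, the
        # detector can output objects that are heavily occluded at t.
        dets = self._run_network(frame, self.prev_frame, self.prev_dets)
        self.prev_frame, self.prev_dets = frame, dets
        return dets

    def _run_network(self, frame, prev_frame, prev_dets):
        # Placeholder: carry previous detections forward unchanged.
        return list(prev_dets)
```

The key design point is that the detector is stateful: each call to `detect` consumes the outputs of the previous call, so occlusion reasoning happens inside the detector rather than in a separate post-hoc tracker.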

Handling Occlusion with Depth Estimation

(a) Frame t − 1 has active tracks {1, 2, 3, 4}, each with an internal state comprising its 2D position, size, velocity, and depth. (b) We forecast tracks in 3D for frame t. (c) Tracks are matched to observed detections at t using spatial and appearance cues. Matched tracks are considered visible (e.g., 1, 3). Tracks that do not match a visible detection (e.g., 2, 4) may be occluded, or simply incorrectly forecasted. (d) To resolve this ambiguity, we leverage depth cues from a monocular depth estimator to compute (e) the freespace horizon. The region between the camera and the horizon must be freespace, while the area beyond it is unobserved and so may contain occluded objects. Tracks lying beyond the freespace horizon are reported as occluded (e.g., 2). Tracks within freespace (e.g., 4) should have been visible but did not match any visible detections; hence, we assume these tracks are incorrectly forecasted, and we delete them.
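The decision rule in steps (d) and (e) can be sketched compactly. This is a hedged illustration of the freespace-horizon check, with made-up names and depth values; the real system derives the horizon from a monocular depth estimate:

```python
# Hedged sketch of the freespace-horizon check: unmatched tracks forecasted
# beyond the horizon are kept as occluded; those inside observed freespace
# should have been visible, so they are deleted as bad forecasts.
# All names and numbers here are illustrative assumptions.

def resolve_unmatched_tracks(unmatched_tracks, freespace_depth_at):
    """unmatched_tracks: list of (track_id, x, depth) forecasts.
    freespace_depth_at: maps an image column x to the freespace horizon depth."""
    occluded, deleted = [], []
    for track_id, x, depth in unmatched_tracks:
        if depth > freespace_depth_at(x):
            occluded.append(track_id)   # behind an occluder: keep as occluded
        else:
            deleted.append(track_id)    # inside observed freespace: bad forecast
    return occluded, deleted

# Example mirroring the caption: track 2 lies beyond the horizon (occluded),
# track 4 lies within freespace (deleted).
horizon = lambda x: 5.0  # constant 5 m freespace horizon, for illustration
occ, dele = resolve_unmatched_tracks([(2, 100, 8.0), (4, 200, 3.0)], horizon)
print(occ, dele)  # [2] [4]
```

The design choice worth noting is asymmetry: depth evidence can only confirm freespace, so a track beyond the horizon is given the benefit of the doubt and retained, while a track contradicted by observed freespace is discarded.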