Technical Report

Gestalt psychology suggests that human perception inherently organizes visual elements into cohesive wholes. When an object is occluded, humans can often infer its complete outline, an ability that develops early in life. Similarly, object permanence suggests that, given some temporal context, humans perceive objects as persisting even through complete occlusions. Object segmentation has traditionally ignored these phenomena, focusing largely on the visible or modal regions of objects (as exemplified by models like SAM). Recent work has shifted toward amodal segmentation, which involves segmenting an object's full shape, including both visible and occluded parts. This task has broad real-world applications, including safe navigation in robotic manipulation and autonomous driving, understanding occluder-occludee relationships in complex scenes, and enhancing advanced image and video editing tools.

As shown in Figure 1, we propose repurposing a video diffusion model, Stable Video Diffusion (SVD), to achieve highly accurate and generalizable video amodal segmentation. One key insight is that foundational diffusion models trained to generate pixels also bake in strong priors on object shape.

The first stage of our pipeline generates amodal masks {At} for an object, given its modal masks {Mt} and the pseudo-depth of the scene {Dt} (obtained by running a monocular depth estimator on the RGB video sequence {It}). The predicted amodal masks from the first stage are then passed as input to the second stage, along with the modal RGB content of the occluded object in consideration. The second stage inpaints the occluded region and outputs the amodal RGB content {Ct} for the occluded object. Both stages employ a conditional latent diffusion framework with a 3D UNet backbone. Conditionings are encoded via a VAE encoder into latent space, concatenated, and processed by a 3D UNet with interleaved spatial and temporal blocks. CLIP embeddings of {Mt} and of the modal RGB content provide cross-attention cues for the first and second stages, respectively. Finally, the VAE decoder translates the outputs back to pixel space.

Figure 2: Model pipeline for amodal segmentation and content completion
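To make the conditioning scheme concrete, the following is a minimal sketch of a single denoising step shared by both stages. The `vae`, `clip_encoder`, and `unet3d` interfaces and their signatures are illustrative placeholders, not the exact SVD implementation.

```python
import torch

def stage_forward(unet3d, vae, clip_encoder, noisy_latents, t, conditions, cross_attn_input):
    """One denoising step of either stage (hypothetical interfaces).

    conditions:       list of per-frame conditioning videos, e.g. [modal masks, pseudo-depth]
                      for stage 1 or [amodal masks, modal RGB content] for stage 2,
                      each of shape (B, T, C, H, W).
    cross_attn_input: frames fed to CLIP for cross-attention cues, i.e. the modal masks
                      for stage 1 or the modal RGB content for stage 2.
    """
    B, T = noisy_latents.shape[:2]

    # Encode each conditioning video into latent space with the VAE encoder.
    cond_latents = [vae.encode(c.flatten(0, 1)).unflatten(0, (B, T)) for c in conditions]

    # Concatenate the conditioning latents with the noisy latents along the channel axis.
    unet_input = torch.cat([noisy_latents, *cond_latents], dim=2)

    # CLIP embeddings provide cross-attention cues for the 3D UNet,
    # which interleaves spatial and temporal blocks.
    clip_emb = clip_encoder(cross_attn_input.flatten(0, 1)).unflatten(0, (B, T))
    noise_pred = unet3d(unet_input, timestep=t, encoder_hidden_states=clip_emb)
    return noise_pred
```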

A key challenge for stage 2 of our method (amodal content completion) is the lack of ground-truth RGB content in occluded regions, even in synthetic datasets like SAIL-VOS. Inspired by the self-supervised training-pair construction used extensively in image amodal tasks, we extend this approach to video sequences. Figure 3 illustrates an example of a modal-amodal RGB content training pair. To construct such a pair, we first select an object from the dataset with near-complete visibility (above 95%). We then sequentially overlay random amodal mask sequences onto this fully visible object until its visibility falls below a set threshold, thereby simulating occlusion. This effectively provides ground-truth RGB data for the occluded regions.

Figure 3: Modal-amodal RGB training pair for content completion
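The pair-construction procedure can be sketched as follows. Array layouts, the helper name, and the stopping threshold of 0.5 are illustrative assumptions rather than the exact implementation.

```python
import random
import numpy as np

def make_training_pair(amodal_rgb, amodal_mask, occluder_mask_pool, min_visibility=0.5):
    """Build a modal-amodal RGB training pair for one video (a sketch).

    amodal_rgb:         (T, H, W, 3) RGB frames of an object selected with
                        near-complete (>95%) visibility.
    amodal_mask:        (T, H, W) boolean masks of that object; these serve as
                        the ground-truth amodal masks.
    occluder_mask_pool: list of (T, H, W) boolean amodal mask sequences sampled
                        from other objects in the dataset.
    """
    visible = amodal_mask.copy()

    # Sequentially overlay random occluder mask sequences until the object's
    # visibility drops below the threshold, simulating occlusion.
    # (Capped number of attempts so the loop always terminates.)
    for _ in range(100):
        if visible.sum() / max(amodal_mask.sum(), 1) <= min_visibility:
            break
        occluder = random.choice(occluder_mask_pool)
        visible &= ~occluder

    # Modal RGB content: the object's pixels restricted to the simulated visible region.
    modal_rgb = amodal_rgb * visible[..., None]

    # Because the object was near-fully visible to begin with, amodal_rgb now
    # supplies ground-truth RGB content for the simulated occluded regions.
    return modal_rgb, visible, amodal_rgb, amodal_mask
```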

Table 1 shows quantitative comparisons on SAIL-VOS and TAO-Amodal, where our method surpasses all baselines. Notably, it achieves nearly a 13% improvement over the second-best method, PCNet-M, in terms of mIoU_occ, highlighting effective completion of occluded object regions. Despite being trained exclusively on synthetic SAIL-VOS, our model generalizes strongly in a zero-shot evaluation on TAO-Amodal. We posit that, in addition to leveraging the foundational knowledge and rich priors from SVD's large-scale pretraining, our model learns temporal cues that help it amodally complete unseen object classes using information from neighboring frames.

Table 1: Quantitative comparison on SAIL-VOS and TAO-Amodal
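For reference, a common way to define the occluded-region IoU underlying mIoU_occ is to restrict both the prediction and the ground truth to the invisible part of the object. The sketch below assumes boolean mask arrays and is offered as an assumption, not necessarily the exact evaluation protocol used for Table 1.

```python
import numpy as np

def occluded_iou(pred_amodal, gt_amodal, gt_modal):
    """IoU restricted to the occluded region (all inputs are (T, H, W) boolean arrays)."""
    # Occluded region = amodal ground truth minus the visible (modal) part.
    gt_occ = gt_amodal & ~gt_modal
    pred_occ = pred_amodal & ~gt_modal

    inter = (pred_occ & gt_occ).sum()
    union = (pred_occ | gt_occ).sum()
    # Convention: if the object is fully visible (empty occluded region), score 1.
    return inter / union if union > 0 else 1.0
```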

Table 2 provides quantitative comparisons on the MOVi-B/D datasets, where our method outperforms the prior state of the art. Despite strong camera motion in MOVi-B/D, our model adapts well without access to camera extrinsics or optical flow (unlike some baselines). We posit that our method exploits the 3D priors of Stable Video Diffusion and is therefore able to maintain consistent object shapes across different viewpoints.

Table 2: Quantitative Comparison on MOVi-B/D

In Figure 4, we show qualitative comparisons across diverse datasets. Our method leverages strong shape priors, such as for humans, chairs, and teapots, to generate clean and realistic object shapes. It also excels at handling heavy occlusions; even when objects are nearly fully occluded (e.g., the “chair” in the second row of SAIL-VOS), our method achieves high-fidelity shape completion by utilizing temporal priors. Note that TAO-Amodal contains out-of-frame occlusions that none of the methods are trained for, yet our method is able to handle such cases.

Figure 4: Qualitative comparison of amodal segmentation methods across diverse datasets

In Figure 5, we show qualitative results for content completion. Although our content completion module, initialized from pretrained SVD weights, is finetuned solely on synthetic SAIL-VOS, it achieves photorealistic, high-fidelity object inpainting even in real-world scenarios. Furthermore, our method can complete unseen categories, such as giraffes and plastic bottles, likely due to its ability to transfer styles and patterns from the visible parts of objects to occluded areas in the current or neighboring frames. We show examples from TAO-Amodal (top) and in-the-wild YouTube videos (bottom).

Figure 5: Qualitative results for content completion

In Figure 6, we highlight the lack of temporal coherence in a single-frame, diffusion-based method, pix2gestalt, for both the predicted amodal segmentation mask and the RGB content of the occluded person in the example shown. By leveraging temporal priors, our approach achieves significantly higher temporal consistency across occlusions.

Figure 6: Temporal consistency comparison with an image amodal segmentation method

In Figure 7, we show an example of multi-modal generation from our diffusion model. Since there are multiple plausible explanations for the shape of the person's occluded region, our model predicts two such plausible amodal masks (with the person's occluded legs in two different orientations).

Figure 7: An example of multi-modal generation from our diffusion model
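Such diverse hypotheses can be obtained simply by sampling the diffusion model with different initial noise. The sketch below assumes a hypothetical `stage1_pipeline` wrapper and call signature; it is only meant to illustrate the idea.

```python
import torch

def sample_amodal_hypotheses(stage1_pipeline, conditions, seeds=(0, 1)):
    """Draw multiple plausible amodal mask sequences by varying the initial noise."""
    hypotheses = []
    for seed in seeds:
        generator = torch.Generator().manual_seed(seed)
        # Each seed yields a different sample from the learned distribution over
        # amodal masks, e.g. different leg orientations for an occluded person.
        masks = stage1_pipeline(conditions, generator=generator)
        hypotheses.append(masks)
    return hypotheses
```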