Introduction

Monocular depth estimation models (MDEs) have recently made significant leaps in performance with the release of models like Depth Anything V2 [2]. These models generate highly detailed relative depth estimates across diverse indoor and outdoor data. However, they exhibit specific failures that arise from biases in their training data. In this work we seek to build a system that systematically finds these failures. More specifically, we find that depth models are not robust to simple camera pose perturbations [1]. While viewpoint robustness has been studied extensively in works like ViewFool [3], it has not been explored in the context of unbounded scenes, nor for depth estimation.
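To make the failure mode concrete, the following minimal sketch measures how a relative-depth model's error changes under small random pose perturbations around a fixed viewpoint. Here `render_rgbd` and `mde_model` are hypothetical placeholders for a scene renderer with ground-truth depth and a monocular depth model such as Depth Anything V2; the scale-and-shift alignment step reflects that these models predict depth only up to an affine transform.

```python
import torch

def perturbation_sensitivity(mde_model, render_rgbd, scene, pose,
                             n_samples=32, rot_std=0.05, trans_std=0.05):
    """Estimate how much a relative-depth model degrades under small,
    random camera pose perturbations of a single viewpoint.

    `render_rgbd(scene, pose)` is a placeholder for any renderer (e.g. a
    NeRF or mesh rasterizer) returning an RGB image and ground-truth depth;
    `pose` is a 6-vector (axis-angle rotation + translation).
    """
    errors = []
    for _ in range(n_samples):
        # Sample a small perturbation of rotation and translation.
        noise = torch.cat([rot_std * torch.randn(3), trans_std * torch.randn(3)])
        rgb, gt_depth = render_rgbd(scene, pose + noise)
        pred = mde_model(rgb)  # relative depth map, shape (H, W)
        # Least-squares scale/shift alignment to ground truth, since the
        # model only predicts depth up to an affine transform.
        a = torch.stack([pred.flatten(), torch.ones_like(pred.flatten())], dim=1)
        sol = torch.linalg.lstsq(a, gt_depth.flatten().unsqueeze(1)).solution
        aligned = pred * sol[0] + sol[1]
        errors.append((aligned - gt_depth).abs().mean())
    return torch.stack(errors)
```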

Specifically, rather than conducting an exhaustive search over all possible camera poses, we seek to use adversarial approaches to find failure trajectories along which the depth estimation gradually worsens. While even random sampling of camera poses uncovers perturbations that degrade performance by up to 25%, we want to develop a diagnostic system capable of finding more systematic failures. Furthermore, to ensure that these failures are semantically meaningful, we employ an updated objective function along with heuristic-driven design choices so that the optimization always converges to meaningful failures, as sketched below.
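One simple instantiation of such a trajectory search is a greedy, gradient-free hill climb over pose deltas, sketched below. It assumes a scoring function `eval_depth_error` (hypothetical, e.g. the alignment-corrected error from the previous sketch) and uses a smoothness penalty as one illustrative heuristic for keeping trajectories physically plausible; this is a sketch of the general idea, not the exact objective used in this work.

```python
import torch

def find_failure_trajectory(eval_depth_error, init_pose, n_steps=20,
                            n_candidates=16, step_std=0.02, smooth_w=0.5):
    """Greedy, gradient-free search for a camera trajectory along which a
    depth model's error grows. `eval_depth_error(pose)` is a placeholder
    that renders the scene at `pose` and returns the model's depth error.
    """
    pose = init_pose.clone()
    prev_delta = torch.zeros_like(pose)
    trajectory, errors = [pose.clone()], [eval_depth_error(pose)]
    for _ in range(n_steps):
        best_score, best_delta, best_err = -float('inf'), None, None
        for _ in range(n_candidates):
            delta = step_std * torch.randn_like(pose)
            err = eval_depth_error(pose + delta)
            # Reward high depth error, penalize jerky pose changes so the
            # trajectory stays smooth rather than teleporting between poses.
            score = err - smooth_w * (delta - prev_delta).norm()
            if score > best_score:
                best_score, best_delta, best_err = score, delta, err
        pose, prev_delta = pose + best_delta, best_delta
        trajectory.append(pose.clone())
        errors.append(best_err)
    return trajectory, errors
```

In practice a gradient-based optimizer over a differentiable renderer could replace the random candidate proposals; the greedy variant above is shown only because it makes the per-step objective explicit.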

References

[1] Zhao, Y., Kong, S., & Fowlkes, C. (2021). Camera pose matters: Improving depth prediction by mitigating pose distribution bias. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 15759–15768). IEEE.

[2] Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., & Zhao, H. (2024). Depth Anything V2. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024).

[3] Dong, Y., Ruan, S., Su, H., Kang, C., Wei, X., & Zhu, J. (2022). ViewFool: Evaluating the Robustness of Visual Recognition to Adversarial Viewpoints. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022).