Experiments

Models Studied

To ensure broad and representative coverage of modern monocular depth estimation approaches, our evaluation pipeline incorporates several high-performing models spanning different training regimes and supervision strategies. We include MiDaS [22], a widely adopted method trained across heterogeneous datasets; among its available configurations, we select the highest-capacity variant to maximize performance. We also evaluate ZoeDepth [2], which explicitly bridges relative and metric depth prediction, using a variant fine-tuned on the NYU [26] indoor dataset and the KITTI [8] driving benchmark.

In addition, we consider two large-scale models from the DepthAnything family. DepthAnything V1 [30] leverages large collections of unlabeled images to improve robustness, while DepthAnything V2 [31] follows a two-stage training strategy—pretraining on 500K synthetic images followed by large-scale adaptation on 62 million real-world samples—yielding strong generalization across domains.

Our system is intentionally designed with modularity in mind: new depth estimation methods can be integrated through a common interface with minimal effort, enabling systematic comparison across architectures, data sources, and training objectives.
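The common interface described above can be sketched as follows. This is a minimal illustration, not our actual implementation; the class names `DepthEstimator` and `ConstantDepth` are hypothetical, and a real integration would wrap each model's inference code behind `predict`.

```python
from abc import ABC, abstractmethod
import numpy as np

class DepthEstimator(ABC):
    """Hypothetical common interface for plugging depth models into the pipeline."""

    @abstractmethod
    def predict(self, image: np.ndarray) -> np.ndarray:
        """Return a per-pixel depth map (H x W) for an H x W x 3 image."""
        ...

class ConstantDepth(DepthEstimator):
    """Trivial stand-in model used only to illustrate the interface."""

    def predict(self, image: np.ndarray) -> np.ndarray:
        h, w = image.shape[:2]
        return np.ones((h, w), dtype=np.float32)

def run_models(models, image):
    """Evaluate any registered models on the same input, enabling comparison."""
    return {type(m).__name__: m.predict(image) for m in models}
```

Because every model exposes the same `predict` signature, adding a new architecture reduces to writing one adapter class.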

Other Experiments

Random Sampling

Our preliminary results cover random sampling on both the 3D-FRONT dataset and our own. We evaluate the 𝛿₁ and AbsRel metrics for DepthAnything V2 across different poses, with the graph displaying the mean and standard deviation at each pose under randomly sampled rotations. Notably, the 𝛿₁ score drops by approximately 25% relative to its average value, while the AbsRel score degrades by about 29%, indicating a significant deterioration in depth estimation performance under rotational variation.
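For reference, the two metrics reported above have standard definitions: AbsRel is the mean absolute relative error, and 𝛿₁ is the fraction of pixels whose prediction/ground-truth ratio falls within a threshold of 1.25. A minimal sketch (function names are illustrative, not taken from our codebase):

```python
import numpy as np

def abs_rel(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    mask = gt > 0  # ignore pixels with no ground-truth depth
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def delta1(pred: np.ndarray, gt: np.ndarray, thresh: float = 1.25) -> float:
    """Fraction of valid pixels with max(pred/gt, gt/pred) below the threshold."""
    mask = gt > 0
    ratio = np.maximum(pred[mask] / gt[mask], gt[mask] / pred[mask])
    return float(np.mean(ratio < thresh))
```

A 25% drop in 𝛿₁ therefore means a quarter fewer pixels land within the 1.25 ratio band after rotation, while a 29% degradation in AbsRel means the per-pixel relative error grows by roughly that factor.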

Ablation on Camera Parameterization

In addition to the camera parameterization adopted in our main methodology, we conduct an ablation study to analyze the impact of different rotation representations on optimization stability and model performance. Specifically, we compare our chosen parameterization against alternatives based on quaternions and Lie algebra representations.

These parameterizations differ in their geometric properties and optimization behavior. Quaternions provide a compact and continuous representation of 3D rotations but must be explicitly renormalized to unit length during optimization, which introduces an additional constraint on each update. Lie algebra representations, while minimal and well suited to incremental updates, can exhibit local linearization effects that slow convergence when large rotations are present. By contrast, our selected parameterization offers a smoother optimization landscape for the update process used in our framework.
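The two alternative representations can be sketched concretely. The snippet below shows the quaternion renormalization step mentioned above, and the so(3) exponential map (Rodrigues' formula) that converts a Lie algebra axis-angle vector into a rotation matrix; both are standard formulas, though the function names here are our own illustrative choices.

```python
import numpy as np

def normalize_quaternion(q: np.ndarray) -> np.ndarray:
    """Project a 4-vector back onto the unit sphere, e.g. after a gradient step."""
    return q / np.linalg.norm(q)

def so3_exp(omega: np.ndarray) -> np.ndarray:
    """Rodrigues' formula: map an axis-angle vector omega in so(3) to SO(3)."""
    theta = np.linalg.norm(omega)
    if theta < 1e-8:
        return np.eye(3)  # near-zero rotation: linearization is exact enough
    k = omega / theta  # unit rotation axis
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])  # skew-symmetric cross-product matrix
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
```

The Lie algebra update composes a small increment `so3_exp(delta)` with the current rotation, whereas the quaternion update must be followed by `normalize_quaternion` to stay on the rotation manifold.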

Through this ablation, we systematically evaluate convergence behavior, numerical stability, and final task performance across these representations, allowing us to isolate the effect of camera parameterization from other modeling choices. This analysis provides insight into how rotation representation influences both optimization dynamics and downstream depth model failure discovery.