Methodology

Our approach experimentally evaluates the robustness of monocular depth estimation (MDE) models. Specifically, we develop a diagnostic toolbox that systematically samples failure modes of common MDE models. The components of this toolbox are described below.

Camera Parameterization

To identify failures in an MDE model, we parameterize the camera with nine parameters: six for orientation, $R = \{r_1, \dots, r_6\}$, and three for position, $\delta = \{\delta_x, \delta_y, \delta_z\}$.

The choice of rotation parameterization strongly affects how well the pose optimizes: parameterizing $\mathrm{SO}(3)$ directly (e.g., via Euler angles or axis-angle) is prone to local minima and unstable gradients. Following prior work, we therefore adopt a continuous $\mathbb{R}^6$ overparameterization with a smoother optimization landscape, treating the six orientation parameters as two 3-vectors $r_1, r_2 \in \mathbb{R}^3$ and mapping them to a rotation matrix via Gram-Schmidt orthonormalization:

\hat{r}_1 = \frac{r_1}{\|r_1\|}, \quad \hat{r}_2 = \frac{r_2 - (\hat{r}_1 \cdot r_2)\,\hat{r}_1}{\|r_2 - (\hat{r}_1 \cdot r_2)\,\hat{r}_1\|}, \quad \hat{r}_3 = \hat{r}_1 \times \hat{r}_2, \quad R = [\hat{r}_1, \hat{r}_2, \hat{r}_3].
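For concreteness, a minimal PyTorch sketch of this mapping is shown below; the function name and tensor layout are our own choices, and PyTorch3D ships a comparable helper (`pytorch3d.transforms.rotation_6d_to_matrix`).

```python
import torch
import torch.nn.functional as F

def rotation_from_6d(r: torch.Tensor) -> torch.Tensor:
    """Map the 6D rotation parameters (two 3-vectors) to a 3x3 rotation
    matrix via Gram-Schmidt, mirroring the equations above.

    r: tensor of shape (..., 6) holding [r1, r2].
    """
    r1, r2 = r[..., :3], r[..., 3:]
    r1_hat = F.normalize(r1, dim=-1)                       # normalize r1
    proj = (r1_hat * r2).sum(dim=-1, keepdim=True) * r1_hat
    r2_hat = F.normalize(r2 - proj, dim=-1)                # orthogonalize r2
    r3_hat = torch.cross(r1_hat, r2_hat, dim=-1)           # third axis
    return torch.stack([r1_hat, r2_hat, r3_hat], dim=-1)   # columns of R
```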

Baselines and Dataset Generation

Dataset Generation

While datasets such as 3D-FRONT and Hypersim exist, there remains a notable lack of diversity in synthetic indoor datasets that are compatible with differentiable renderers like PyTorch3D. To address this gap, we construct a small but diverse dataset of 10 texture-rich indoor scenes. Our pipeline integrates Blender and PyTorch3D, enabling us to bake high-quality textures in Blender into a single 8K texture map, while ensuring compatibility with PyTorch3D’s differentiable rendering framework.

Our process is as follows:

  1. Convert publicly available 3D assets originally made for Blender into formats compatible with PyTorch3D.
  2. Merge all objects in each scene into a single mesh.
  3. Generate a unified UV map using Blender’s Smart UV unwrap.
  4. Bake all object textures into a single 8K texture map using Blender’s Cycles Renderer.
  5. Load the processed mesh and texture into PyTorch3D while preserving spatial relationships (see the sketch below).
  6. Create a repository of converted 3D assets for direct rendering in PyTorch3D.
  7. Streamline workflows for 3D rendering and analysis using the converted assets.
Our manually curated dataset in PyTorch3D. Samples were rendered and baked in Blender, then transferred to PyTorch3D.
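As an illustration of step 5, the sketch below loads one converted scene into PyTorch3D and renders its z-buffer depth; the file path, image size, and camera settings are placeholders rather than the exact values used in our pipeline.

```python
import torch
from pytorch3d.io import load_objs_as_meshes
from pytorch3d.renderer import (
    FoVPerspectiveCameras, MeshRasterizer, RasterizationSettings,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a converted scene (.obj/.mtl referencing the baked 8K texture map).
mesh = load_objs_as_meshes(["assets/scene_01/scene.obj"], device=device)

# Rasterize to obtain the z-buffer depth that later serves as ground truth.
cameras = FoVPerspectiveCameras(device=device)
raster_settings = RasterizationSettings(image_size=512, faces_per_pixel=1)
rasterizer = MeshRasterizer(cameras=cameras, raster_settings=raster_settings)
fragments = rasterizer(mesh)
depth = fragments.zbuf[..., 0]  # (N, H, W); background pixels are -1
```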

Viewpoint Initialization

We use a guided, seeded random sampling approach to initialize our viewpoints. We begin from a seed point inside the scene mesh. At each iteration, we compute the distance from the current camera position to the closest mesh face, use it as the radius of a spherical collider, and perturb the camera pose inside that “safety sphere”. The radius is defined as

r_t = \min_{f \in \mathcal{F}} \operatorname{dist}(\mathbf{c}_t, f),

where $\mathcal{F}$ is the set of mesh faces and $\mathbf{c}_t$ is the camera position at iteration $t$. The position and orientation are then perturbed as

\tilde{\mathbf{c}}_t = \mathbf{c}_t + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{U}\big(\{\boldsymbol{x} \in \mathbb{R}^3 : \|\boldsymbol{x}\|_2 \le r_t\}\big), \qquad \tilde{R}_t = R_{t-1}\,\delta R,

where $\delta R$ is a small random rotation perturbation. The sampling step is followed by a culling step in which we discard degenerate camera poses based on the semantic diversity visible from each pose. We run this procedure for 100 iterations to sample 50 valid candidate poses.

The culling is defined as,

\theta_t = \cos^{-1}\!\big(\mathbf{v}_t^\top \mathbf{d}_t\big) > \tau_{\text{deg}}.
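The sketch below illustrates this sampling loop under stated assumptions: `nearest_face_distance` stands in for the mesh distance query (e.g., a trimesh proximity lookup) and `is_degenerate` for the semantic-diversity/viewing-angle culling check; neither name is part of our released code.

```python
import numpy as np

def sample_viewpoints(mesh, seed_point, n_iters=100, n_poses=50, rng=None):
    """Guided seeded random sampling of candidate camera positions."""
    rng = rng or np.random.default_rng(0)
    c = np.asarray(seed_point, dtype=float)
    poses = []
    for _ in range(n_iters):
        # Safety-sphere radius: distance from the camera to the closest face.
        r = nearest_face_distance(mesh, c)
        # Perturbation drawn uniformly from the ball of radius r.
        direction = rng.normal(size=3)
        direction /= np.linalg.norm(direction)
        c = c + direction * r * rng.uniform() ** (1.0 / 3.0)
        # Orientation is perturbed analogously (R_t = R_{t-1} dR), omitted here.
        if not is_degenerate(mesh, c):
            poses.append(c.copy())
        if len(poses) >= n_poses:
            break
    return poses
```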

Current Approach

Beyond the sampled baselines, our current approach optimizes the camera pose directly: we use a differentiable rendering pipeline and compare the renderer's z-buffer depth against the model's depth prediction.

Instead of conventional pixel-wise regression losses such as $\ell_1$ or $\ell_2$, we incorporate a patch-wise ordinal loss into the objective function. The motivation for this design choice is that our primary concern is not strict numerical accuracy of depth values on an absolute scale, but rather the preservation of relative depth relationships and the avoidance of distributional inconsistencies (e.g., incorrect foreground-background ordering).

Instead of enforcing point-wise agreement, the patch-wise formulation evaluates depth relationships over local regions, encouraging consistent ordinal structure within each patch. This makes the loss more robust to global scale and shift ambiguities that commonly arise in monocular depth estimation. By operating at the patch level, the loss captures local geometric context and penalizes violations of relative depth ordering that are perceptually significant but may be weakly reflected in pixel-wise metrics.

The patch-wise ordinal term is constructed as follows. For a pair of pixels $(i, j)$ within a patch, we compare the ground-truth (rendered) and predicted depth differences,

\Delta g = \mathcal{D}_{R, \delta}(i) - \mathcal{D}_{R, \delta}(j), \qquad \Delta p = \hat{\mathcal{D}}_{R, \delta}(i) - \hat{\mathcal{D}}_{R, \delta}(j),

where $\mathcal{D}_{R, \delta}$ and $\hat{\mathcal{D}}_{R, \delta}$ denote the rendered and predicted depth maps at pose $(R, \delta)$. Depending on whether $\Delta g$ indicates that pixel $i$ is farther than, closer than, or approximately as deep as pixel $j$, the pair contributes

\mathcal{L}_{>} = \log\!\big(1 + e^{-\Delta p}\big), \quad \mathcal{L}_{<} = \log\!\big(1 + e^{\Delta p}\big), \quad \mathcal{L}_{=} = |\Delta p|.

The overall patch-wise loss for patch $n$ is then given as

\mathcal{L}_{\text{patch}}^{(n)} = \frac{\mathcal{L}_{>}^{(n)} + \mathcal{L}_{<}^{(n)} + \mathcal{L}_{=}^{(n)}}{M},

where $M$ is the number of pixel pairs sampled within the patch.
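A minimal PyTorch sketch of this patch-wise ordinal loss is given below; the patch size, number of sampled pairs, and equality tolerance `eps` are illustrative assumptions rather than the exact values used in our experiments.

```python
import torch
import torch.nn.functional as F

def patch_ordinal_loss(depth_gt, depth_pred, patch=32, pairs=128, eps=0.02):
    """Sample pixel pairs within each patch and penalize predicted depth
    orderings that disagree with the rendered (ground-truth) ordering."""
    H, W = depth_gt.shape[-2:]
    gt = depth_gt.reshape(-1, H, W)
    pred = depth_pred.reshape(-1, H, W)
    losses = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            g = gt[:, y:y + patch, x:x + patch].reshape(gt.shape[0], -1)
            p = pred[:, y:y + patch, x:x + patch].reshape(pred.shape[0], -1)
            i = torch.randint(0, g.shape[1], (pairs,), device=g.device)
            j = torch.randint(0, g.shape[1], (pairs,), device=g.device)
            dg = g[:, i] - g[:, j]            # ground-truth differences
            dp = p[:, i] - p[:, j]            # predicted differences
            farther = (dg > eps).float()      # i deeper than j
            closer = (dg < -eps).float()      # i closer than j
            equal = 1.0 - farther - closer    # approximately equal depth
            pair_loss = (F.softplus(-dp) * farther
                         + F.softplus(dp) * closer
                         + dp.abs() * equal)
            losses.append(pair_loss.mean())   # mean over the M sampled pairs
    return torch.stack(losses).mean()
```

Averaging over the sampled pairs in each patch plays the role of the $1/M$ normalization above.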

On top of the patch-wise loss, we also define a collision regularization term,

d = \min_{f \in \mathcal{F}} \operatorname{dist}(\delta, f), \qquad \mathcal{L}_{\text{dist}} = \max(0,\, d_{\text{th}} - d)^2,

where $d$ is the distance from the camera position $\delta$ to the nearest mesh face and $d_{\text{th}}$ is a minimum clearance threshold.

This yields the final loss, defined as

\mathcal{L}_{\text{depth}} = \max_{R, \delta}\; \lambda_{\text{ord}}\, \mathcal{L}_{\text{ordinal}} + \min_{R, \delta}\; \lambda_{\text{dist}}\, \mathcal{L}_{\text{dist}},

where $\mathcal{L}_{\text{ordinal}}$ aggregates the patch-wise losses $\mathcal{L}_{\text{patch}}^{(n)}$ over all patches. In other words, we search for poses $(R, \delta)$ that maximize the ordinal error of the MDE model while keeping the collision penalty small.
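A rough sketch of how this objective could be optimized is shown below, reusing the `rotation_from_6d` and `patch_ordinal_loss` sketches above; `render_depth`, `render_rgb`, `mde_model`, and `min_face_distance` are placeholders for the differentiable depth/RGB renderers, the MDE under test, and the mesh distance query, and the weights and step counts are illustrative.

```python
import torch

# Adversarial pose search: ascend on the ordinal loss, descend on the
# collision penalty, with both terms differentiable w.r.t. the pose.
r6 = torch.randn(6, requires_grad=True)         # 6D rotation parameters
delta = torch.zeros(3, requires_grad=True)      # camera position
opt = torch.optim.Adam([r6, delta], lr=1e-2)

lambda_ord, lambda_dist, d_th = 1.0, 10.0, 0.2  # illustrative values

for step in range(200):
    R = rotation_from_6d(r6)
    gt_depth = render_depth(R, delta)             # differentiable z-buffer depth
    pred_depth = mde_model(render_rgb(R, delta))  # MDE prediction on the render
    l_ord = patch_ordinal_loss(gt_depth, pred_depth)
    d = min_face_distance(delta)                  # distance to the nearest face
    l_dist = torch.clamp(d_th - d, min=0.0) ** 2
    loss = -lambda_ord * l_ord + lambda_dist * l_dist  # maximize ordinal error
    opt.zero_grad()
    loss.backward()
    opt.step()
```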

Overall, this approach combines the advantages of the sampling-based and optimization-based strategies described above.

Metrics

To quantitatively assess the accuracy of predicted depth maps, we use commonly adopted metrics in monocular depth estimation that capture both relative and absolute errors.

1. Absolute Relative Error (AbsRel).

\text{AbsRel} = \frac{1}{N} \sum_{i=1}^{N} \frac{|d_i - d_i^*|}{d_i^*}

where $d_i$ and $d_i^*$ are the predicted and ground-truth depth values for pixel $i$, and $N$ is the number of valid pixels. AbsRel measures the fractional error relative to the true depth, highlighting overall magnitude discrepancies.

2. Accuracy under Threshold.

We report the threshold accuracies $\delta_n$, with a focus on $\delta_1$.

\delta_n = \frac{\left|\left\{\, i : \max\!\left(\frac{d_i}{d_i^*}, \frac{d_i^*}{d_i}\right) < 1.25^n \right\}\right|}{N}

This metric evaluates the percentage of pixels for which the predicted depth is within a multiplicative factor of $1.25^n$ of the ground truth. Common choices are $n = 1, 2, 3$, capturing increasingly relaxed tolerances. Higher $\delta_n$ indicates better depth alignment.

3. Interpretation.

Lower AbsRel indicates closer absolute agreement with ground truth depth values.

Higher $\delta_1$ indicates better relative ordering and consistency of depth predictions.

These metrics together provide a balanced view of depth prediction quality: AbsRel captures absolute scale errors, while $\delta_1$ captures relative depth consistency, which is particularly relevant when evaluating robustness to adversarial viewpoints or across different camera configurations.
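For completeness, a minimal sketch of both metrics as defined above; the function name and mask handling are our own choices, and ground-truth pixels with zero depth are assumed invalid.

```python
import numpy as np

def depth_metrics(pred, gt, valid=None):
    """Compute AbsRel and the delta_n threshold accuracies."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    if valid is None:
        valid = gt > 0  # assume zero/negative depth marks invalid pixels
    p, g = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(p - g) / g)
    ratio = np.maximum(p / g, g / p)
    deltas = {f"delta{n}": float(np.mean(ratio < 1.25 ** n)) for n in (1, 2, 3)}
    return {"abs_rel": float(abs_rel), **deltas}
```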