The classical depth-retrieval pipeline, as described in the Introduction and detailed in Kotwal et al. 2022, contains at its core a maximum-likelihood estimator of the phase differences in the scene. It is informed by a physics model of the capture setup and assumes a Gaussian noise distribution in which the error in any given pixel and frame is uncorrelated with the errors in other pixels or frames. If this assumption does not hold, the output of the pipeline may not be the most accurate output that could be recovered from the information contained within the frames. In particular, there are several reasons to believe that we can do better:

  • It is likely that noise is correlated between pixels, for example if it arises from areal imperfections in the capture setup.
  • We have at our disposal a high-quality RGB image of the scene, which could provide a wealth of information about the scene’s noise distribution or phase properties.
  • Related work in ToF imaging (Baek et al. 2022, Su et al. 2018) has demonstrated improvements over other classical pipelines using neural nets, indicating that more information is available than the classical techniques capture.

Comparing the results of our classical pipeline to those of the slower but more robust optical coherence tomography (OCT), we do see room for improvement. In our experiments, the depth map generated by OCT is treated as ground truth.

Figure 4. This comparison of depth maps generated by OCT (left) and SWI (right) shows that SWI underperforms OCT.


In the classical pipeline, we first compute envelope images by squaring, averaging, and subtracting the frames of the input image stack. This process leaves us with the equivalent of the input images to PSI, without the carrier-wave oscillations.

Figure 5. The first step of the classical pipeline is to remove the carrier wave from the captured frames, leaving us with envelope images that are equivalent to the inputs of PSI.
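The carrier-removal step can be illustrated with a minimal sketch. It assumes, purely for illustration, that each pixel is observed under K equally spaced carrier phase shifts and follows the model I_k = a + b cos(φ + 2πk/K); the function name extract_envelope and this model are assumptions of the sketch, not the exact formulation in Kotwal et al. 2022. Under that model, averaging the frames gives the non-interference term a, and averaging the squared residuals gives b²/2, so the envelope amplitude b can be recovered without knowing the carrier phase φ.

```python
import numpy as np

def extract_envelope(frames):
    """Estimate the envelope amplitude from carrier-shifted frames.

    frames: array of shape (K, H, W), one frame per carrier phase shift.
    Assumed model (illustrative): I_k = a + b * cos(phi + 2*pi*k/K), K >= 3.
    """
    frames = np.asarray(frames, dtype=np.float64)
    # Averaging over the carrier shifts cancels the cosine and leaves a.
    dc = frames.mean(axis=0)
    # Subtracting a and squaring leaves b^2 * cos^2(...), whose mean is b^2 / 2.
    residual_sq = (frames - dc) ** 2
    return np.sqrt(2.0 * residual_sq.mean(axis=0))
```

With noisy frames, this estimate is biased upward by the noise variance, which is one reason a subsequent denoising step is useful.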

Denoising occurs at this stage: either a Gaussian filter or a bilateral filter conditioned on the RGB image is applied to the envelope images, smoothing them out.

Figure 6. The computed envelope images are smoothed using either a Gaussian filter or a bilateral filter.
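As a rough illustration of the second option, a bilateral filter conditioned on a guide image (often called a joint or cross bilateral filter) can be sketched in NumPy as follows. The function name, the parameters radius, sigma_space, and sigma_range, and their default values are assumptions made for this sketch, not the settings used in the pipeline.

```python
import numpy as np

def joint_bilateral_filter(image, guide, radius=3, sigma_space=2.0, sigma_range=0.1):
    """Smooth `image` with weights taken from `guide` (illustrative sketch).

    image: (H, W) array to be denoised, e.g. an envelope image.
    guide: (H, W) or (H, W, C) array, e.g. the RGB image of the scene.
    """
    image = np.asarray(image, dtype=np.float64)
    guide = np.atleast_3d(np.asarray(guide, dtype=np.float64))
    h, w = image.shape
    r = radius
    pad_img = np.pad(image, r, mode="reflect")
    pad_gd = np.pad(guide, ((r, r), (r, r), (0, 0)), mode="reflect")

    num = np.zeros((h, w))
    den = np.zeros((h, w))
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            nb_img = pad_img[r + dy:r + dy + h, r + dx:r + dx + w]
            nb_gd = pad_gd[r + dy:r + dy + h, r + dx:r + dx + w, :]
            # Spatial weight: depends only on the pixel offset.
            w_space = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_space ** 2))
            # Range weight: depends on how different the guide values are.
            diff_sq = np.sum((nb_gd - guide) ** 2, axis=2)
            w_range = np.exp(-diff_sq / (2.0 * sigma_range ** 2))
            weight = w_space * w_range
            num += weight * nb_img
            den += weight
    # The centre pixel always contributes weight 1, so den is never zero.
    return num / den
```

Setting sigma_range very large makes the range weight constant, so the same function then reduces to a truncated Gaussian filter.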

However, the classical approach discards what may be useful correlations between nearby pixels in the captured frames. We aim to demonstrate a CNN, EnvelopeNet, that is capable of learning a more robust envelope extraction function before correlations are removed by denoising. This work is in progress.

Simulating SWI Frames

To train EnvelopeNet, we need a large number of diverse scenes with ground-truth depth and realistic SWI image stacks. Unfortunately, OCT is a time-consuming process (on the order of hours per scene), which means that it is not feasible to create a real training dataset. We address this problem by creating an SWI simulator that generates an SWI image stack based on an input depth map.

To demonstrate the correctness of the SWI simulator, we provide it with the OCT depth map of a test scene, then run the simulated frames through the classical pipeline. In theory, this is a round-trip operation and should yield results similar to those of the pipeline run on the real SWI input frames.

Figure 7. Ground-truth (OCT) depth is on the left; select simulated SWI frames (four of the 16) are in the center; recovered depth from simulated SWI frames is on the right. Note the similarity to the output from the real SWI frames in Figure 4.
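The round-trip check can be illustrated with a toy model. The sketch below is not the simulator used in this work: it assumes an idealized 16-frame stack (four synthetic-wavelength shifts times four carrier shifts) in which each frame is a constant term plus a synthetic-wavelength envelope multiplied by a carrier oscillation. The function names, the product model, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def simulate_swi_stack(depth, synth_wavelength, optical_wavelength,
                       dc=1.0, contrast=0.5, noise_sigma=0.0, rng=None):
    """Toy simulator: returns a (4, 4, H, W) stack indexed [m, n].

    m indexes the synthetic-wavelength shift, n the carrier shift.
    Assumed model: I[m, n] = dc + contrast * envelope[m] * carrier[n].
    """
    rng = np.random.default_rng() if rng is None else rng
    depth = np.asarray(depth, dtype=np.float64)
    phi_s = 4.0 * np.pi * depth / synth_wavelength    # round-trip synthetic phase
    phi_c = 4.0 * np.pi * depth / optical_wavelength  # round-trip carrier phase
    m = np.arange(4).reshape(4, 1, 1, 1)
    n = np.arange(4).reshape(1, 4, 1, 1)
    envelope = np.cos(phi_s / 2.0 + m * np.pi / 4.0)
    carrier = np.cos(phi_c + n * np.pi / 2.0)
    frames = dc + contrast * envelope * carrier
    return frames + noise_sigma * rng.standard_normal(frames.shape)

def recover_depth(stack, synth_wavelength):
    """Toy classical pipeline: envelope extraction, then four-step phase retrieval."""
    # Remove the carrier: subtract the mean over carrier shifts, square, average.
    ac = stack - stack.mean(axis=1, keepdims=True)
    env_sq = 2.0 * np.mean(ac ** 2, axis=1)           # shape (4, H, W)
    # The squared envelope varies as 1 + cos(phi_s + m*pi/2), so the standard
    # four-step formula recovers phi_s.
    phi_s = np.arctan2(env_sq[3] - env_sq[1], env_sq[0] - env_sq[2])
    return np.mod(phi_s, 2.0 * np.pi) * synth_wavelength / (4.0 * np.pi)
```

In this toy model the recovered depth is unambiguous only within half a synthetic wavelength. With noise_sigma set to zero, recover_depth(simulate_swi_stack(d, ...), ...) returns d up to floating-point error for depths inside that range; increasing noise_sigma gives a quick way to see how the classical estimate degrades.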

To generate a sufficiently large dataset, we apply the SWI simulator to the vast trove of ground-truth images available in the Hypersim Dataset.

Measuring Confidence

While providing a confidence value for each pixel to the bilateral solver (introduced in the next section) is optional, we are currently exploring several heuristics to measure the confidence of the initial phase estimate from EnvelopeNet. This will allow the denoiser to preserve areas with high confidence while strongly denoising areas with low confidence. The heuristics under consideration are:

  1. Observed Fisher Information: based on the variance (over the four samples) of the optimal phase value
  2. Fisher Information: based on the expected variance of the optimal phase values
  3. “Signal to Noise Ratio” between the amplitude of the interference term and the non-interference term
  4. Parametrization of Observed Fisher Information as an MLP to learn optimal confidence
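To make heuristic 3 concrete, and to show one possible Fisher-information-style confidence, the sketch below assumes a per-pixel model I_k = a + b cos(φ + 2πk/K) with independent Gaussian noise of known standard deviation. Under that assumed model the Fisher information about φ is K b² / (2σ²), so its inverse is a lower bound on the phase variance. The function name phase_confidence and the noise_sigma parameter are hypothetical, and this is only one reading of the heuristics listed above.

```python
import numpy as np

def phase_confidence(frames, noise_sigma):
    """Per-pixel confidence heuristics under an assumed sinusoidal model.

    frames: (K, H, W) phase-shifted samples, equally spaced over one period.
    noise_sigma: assumed standard deviation of the per-sample Gaussian noise.
    Returns (snr, fisher), each of shape (H, W).
    """
    frames = np.asarray(frames, dtype=np.float64)
    k = frames.shape[0]
    # Non-interference term a: mean over the phase shifts.
    dc = frames.mean(axis=0)
    # Interference amplitude b: from the mean squared residual (b^2 / 2).
    amp = np.sqrt(2.0 * np.mean((frames - dc) ** 2, axis=0))
    # Heuristic 3: interference amplitude relative to the non-interference term.
    snr = amp / np.maximum(dc, 1e-12)
    # Fisher information about the phase for i.i.d. Gaussian noise.
    fisher = k * amp ** 2 / (2.0 * noise_sigma ** 2)
    return snr, fisher
```

Either map could be rescaled and passed to the denoiser as a per-pixel confidence; which one works best is an empirical question.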

Differentiable Bilateral Solver

Since the loss used to train EnvelopeNet is between the output depth map and the ground-truth depth map generated by OCT, we need to be able to backpropagate through the complete pipeline, including the denoiser. This requires us to replace the bilateral filter, which is not differentiable, with the fast bilateral solver (Barron et al. 2015).

Figure 8. (a) initial depth estimate. (b) bilateral filter applied to (a). (c) bilateral solver applied to (a). (d) bilateral solver with optimal λ parameter applied to (a). (e) ground truth depth image. Note the similarity between (d) and (e).

We get improved results even without optimizing the initial layers of our pipeline. The solver with an optimized value of the λ parameter, which weights the smoothness term against the fidelity term, already produces results similar to ground truth in this example.
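The role of λ and the differentiability of the solver can be illustrated with a small dense sketch. The bilateral solver minimizes (λ/2) Σ W_ij (x_i − x_j)² + Σ c_i (x_i − t_i)², where t is the initial estimate, c the per-pixel confidence, and W a bilateral affinity. The version below solves the resulting linear system directly on a 1-D signal; it omits the bilateral-grid acceleration and bistochastic normalization of the fast bilateral solver, and the function names and default parameters are assumptions of this sketch.

```python
import numpy as np

def bilateral_affinity(guide, sigma_space=2.0, sigma_range=0.1):
    """Dense bilateral affinity matrix for a 1-D guide signal (illustrative)."""
    guide = np.asarray(guide, dtype=np.float64)
    pos = np.arange(guide.shape[0], dtype=np.float64)
    d_space = (pos[:, None] - pos[None, :]) ** 2
    d_range = (guide[:, None] - guide[None, :]) ** 2
    w = np.exp(-d_space / (2 * sigma_space ** 2) - d_range / (2 * sigma_range ** 2))
    np.fill_diagonal(w, 0.0)
    return w

def bilateral_solve(target, confidence, w, lam):
    """Minimize lam/2 * sum W_ij (x_i - x_j)^2 + sum c_i (x_i - t_i)^2.

    Setting the gradient to zero gives (lam * L + C) x = C t, with
    L = D - W the graph Laplacian and C = diag(confidence).
    """
    laplacian = np.diag(w.sum(axis=1)) - w
    a = lam * laplacian + np.diag(confidence)
    return np.linalg.solve(a, confidence * target), a

def grad_wrt_target(a, confidence, grad_output):
    """Backpropagate a loss gradient dL/dx through the solve to dL/dt.

    Since x = A^{-1} C t and A is symmetric, dL/dt = C A^{-1} dL/dx.
    """
    return confidence * np.linalg.solve(a, grad_output)
```

With lam set to zero the output equals the initial estimate, and as lam grows the output approaches a confidence-weighted average within regions where the guide is similar. Because the output is the solution of a linear system, gradients with respect to the initial estimate have a closed form, which is what allows a loss on the final depth map to reach the earlier stages of the pipeline.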