Our Approach

Overall architecture of our approach

Here we describe each component of our pipeline for estimating depth from a focal stack.

Multiplane Image (MPI)

We represent the scene with a Multiplane Image (MPI): a set of fronto-parallel planes placed at predefined depths of the scene. Each plane stores an RGB value and an alpha (transparency) value at every pixel location. An RGB image can be rendered by alpha compositing the planes, and a disparity map can be generated by taking the expected value of disparity over the layers, weighted by the compositing weights.
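As a sketch of this rendering step (function and argument names are our own, assuming planes are ordered front to back), the alpha compositing and expected-disparity computations can be written as:

```python
import torch

def composite_mpi(rgb, alpha, layer_disparity):
    """Alpha composite an MPI and compute the expected disparity per pixel.

    rgb:             (D, 3, H, W) color of each plane, front (index 0) to back.
    alpha:           (D, 1, H, W) transparency of each plane, in [0, 1].
    layer_disparity: (D,) predefined disparity of each plane.
    """
    # Transmittance in front of each plane: product of (1 - alpha) of all closer planes.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha[:-1]], dim=0), dim=0)
    weights = alpha * trans                                   # (D, 1, H, W)
    image = (weights * rgb).sum(dim=0)                        # rendered RGB, (3, H, W)
    disparity = (weights[:, 0] * layer_disparity.view(-1, 1, 1)).sum(dim=0)  # (H, W)
    return image, disparity
```

A fully opaque front plane (alpha = 1 everywhere) would simply return that plane's color and disparity, since all planes behind it receive zero weight.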

Architecture and rendering process of a Multiplane Image

We choose the MPI representation for its fast rendering speed, the known depth value of each layer, and its compatibility with patch-based losses such as GAN, perceptual, or SSIM losses.

Applying the Defocus Effect to the MPI

To simulate the defocus effect with the MPI representation and reconstruct each image of the focal stack, we blur different parts of the scene by different amounts, based on the difference between the focal distance and the depth of each MPI layer. Specifically, we apply a disk kernel with a different diameter to each layer of the MPI.

Based on the thin lens model, given the camera's aperture and focal length, and the focal distance (which varies across images of the focal stack), the defocus diameter at a particular scene depth is:

diameter = aperture × focal length × |focal distance − depth| / ((focal distance − focal length) × depth)
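In code, with all quantities in meters and the aperture given as the lens diameter (the names are our own), this is:

```python
def defocus_diameter(aperture, focal_length, focal_distance, depth):
    """Thin-lens defocus (circle of confusion) diameter at a scene depth.

    aperture:       lens aperture diameter.
    focal_length:   lens focal length.
    focal_distance: depth the camera is focused at (varies across the stack).
    depth:          scene depth of the point being imaged.
    """
    return (aperture * abs(focal_distance - depth) * focal_length
            / (focal_distance - focal_length) / depth)
```

A point exactly at the focal distance gets zero blur, and the diameter grows as the point moves away from it; converting this physical diameter to a kernel diameter in pixels then depends on the sensor size and image resolution.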

The relationship between scene depth and blur radius is illustrated in the diagram below.

Relation between blur radius and scene depth

After convolving a disk kernel of the appropriate radius with each MPI layer, we alpha composite the layers to obtain the reconstruction. The reconstruction process is illustrated below:

Reconstruction of focal stack
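A minimal sketch of this defocus rendering (helper names are our own; blur radii are given in pixels per layer, and premultiplied color is blurred together with alpha so the two stay consistent at edges):

```python
import torch
import torch.nn.functional as F

def disk_kernel(radius_px):
    """Normalized binary disk kernel; radius_px = 0 gives an identity kernel."""
    r = max(int(round(radius_px)), 0)
    ys, xs = torch.meshgrid(torch.arange(-r, r + 1).float(),
                            torch.arange(-r, r + 1).float(), indexing="ij")
    k = ((xs ** 2 + ys ** 2) <= r ** 2).float()
    return (k / k.sum()).view(1, 1, 2 * r + 1, 2 * r + 1)

def render_defocused(rgb, alpha, radii_px):
    """Blur each MPI layer with its own disk kernel, then alpha composite.

    rgb:      (D, 3, H, W), alpha: (D, 1, H, W), planes front (index 0) to back.
    radii_px: per-layer blur radius in pixels (set by the thin-lens model
              for the current focal distance).
    """
    color, a = [], []
    for d, r in enumerate(radii_px):
        k = disk_kernel(r)
        pad = k.shape[-1] // 2
        # Blur premultiplied color and alpha with the same kernel.
        color.append(F.conv2d(rgb[d:d + 1] * alpha[d:d + 1],
                              k.repeat(3, 1, 1, 1), padding=pad, groups=3))
        a.append(F.conv2d(alpha[d:d + 1], k, padding=pad))
    color, a = torch.cat(color), torch.cat(a)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(a[:1]), 1.0 - a[:-1]], dim=0), dim=0)
    return (trans * color).sum(dim=0)  # (3, H, W) refocused image
```

With all radii set to zero this reduces to plain alpha compositing of the sharp MPI.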

Optimization of MPIs

We compare each reconstructed image against the corresponding ground-truth image of the focal stack, and optimize the RGB and alpha values stored in each MPI layer by gradient descent, backpropagating the gradient of the loss to those values.

The loss functions we use include the following:

  • RGB L1 reconstruction loss: enforces that the reconstructed image matches the RGB values of the ground truth; L1 is used to encourage sharpness.
  • L1 sparsity loss on alpha values over all depth layers: encourages each object to appear in only a few MPI layers, i.e. each pixel should have high alpha values in only a few layers.
  • SSIM loss: measures the perceptual similarity between the reconstruction and the ground truth.

We also apply the SSIM and RGB L1 reconstruction losses between an all-in-focus image of the scene and the reconstruction obtained by alpha compositing the MPI layers without applying any blur kernel. This extra term encourages the MPI to learn the scene content free of the influence of the blur kernels, increasing the sharpness of the result.
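As a toy sketch of the optimization loop (a random stand-in target, sharp compositing only, and just the L1 reconstruction and alpha sparsity terms; the defocused renders, the SSIM term, and the all-in-focus term would be added in the same way):

```python
import torch

D, H, W = 8, 16, 16
target = torch.rand(3, H, W)            # stand-in for a ground-truth image
rgb = torch.rand(D, 3, H, W, requires_grad=True)
alpha_logits = torch.zeros(D, 1, H, W, requires_grad=True)  # sigmoid keeps alpha in (0, 1)

def composite(rgb, alpha):
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha[:-1]], dim=0), dim=0)
    return (alpha * trans * rgb).sum(dim=0)

optimizer = torch.optim.Adam([rgb, alpha_logits], lr=5e-2)
for step in range(200):
    alpha = torch.sigmoid(alpha_logits)
    recon = composite(rgb, alpha)
    loss = (recon - target).abs().mean()    # RGB L1 reconstruction loss
    loss = loss + 0.01 * alpha.mean()       # L1 sparsity on alpha values
    if step == 0:
        initial_loss = loss.item()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the full pipeline the reconstruction is instead the per-layer defocused render, compared against every image of the focal stack.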


Preliminary Results

Here we showcase our preliminary results on a single scene. Using Blender, we generated a synthetic focal stack of 30 images with the focal distance increasing linearly from 2 m to 5 m. We implemented our pipeline in PyTorch and use 32 MPI layers, spaced linearly in disparity.
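For reference, layer depths spaced linearly in disparity (inverse depth) can be generated as below; the 2 m to 5 m range here is our assumption, matching the focal-distance sweep:

```python
import numpy as np

near, far, num_layers = 2.0, 5.0, 32
disparities = np.linspace(1.0 / near, 1.0 / far, num_layers)  # evenly spaced in 1/depth
depths = 1.0 / disparities                                    # plane depths in meters
```

This spacing places more planes near the camera, where a fixed disparity step corresponds to a smaller depth step.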

Depth estimation

Below is the ground truth depth (left) and predicted depth (right).

It can be observed that our approach recovers the relative depth structure of the scene, with depth differences that correspond to the ground truth depth map. However, there is a mismatch between the ground truth and the predicted depth, and many details are missing or blurry.

Reconstructed focal stack

Below is our reconstructed focal stack at various focal distances. The rendered stack captures the overall defocus effect well, although some details are not rendered correctly and object edges are blurry.

Learnt MPI layers

Below are the learned RGBA MPI layers at various depths of the scene. While the majority of the layers have learnt the correct content, many details of the scene end up in the wrong planes, which explains the missing or inaccurate details seen in the predicted depth map and the reconstructed focal stack.