Supervised learning approach

As introduced on the home page, we can now represent an image as a quadtree. But how do we learn the quadtree structure with a CNN?

Looking at the literature, a typical approach to generating an image from a latent code is to use a “decoder” network: a series of transpose convolutions that gradually upsample the features to the image resolution. Our idea is to map the blocks at each tree level to the output feature map of each transpose convolution. Starting from the first 2×2 feature map produced by a transpose convolution, we map it to the second level of the quadtree. The next transpose convolution upsamples the feature map to 4×4, which corresponds to the third tree level, and so on.
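
For concreteness, here is a minimal PyTorch sketch of such a decoder. The module name, channel width, and number of levels are our own illustrative choices, not the exact architecture described above:

```python
import torch
import torch.nn as nn

class QuadtreeDecoder(nn.Module):
    """Sketch: the k-th upsampling output maps to quadtree level k+1."""
    def __init__(self, latent_dim=8, feat=32, num_levels=5):
        super().__init__()
        # Project the latent code to a 1x1 feature map (the tree root).
        self.stem = nn.Linear(latent_dim, feat)
        # Each transpose convolution doubles resolution: 1x1 -> 2x2 -> 4x4 -> ...
        self.ups = nn.ModuleList([
            nn.Sequential(
                nn.ConvTranspose2d(feat, feat, kernel_size=2, stride=2),
                nn.ReLU(inplace=True),
            )
            for _ in range(num_levels)
        ])

    def forward(self, z):
        x = self.stem(z).view(z.size(0), -1, 1, 1)  # root feature, (B, feat, 1, 1)
        feats = []
        for up in self.ups:
            x = up(x)        # 2x2 at level 2, 4x4 at level 3, ...
            feats.append(x)  # one feature map per tree level
        return feats
```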

On top of each feature map, we attach two heads, each a 1×1 convolution. The first is the class head, which predicts whether each node should split further or terminate. During training we supervise it with a cross-entropy loss over two classes, where class “0” means terminate and class “1” means split. The second is the color head, which predicts the color of each node; we supervise it with an L2 loss. One thing to note is that we only compute the loss at leaf nodes of the ground-truth quadtree, while all other locations are masked out.
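
A sketch of the two heads and the masked per-level loss might look like the following. The head names, the single shared mask, and the unweighted sum of the two losses are assumptions for illustration:

```python
import torch.nn as nn
import torch.nn.functional as F

class LevelHeads(nn.Module):
    """Two 1x1-convolution heads applied to one level's feature map."""
    def __init__(self, feat=32):
        super().__init__()
        self.cls_head = nn.Conv2d(feat, 2, kernel_size=1)    # logits: 0 = terminate, 1 = split
        self.color_head = nn.Conv2d(feat, 3, kernel_size=1)  # RGB color per node

    def forward(self, x):
        return self.cls_head(x), self.color_head(x)

def level_loss(cls_logits, colors, gt_split, gt_colors, leaf_mask):
    """Cross-entropy on split/terminate plus L2 on color, masked to GT leaf locations.

    cls_logits: (B, 2, H, W); colors, gt_colors: (B, 3, H, W);
    gt_split: (B, H, W) long; leaf_mask: (B, H, W) bool.
    """
    ce = F.cross_entropy(cls_logits, gt_split, reduction="none")  # (B, H, W)
    l2 = ((colors - gt_colors) ** 2).sum(dim=1)                   # (B, H, W)
    mask = leaf_mask.float()
    return ((ce + l2) * mask).sum() / mask.sum().clamp(min=1.0)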

Before passing each feature map to the next transpose convolution, we zero it out so that only nodes that need further splitting retain their feature values. In practice, one could use “Sparse Convolution” [2] to implement this masking and compute only the non-zero features throughout the whole network. Applying the series of transpose convolutions yields the final quadtree structure, and we obtain the generated image by aggregating the color predictions, masked by the class predictions, at each stage.
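
One way the masking and final aggregation could look, written here with dense tensor operations (SparseConvNet would replace these with sparse ones); the terminate-then-composite logic is our reading of the scheme, and exact details may differ:

```python
import torch
import torch.nn.functional as F

def mask_for_next_level(feat, cls_logits):
    # Zero the features of nodes predicted to terminate, so that only
    # splitting nodes carry information into the next transpose convolution.
    split = cls_logits.argmax(dim=1, keepdim=True).float()  # 1 where the node splits
    return feat * split

def render_image(cls_logits_per_level, colors_per_level, image_size):
    # Composite the per-level color maps: a level paints color wherever its
    # nodes terminate; regions painted at a coarse level stay fixed afterwards.
    B = colors_per_level[0].size(0)
    img = colors_per_level[0].new_zeros(B, 3, image_size, image_size)
    alive = colors_per_level[0].new_ones(B, 1, image_size, image_size)
    for cls_logits, colors in zip(cls_logits_per_level, colors_per_level):
        terminate = (cls_logits.argmax(dim=1, keepdim=True) == 0).float()
        # Nearest-neighbor upsampling turns node predictions into image blocks.
        term_up = F.interpolate(terminate, size=(image_size, image_size), mode="nearest")
        color_up = F.interpolate(colors, size=(image_size, image_size), mode="nearest")
        img = img + alive * term_up * color_up  # paint terminated blocks
        alive = alive * (1.0 - term_up)         # mark those regions as final
    return img
```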

In our experiment, we give the network an 8-dimensional latent code that controls the location, rotation, scale, and color of a rectangle on a black background (see the rendering sketch after the list below). We then ask the network to predict the quadtree structure representing the rendered image. Results are shown below, and there are three main observations:

  1. During training, the network can represent the image very well. However, the quality deteriorates drastically at test time.
  2. The color prediction is not uniform, even though predicting the same color from every color head should be an easy task for the network to learn.
  3. There are strange gray artifacts in the final representation.
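
For concreteness, here is a minimal sketch of how such a training image could be rendered from the latent code. The exact 8-dimensional layout (here center x/y, rotation, width, height, and RGB color, all in [0, 1]) is our guess, since the original split is not specified:

```python
import numpy as np
from PIL import Image, ImageDraw

def render_rectangle(z, size=64):
    """Render one training image from an 8-dim latent in [0, 1]^8 (layout is a guess)."""
    cx, cy = z[0] * size, z[1] * size  # rectangle center
    angle = z[2] * np.pi               # rotation
    w, h = (z[3] * 0.5 + 0.1) * size, (z[4] * 0.5 + 0.1) * size  # width, height
    color = tuple(int(c * 255) for c in z[5:8])                  # RGB
    # Axis-aligned corners, rotated about the center.
    corners = 0.5 * np.array([[-w, -h], [w, -h], [w, h], [-w, h]])
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    pts = corners @ rot.T + np.array([cx, cy])
    img = Image.new("RGB", (size, size), "black")
    ImageDraw.Draw(img).polygon([tuple(p) for p in pts], fill=color)
    return np.asarray(img, dtype=np.float32) / 255.0
```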

Here, we won’t go into a detailed analysis of these failure cases. The conclusion is that, at test time, the network has a hard time with tree structures it hasn’t seen during training. This is because we only supervise training with a single ground-truth tree structure, so the network never learns to explore unknown tree structures on its own. Another problem is more fundamental: as the network decides whether or not to split each node at each tree level, it never learns whether the opposite decision would have been better. This makes the sequential decision process very difficult to learn with gradient-based CNN training. One piece of evidence is that when we visualize the images predicted by our network while interpolating one dimension of the latent code (shown below), the transition is not smooth, because we can’t really “interpolate” between different tree structures.

[Figures: latent-code interpolation, ground truth vs. network-generated result]

Based on this conclusion, we decide to try another learning algorithm that is better suited to sequential decision processes: Reinforcement Learning (RL).

Reference

[2] SparseConvNet: https://github.com/facebookresearch/SparseConvNet