This is a capstone project at Carnegie Mellon University (CMU) sponsored by Facebook Reality Labs (FRL). The project is advised by Jason Saragih from FRL and Ji Zhang from CMU.


To build the new era of the metaverse, one of the biggest challenges is how to replicate the face-to-face challenge while people are remote. To build avatars, there are various elaborate steps involved, especially for motion tracking. One of the effective tools that make this work is to use neural networks to build a volumetric model, which you can imagine in a 3D space, each voxel has corresponding RGB color and transparency to describe the scene. And then using the ray marching technique to integrate the rgba of every voxel along the ray from a specific viewpoint. Finally, to determine the color shown in 2D space. Ray marching could be very slow because this method represents the space uniformly. Therefore, we are wondering if there are ways to make it a little bit faster. One fact we may leverage is that, in most of the interesting scenes, most of the space is in fact empty or transparent. So one possible solution might be to use a more efficient data structure to represent the scene, saving memory for the space you don’t really care about. One of the classical efficient data structures is octree, or quadtree in 2D space. Therefore, we want to explore the possibility of using octree to represent visual data, and do ray marching on the sparse octree to efficiently render images.


Our final goal is to render an image of a 3D object efficiently with an octree given latent vector of novel view. Given a latent vector that encodes the information of the target object, we want to design a neural network that can learn to generate the corresponding octree dynamically. In other words, we want to learn an octree to represent visual data. Since the octree representation, theoretically, can use less memory to represent the object, the ray marching process can accordingly save more time and computation. In short, in this task, we have a latent vector as the input, and want to learn a model to generate an octree structure, or quatree structure in 2D space, as the output prediction.

In this project, we introduce two methods to learn the quadtree/octree structure for representing 2D image/3D volume. For simplicity, we first experiment our idea with 2D data. After it works, we can extend it to 3D.


How do we represent the image as a quadtree?

First of all, we discuss how to represent an image as a quadtree. Given an image, we know the color of each pixel, such as the 4×4 image below. An efficient way to represent pixels is to group neighboring pixels of the same color together. For example, the bottom left 2×2 pixels all have the color yellow, so we could group the pixels into a “block of yellow”. If we group pixels into blocks intelligently, we could actually use a tree structure to save the relationship between blocks. The example below shows a particular tree structure called a “quadtree”. 

A quadtree has at most 4 children at each node. It is the most straightforward way to represent an image as a tree. Starting from the root of the tree, this corresponds to the block where we group all pixels into one giant block, e.g. grouping all 4×4 pixels below. Obviously, with only one root node is not a reasonable representation, because the pixels in the block have different colors. So we split the block into 4 sublocks evenly, i.e. growing 4 children for the root node. Now all blocks except the top left 2×2 block could correctly represent the color. So we further split the top left block. We iterate this process of splitting blocks until all pixel colors are represented correctly. Then we get our quadtree representation of the image.

Representing an image as a quadtree. Source: [1]


[1] Chitta, Kashyap, Jose M. Alvarez, and Martial Hebert. “Quadtree generating networks: Efficient hierarchical scene parsing with sparse convolutions.” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2020.