We study the task of novel-view reconstruction using a few (3-5) input views. Our key intuition is that given a large collection of sparse-view images of objects, one can learn a prior describing the geometry of such objects. This prior knowledge of objects can be harnessed at inference to reconstruct novel views by observing the object from just a few viewpoints.
Building a rich representation. The ability to infer the 3D structure of the world around us is crucial for applications across AI and graphics. Robotic agents that manipulate generic objects, AR assistants that understand user activities, and mapping and exploration systems can all benefit from 3D scene understanding. Reasoning in a 3D-aware space yields a richer representation of the environment than purely 2D analysis. It is therefore imperative to study techniques that can capture the 3D geometry of the environment.
Synthesizing novel views. Traditional methods build explicit 3D models of objects, such as point clouds and meshes, that can be used for downstream tasks. However, such solutions often require 3D data that is scarce and expensive to collect. With the success of deep learning, there has been growing interest in implicit neural representations that capture the geometry of objects using a deep neural network. A core strength of such methods is the ability to render novel viewpoints, i.e., to predict what an object would look like from an “unseen” view.
Over the course of this project, we wish to study such techniques that implicitly capture geometry and synthesize novel views of objects. We hope this ability will help us build higher-level representations of the world around us.
Sparse-view 3D Reconstruction. We model the problem as ‘sparse-view’ reconstruction of generic objects – given a few images of an object with approximate viewpoint information, we wish to render the object from novel viewpoints. To achieve this, we aim to combine geometry-based and learning-based techniques to optimally leverage training-time datasets and inference-time sparse-view images. While we describe this paradigm for objects, it extends to arbitrary scenes as well.
Bridging single and multi-view methods. On one hand, single-view reconstruction methods can predict 3D from just one input image of an object, but they are limited to a small set of categories and struggle to model fine geometry. On the other hand, classical or optimization-based multi-view methods can accurately reconstruct generic objects but require densely captured viewpoints. Moreover, since they do not learn priors across objects, they cannot predict structure that is not directly observed. Sparse-view reconstruction combines the essence of single-view and multi-view methods, in that we seek generalization to novel objects (or scenes, as the case may be). The overarching objective is to faithfully reconstruct what is observed, and to hallucinate what is hidden, conditioned on the few input images that provide an object-specific prior.
Learning a prior from a training dataset. Our key intuition is that, given a large dataset containing various objects (or scenes), one can learn a prior over the geometry of such entities. Owing to this prior, a few input views can provide enough information to infer novel viewpoints. More formally, suppose we are given a set of posed images (x_n, c_n), where x_n is an image and c_n the corresponding camera viewpoint. We wish to extract an object (or scene) prior p,
p = E(x_n, c_n)
Given this prior, one can synthesize images y from novel query viewpoints q,
y = D(q | p)
The goal of the training stage is to learn the encoder E, which aggregates input views into the prior, and the decoder D, which renders novel views. At inference, one can then supply input views (x_n, c_n) and query novel viewpoints q using these learned functions.
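The encoder/decoder abstraction above can be sketched as follows. This is a minimal toy illustration, not the proposed system: the linear maps, dimensions, and mean-pooling over views are placeholder assumptions standing in for learned networks E and D, and the "images" are flattened feature vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; a real system would use CNN/NeRF-style models).
IMG_DIM, CAM_DIM, PRIOR_DIM = 32, 6, 16

# Randomly initialized linear maps stand in for the learned functions E and D.
W_enc = rng.normal(size=(IMG_DIM + CAM_DIM, PRIOR_DIM))
W_dec = rng.normal(size=(CAM_DIM + PRIOR_DIM, IMG_DIM))

def encode_prior(images, cameras):
    """E: aggregate posed input views (x_n, c_n) into a single prior p.
    Pooling over views makes p invariant to view order and count."""
    feats = np.concatenate([images, cameras], axis=1) @ W_enc  # (N, PRIOR_DIM)
    return feats.mean(axis=0)                                  # p

def decode_view(query_cam, prior):
    """D: synthesize an output y (a feature vector standing in for an
    image) for a query viewpoint q, conditioned on the prior p."""
    return np.concatenate([query_cam, prior]) @ W_dec          # y = D(q | p)

# Inference with 3 sparse input views of one object:
images = rng.normal(size=(3, IMG_DIM))   # x_n (flattened toy "images")
cameras = rng.normal(size=(3, CAM_DIM))  # c_n (camera viewpoints)
p = encode_prior(images, cameras)
y = decode_view(rng.normal(size=CAM_DIM), p)
print(p.shape, y.shape)  # (16,) (32,)
```

Because the prior is pooled over an arbitrary number of views, the same interface serves both the training stage (many objects, each with a few posed views) and inference (a handful of views of a new object).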
To summarize, we aim to develop a method that combines aspects of single-view and multi-view reconstruction approaches – enabling generalization by learning priors across objects, while allowing us to capture finer details using multi-view constraints.