Structure from Motion
Structure from Motion (SfM) is a technique that reconstructs the 3D structure of a scene or an object from a set of images captured from multiple viewpoints.
A general SfM pipeline includes 5 steps:
- Feature Extraction: Extracting interest points from each image
- Feature Matching: Generating correspondences between pairs of images
- Pose Estimation: Determining the position and orientation of the camera relative to a reference
- Triangulation: Determining points in 3D space given their projections onto two or more images
- Bundle Adjustment: Jointly refining the 3D reconstructed points and the camera parameters by minimizing a reprojection error
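To make the triangulation step concrete, here is a minimal sketch of linear (DLT) triangulation from two views using only numpy. The camera matrices and the 3D point are toy values chosen for illustration, not taken from any real pipeline:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two views.

    P1, P2 : 3x4 camera projection matrices.
    x1, x2 : 2D pixel observations of the same 3D point.
    Returns the 3D point in non-homogeneous coordinates.
    """
    # Each observation contributes two rows of the homogeneous system A X = 0.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector for the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Two toy cameras: identity pose, and a unit baseline along the x-axis.
K = np.eye(3)
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

# Project a known 3D point into both views, then recover it.
X_true = np.array([0.5, 0.2, 4.0])
x1 = P1 @ np.append(X_true, 1.0)
x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0)
x2 = x2[:2] / x2[2]

X_est = triangulate(P1, P2, x1, x2)
```

With noise-free observations the DLT solution recovers the point exactly; in a real pipeline the correspondences are noisy, which is why bundle adjustment then refines the points and cameras jointly.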
We have seen that traditional SfM pipelines work well when many images are available. Now consider a case with far fewer images. Suppose you are browsing Craigslist and stumble upon an antique dresser that you want to model for your house in the Metaverse, but only 8 images are available. You try the SfM pipeline anyway. What you get is either a sparse 3D point cloud that barely resembles the dresser, or the traditional pipeline fails to build a point cloud at all. That is why we need a more powerful SfM method that can produce good 3D points and camera poses even from very few images.
Ultimately we want to solve SfM as a whole, but for our initial experiments we focus on pose estimation in particular.
Given a set of input views and a query image, we are trying to predict the camera pose of the query image.
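As a reminder of what "camera pose" means here, the snippet below represents a pose as a 4x4 rigid transform built from a rotation and a translation, and computes the pose of a query camera relative to a reference view. The specific rotation and translation values are illustrative only:

```python
import numpy as np

def pose_matrix(R, t):
    """Build a 4x4 world-to-camera transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def relative_pose(T_a, T_b):
    """Pose of camera b expressed in camera a's frame."""
    return T_b @ np.linalg.inv(T_a)

# A 90-degree rotation about the y-axis, translated 2 units along z.
c, s = np.cos(np.pi / 2), np.sin(np.pi / 2)
R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

T_ref = pose_matrix(np.eye(3), np.zeros(3))      # reference view
T_query = pose_matrix(R, np.array([0, 0, 2.0]))  # the pose we want to predict

T_rel = relative_pose(T_ref, T_query)
```

Predicting the query pose then amounts to regressing such a transform (or an equivalent parameterization, e.g. a quaternion plus translation) relative to the reference frame defined by the input views.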
Let us look at how humans approach this problem. If you are shown images of chairs taken from different views, you can instantly guess the directions from which the images were taken. There are correspondences we latch on to, but we can do this efficiently because we have a higher-level semantic understanding of chairs, built by looking at many images of chairs before. This makes us believe that data-driven priors will be very useful for SfM.
Scene Representation Transformers
Our current work is inspired by the Scene Representation Transformer (SRT), which uses data-driven priors to generate novel views from multi-view inputs. Given a set of multi-view inputs and a query pose, SRT generates the image corresponding to that pose.
Our proposed method is very close to SRT, with two major differences. First, instead of concatenating the images and camera poses before the CNN backbone and training from scratch, we concatenate the camera poses after the CNN, which lets us use pre-trained CNN models. Second, rather than generating views from query rays, we estimate the camera poses of query images, which are also encoded by the CNN.
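The first difference can be sketched at the tensor-shape level. Below, a stub stands in for the pre-trained CNN backbone (a real system would use e.g. a ResNet), and the camera pose is concatenated to every patch token *after* the backbone, so the backbone never needs to see poses during pre-training. All dimensions and the random projection are illustrative assumptions, not our actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_backbone(image):
    """Stand-in for a pre-trained CNN: maps an HxWx3 image to N patch tokens.
    This stub average-pools 16x16 patches and applies a fixed random
    projection; a real backbone would be a pre-trained network."""
    H, W, _ = image.shape
    patches = image.reshape(H // 16, 16, W // 16, 16, 3).mean(axis=(1, 3))
    tokens = patches.reshape(-1, 3)
    return tokens @ rng.standard_normal((3, 64))   # (N, 64)

def encode_view(image, pose_vec):
    """Concatenate the camera pose to each token AFTER the backbone,
    so the backbone itself can stay pre-trained and pose-agnostic."""
    tokens = cnn_backbone(image)                   # (N, 64)
    pose = np.broadcast_to(pose_vec, (tokens.shape[0], pose_vec.size))
    return np.concatenate([tokens, pose], axis=1)  # (N, 64 + pose_dim)

image = rng.random((64, 64, 3))
pose_vec = rng.standard_normal(12)  # e.g. a flattened 3x4 [R|t]
tokens = encode_view(image, pose_vec)
```

In contrast, SRT-style early fusion would concatenate pose information to the image channels before the CNN, which forces the backbone to be trained from scratch; deferring the concatenation is what makes pre-trained weights reusable.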