Problem Statement
Given a set of cameras with known intrinsic parameters, estimate the 6-DoF pose (i.e., the rotation and translation parameters) of each camera in the scene.
Solution
This is a well-researched problem and can be solved with either Structure from Motion (SfM) or a deep learning-based approach. Since the SfM approach performs better than the deep learning-based ones, Structure from Motion is used here to obtain the rotation and translation parameters of the cameras. The proposed pipeline is based on incremental SfM.
Correspondence search and matching
Keypoint Detection and Description
Identifying keypoints and describing them is an ill-posed task, as there is no single correct solution to the problem. A keypoint detector and descriptor should find interesting points in the images and describe them in a manner that is invariant to viewpoint, illumination, etc. I have chosen the SuperPoint model for these tasks. It is a self-supervised framework for training interest point detectors and descriptors suitable for a large number of multiple-view geometry problems in computer vision.
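SuperPoint itself needs trained network weights, but the detection half of the task can be illustrated with a classical hand-crafted baseline. Below is a minimal NumPy sketch of the Harris corner detector, used here purely as a stand-in for a learned detector; the toy image and constants are illustrative, not from the actual pipeline.

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner response for a 2-D grayscale float image."""
    Iy, Ix = np.gradient(img)                     # image gradients (rows, cols)

    def box3(a):                                  # 3x3 box filter (window mean)
        p = np.pad(a, 1, mode="edge")
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3)) / 9.0

    # Entries of the local structure tensor M, smoothed over a 3x3 window.
    Sxx, Syy, Sxy = box3(Ix * Ix), box3(Iy * Iy), box3(Ix * Iy)
    # R = det(M) - k * trace(M)^2: large positive at corners, negative on edges,
    # near zero in flat regions.
    return (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2

# Toy image: a bright square whose corners should fire strongly.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
R = harris_response(img)
```

Keypoints would then be taken as local maxima of the response map; a learned detector like SuperPoint replaces this hand-crafted response with a network output and additionally emits a descriptor per keypoint.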
Feature Matching
Feature matching is the task of matching keypoints from one image to another image that captures the same part of the scene. This helps us find correspondences between the images and remove outliers. I have opted for the SuperGlue model for this. It is a graph neural network approach to matching keypoints. One of the interesting aspects of this method is that it takes the whole image as input for matching, as opposed to a part of the image, which helps it avoid mismatching points with similar texture.
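SuperGlue is a learned matcher, but the interface it fills can be sketched with the classical baseline it improves on: mutual nearest-neighbour matching with Lowe's ratio test over descriptor distances. The descriptor arrays below are synthetic stand-ins.

```python
import numpy as np

def match_descriptors(d1, d2, ratio=0.8):
    """Mutual nearest-neighbour matching with Lowe's ratio test.

    d1: (N, D) and d2: (M, D) descriptor arrays.
    Returns a list of (i, j) index pairs into d1 and d2.
    """
    # Pairwise Euclidean distances between all descriptors.
    dists = np.linalg.norm(d1[:, None, :] - d2[None, :, :], axis=2)
    matches = []
    for i in range(len(d1)):
        order = np.argsort(dists[i])
        best, second = order[0], order[1]
        # Ratio test: reject matches that are not clearly better than the runner-up.
        if dists[i, best] < ratio * dists[i, second]:
            # Mutual check: i must also be the nearest neighbour of `best`.
            if np.argmin(dists[:, best]) == i:
                matches.append((i, best))
    return matches

# Synthetic demo: d2 is d1 with its rows permuted, so the
# correct matches are exactly that permutation.
d1 = np.eye(4)
d2 = np.eye(4)[[2, 0, 3, 1]]
matches = match_descriptors(d1, d2)
```

Unlike this local, per-descriptor scheme, SuperGlue reasons jointly over all keypoints in both images, which is what lets it disambiguate repeated texture.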
Pose Estimation
This solution approach is inspired by classical multi-view geometry. Consider a set of N cameras. Initially, we select a pair of cameras and undistort their images. Next, we obtain keypoint correspondences and compute the fundamental matrix between the two views using the standard 8-point algorithm. We then obtain the essential matrix from the fundamental matrix and decompose it into rotation and translation parameters. Furthermore, we obtain 3D points using triangulation and perform local bundle adjustment to optimize the parameters. For each successive camera, we find 2D-3D correspondence pairs in its image with the help of the previous images and their triangulated 3D points, and solve the PnP problem to obtain its camera pose.
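The two-view initialization step can be sketched in NumPy. The function below is the normalized 8-point algorithm; the intrinsics K, the relative pose (R, t), and the point cloud are made-up values used only to generate noiseless synthetic correspondences for the demo. From the recovered F, the essential matrix is E = K^T F K, which would then be decomposed via SVD (in practice, e.g. with OpenCV's recoverPose) into the rotation and translation.

```python
import numpy as np

def eight_point_fundamental(x1, x2):
    """Normalized 8-point algorithm. x1, x2: (N, 2) pixel correspondences."""
    def normalize(pts):
        # Shift to centroid and scale so the mean distance is sqrt(2).
        c = pts.mean(axis=0)
        s = np.sqrt(2) / np.mean(np.linalg.norm(pts - c, axis=1))
        T = np.array([[s, 0, -s * c[0]],
                      [0, s, -s * c[1]],
                      [0, 0, 1.0]])
        return np.column_stack([pts, np.ones(len(pts))]) @ T.T, T

    p1, T1 = normalize(x1)
    p2, T2 = normalize(x2)
    # Each correspondence contributes one row of the linear system A f = 0,
    # encoding the epipolar constraint x2^T F x1 = 0.
    A = np.column_stack([
        p2[:, 0] * p1[:, 0], p2[:, 0] * p1[:, 1], p2[:, 0],
        p2[:, 1] * p1[:, 0], p2[:, 1] * p1[:, 1], p2[:, 1],
        p1[:, 0], p1[:, 1], np.ones(len(p1))])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce rank 2: fundamental matrices are singular by construction.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    # Undo the normalization.
    return T2.T @ F @ T1

# --- synthetic two-view demo (hypothetical intrinsics and pose, noiseless) ---
rng = np.random.default_rng(0)
K = np.array([[800.0, 0, 320], [0, 800, 240], [0, 0, 1]])
theta = np.deg2rad(5)
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([1.0, 0.2, 0.1])
X = rng.uniform([-2, -2, 4], [2, 2, 8], (20, 3))   # 3D points in front of both cameras

def project(P3, Rc, tc):
    x = (K @ (Rc @ P3.T + tc[:, None])).T
    return x[:, :2] / x[:, 2:]

x1 = project(X, np.eye(3), np.zeros(3))            # camera 1 at the origin
x2 = project(X, R, t)                              # camera 2
F = eight_point_fundamental(x1, x2)
E = K.T @ F @ K                                    # essential matrix
```

With noisy real matches, this estimation is wrapped in RANSAC, and the decomposition of E yields four (R, t) candidates, disambiguated by checking which one places the triangulated points in front of both cameras.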
Evaluation Metric
For evaluation, I have used reprojection error as the metric. The reprojection error of a 3D point is the pixel distance between its projection into an image (under the estimated camera pose) and the keypoint originally detected in that image; the mean over all observations is reported.
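Concretely, the metric can be computed as below; the camera matrix and the half-pixel offset injected into the "observations" are illustrative values for the demo, not numbers from the actual reconstruction.

```python
import numpy as np

def reprojection_error(X, x_obs, K, R, t):
    """Mean pixel distance between observed keypoints and reprojected 3D points."""
    x = (K @ (R @ X.T + t[:, None])).T   # project with the estimated pose
    x = x[:, :2] / x[:, 2:]              # perspective divide
    return np.linalg.norm(x - x_obs, axis=1).mean()

# Illustrative check: observations offset by (0.3, 0.4) px give a 0.5 px error.
rng = np.random.default_rng(1)
K = np.array([[800.0, 0, 320], [0, 800, 240], [0, 0, 1]])
X = rng.uniform([-1, -1, 4], [1, 1, 8], (50, 3))
x_true = (K @ X.T).T
x_true = x_true[:, :2] / x_true[:, 2:]
err = reprojection_error(X, x_true + [0.3, 0.4], K, np.eye(3), np.zeros(3))
```

This is the same quantity that bundle adjustment minimizes, summed over all cameras and points, so it directly measures how self-consistent the recovered poses and structure are.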
Result
| Stats | SuperPoint + SuperGlue |
| --- | --- |
| Number of 3D points | 747 |
| Mean observations per image | 589 |
| Mean reprojection error (px) | 0.36 |