Structure from Motion

+ Prerequisites

3d scene with object and cameras surrounding it, showing 2d image planes, keypoints that fall on 3d object. Entire thing is rotating slowly.

Structure from Motion (SfM) is an approach to 3d reconstruction that estimates 3d points in an environment from a set of 2d images, while also estimating the 3d pose of the camera that took each photo. SfM works by combining many core topics in computer vision.

Feature Extraction

First we extract features from each image. There are many techniques for feature extraction, including more recent approaches like SuperPoint that leverage deep convolutional neural networks. For this demo we'll use the classic Scale-Invariant Feature Transform (SIFT).

Animation showing how SIFT works borrowed from SIFT page.

Feature Matching

Next we determine which pairs of images have matching features. We'll brute-force check every pair of images, but there are alternative strategies available for larger datasets. If the photos were captured sequentially from the same camera, it's also often possible to have an initial sense of which photos are likely to have corresponding features.
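The brute-force check can be sketched with plain NumPy. This version assumes descriptors are rows of a float array (as SIFT produces) and filters ambiguous matches with Lowe's ratio test. The ratio test and the `ratio=0.75` threshold are assumptions added for illustration, and `match_descriptors` and `match_all_pairs` are hypothetical helper names.

```python
from itertools import combinations

import numpy as np


def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Nearest-neighbour matching from desc_a to desc_b with a ratio test."""
    if len(desc_a) == 0 or len(desc_b) < 2:
        return []
    a = np.asarray(desc_a, dtype=np.float64)
    b = np.asarray(desc_b, dtype=np.float64)
    # Squared Euclidean distance between every descriptor in a and every one in b.
    d2 = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    d2 = np.maximum(d2, 0.0)
    order = np.argsort(d2, axis=1)
    best, second = order[:, 0], order[:, 1]
    rows = np.arange(len(a))
    # Keep a match only if it is clearly better than the runner-up.
    keep = d2[rows, best] < (ratio ** 2) * d2[rows, second]
    return [(int(i), int(best[i])) for i in rows[keep]]


def match_all_pairs(descriptors_per_image, ratio=0.75):
    """Brute force: try every unordered pair of images."""
    matches = {}
    for i, j in combinations(range(len(descriptors_per_image)), 2):
        matches[(i, j)] = match_descriptors(
            descriptors_per_image[i], descriptors_per_image[j], ratio
        )
    return matches
```

The number of image pairs grows quadratically with the number of photos, which is why larger datasets tend to need a cheaper way to propose candidate pairs.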

Animation showing photos with keypoints, and two of them being checked for matches. The matches are highlighted.

Initial Camera Pose Estimation

We need an initial estimate of the pose of the camera that took each photo. Often we have access to additional sensor data that would help here, but we'll present a technique that relies solely on the matched keypoints in the images. First we estimate the relative rotations of the cameras, then their translations. Along the way we can reject matched image pairs that do not agree with the estimates.
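To make the rotation stage and the rejection idea concrete, here is a simplified sketch. It assumes each matched pair (i, j) already has a relative rotation `R_ij` (for example, from an essential matrix), with the assumed convention `R_j = R_ij @ R_i`. It chains rotations along a breadth-first spanning tree and then drops pairs whose relative rotation disagrees with the result. This is not full rotation averaging: a bad pair that ends up in the tree would corrupt the estimates, so a real pipeline would likely use something more robust. All function names and the 5 degree threshold are hypothetical.

```python
from collections import deque

import numpy as np


def rotation_angle_deg(R):
    """Angle of a rotation matrix, in degrees."""
    cos = (np.trace(R) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def chain_rotations(n_cams, rel_rots, root=0):
    """Propagate global rotations from the root camera along a BFS spanning tree."""
    neighbors = {i: [] for i in range(n_cams)}
    for (i, j), R_ij in rel_rots.items():
        neighbors[i].append((j, R_ij))
        neighbors[j].append((i, R_ij.T))  # the inverse rotation goes the other way
    global_rots = {root: np.eye(3)}
    queue = deque([root])
    while queue:
        i = queue.popleft()
        for j, R_ij in neighbors[i]:
            if j not in global_rots:
                global_rots[j] = R_ij @ global_rots[i]
                queue.append(j)
    return global_rots


def consistent_pairs(global_rots, rel_rots, max_angle_deg=5.0):
    """Keep pairs whose measured relative rotation agrees with the global estimates."""
    kept = []
    for (i, j), R_ij in rel_rots.items():
        if i not in global_rots or j not in global_rots:
            continue
        # R_j @ R_i^T should equal R_ij, so this residual should be near identity.
        residual = global_rots[j] @ global_rots[i].T @ R_ij.T
        if rotation_angle_deg(residual) <= max_angle_deg:
            kept.append((i, j))
    return kept
```

Translations can be handled in a similar spirit once the rotations are fixed, since each surviving pair then constrains only the direction between two camera centers.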

Animation showing process of estimating rotations, then translations.

Estimate 3d Points

Description
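One common way to estimate a 3d point from matched keypoints and known camera poses is linear triangulation (the direct linear transform). The sketch below is an assumption about how this step could be done, not necessarily the method used in this demo. It takes 3x4 projection matrices `P = K [R | t]` and the pixel where the same point was observed in each image. `project` and `triangulate_point` are hypothetical helper names.

```python
import numpy as np


def project(P, X):
    """Project a 3d point X with a 3x4 projection matrix P to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def triangulate_point(proj_mats, pixels):
    """Linear (DLT) triangulation of one point seen in two or more images."""
    rows = []
    for P, (u, v) in zip(proj_mats, pixels):
        # Each observation contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.2, -0.1, 5.0])
X_est = triangulate_point([P1, P2], [project(P1, X_true), project(P2, X_true)])
```

With noisy keypoints the linear estimate is only approximate, which is one reason a refinement step follows.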

description of animation

Bundle Adjustment

During bundle adjustment we jointly refine the estimates of each camera's intrinsic parameters, each camera's pose, and the 3d point positions by minimizing the reprojection error: the difference between where the camera model predicts a 3d point should appear in the image, given the camera's pose, and where its keypoint actually lies in the 2d image.
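The quantity being minimized can be written down directly. The sketch below assumes a simple pinhole camera with intrinsics `K`, a world-to-camera rotation `R` and translation `t`, and no lens distortion. Those modeling choices and the function names are assumptions for illustration. A nonlinear least-squares solver (for example `scipy.optimize.least_squares`) could then adjust all cameras and points to shrink these residuals.

```python
import numpy as np


def project_points(K, R, t, points_3d):
    """Project Nx3 world points into pixel coordinates with a pinhole model."""
    cam = points_3d @ R.T + t          # world frame -> camera frame
    pix = cam @ K.T                    # apply intrinsics
    return pix[:, :2] / pix[:, 2:3]    # perspective divide


def reprojection_residuals(K, R, t, points_3d, observed_2d):
    """Predicted minus observed pixel positions, flattened for a least-squares solver."""
    return (project_points(K, R, t, points_3d) - observed_2d).ravel()


def reprojection_cost(K, R, t, points_3d, observed_2d):
    """Half the sum of squared reprojection residuals for one camera."""
    r = reprojection_residuals(K, R, t, points_3d, observed_2d)
    return 0.5 * float(r @ r)
```

In a full bundle adjustment the residuals from every camera and every observed point are stacked into one vector, so that poses, intrinsics, and 3d points are all adjusted against the same objective.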

description of animation

By this point we have a refined estimate of the camera poses and 3d keypoints.