Wednesday, November 30, 2011

Formulation for two-view depth estimation

Here is the formulation we use to obtain depth estimates from disparity data.

We now describe the geometry that defines the relationship between two corresponding pixels and the depth of a 3D point. Let us consider a 3D Euclidean point (X, Y, Z)^T captured by two cameras and the two corresponding projected pixels p1 and p2. Using the camera parameters (see Equation (2.18)), the pixel positions p1 = (x1, y1, 1)^T and p2 = (x2, y2, 1)^T can be written as
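(The equation itself did not carry over here; presumably it is the standard pinhole projection. With Ki, Ri, and Ci denoting the intrinsics, rotation, and center of camera i, and λi a projective scale factor, the relation would read roughly as follows.)

```latex
% Assumed form of the projection relations (standard pinhole model):
\lambda_1 \, p_1 = K_1 R_1 \big( (X, Y, Z)^T - C_1 \big), \qquad
\lambda_2 \, p_2 = K_2 R_2 \big( (X, Y, Z)^T - C_2 \big)
```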


Figure 3.2: Two aligned cameras capturing rectified images can be employed to estimate the depth using triangulation.

The previous relation can be simplified by considering the restricted case of two rectified views, so that the following assumptions can be made. First, without loss of generality, the world coordinate system is selected such that it coincides with the coordinate system of camera 1 (see Figure 3.2). In this case, it can be deduced that C1 = 03 and R1 = I3×3. Second, because the images are rectified, both rotation matrices are equal: R1 = R2 = I3×3. Third, camera 2 is located on the X axis: C2 = (tx2, 0, 0)^T. Fourth, both cameras are identical, so that the internal camera parameter matrices are equal, leading to
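(The equation image for (3.4) is also missing. Under the four assumptions above, and writing f for the focal length and (ox, oy) for the principal point, which are labels assumed here, the projections reduce to something like the following, which matches the disparity f·tx2/Z quoted in the next paragraph.)

```latex
% Assumed reconstruction of Equation (3.4) for the rectified, identical-camera case:
x_1 = f\,\frac{X}{Z} + o_x, \qquad
x_2 = f\,\frac{X - t_{x2}}{Z} + o_x, \qquad
y_1 = y_2
\quad\Longrightarrow\quad
x_1 - x_2 = \frac{f\, t_{x2}}{Z}
```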

Equation (3.4) provides the relationship between two corresponding pixels and the depth Z of the 3D point (for the simplified case of two rectified views). The quantity f·tx2/Z is typically called the disparity. In practice, the disparity corresponds to the parallax motion of objects. Parallax motion is the motion of objects observed from a moving viewpoint. This can be illustrated with the example of a viewer sitting in a moving train: the motion parallax of the foreground grass along the train tracks is larger than that of a tree far away in the background.
It can be noted that the disparity is inversely proportional to the depth, so that a small disparity value corresponds to a large depth. To emphasize the difference between the two quantities, we indicate that the following two terms will be used distinctly in this thesis.
Disparity image/map: an image that stores the disparity values of all pixels.
Depth image/map: an image that represents the depth of all pixels.
The reason that we emphasize the difference so explicitly is that we exploit it in the sequel of this thesis. Typically, a depth image is estimated by first computing a disparity image from two rectified images and then converting the disparity values into depth values. In the second part of this chapter, we show that such a two-stage computation can be circumvented by directly estimating the depth using an alternative technique based on an appropriate geometric formulation of the framework.
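To make that two-stage pipeline concrete, here is a minimal OpenCV sketch, assuming a rectified grayscale pair; the file names, focal length, and baseline below are placeholders, not calibrated values:

```python
import cv2
import numpy as np

# Rectified grayscale stereo pair; file names are placeholders.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Stage 1: block-matching disparity. StereoBM returns fixed-point values
# scaled by 16, so divide to get disparities in pixels.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0

# Stage 2: convert disparity to depth with Z = f * tx2 / d.
f_px = 700.0      # focal length in pixels (assumed, not calibrated)
tx2_m = 0.1       # baseline in meters (assumed)
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = f_px * tx2_m / disparity[valid]
```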

The link is as follows:

Basic Approach for Depth Map from Stereo Images.

We know the basic fact that there are binocular disparities in stereo images: the closer an object is to the camera, the larger its position difference between the two images. (We assume the orientations of the two cameras are the same; they differ only in position.)

I took 2 images from my cell phone (really low quality):

I reused the code from my image mosaic project: it loads two images, runs ANMS, extracts the features, matches the features, and launches RANSAC (with a loose threshold, just to roughly throw out the ridiculous correspondences) to get the matches. This is the normalized map of the position differences:
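For reference, here is a rough OpenCV stand-in for that load / detect / describe / match / RANSAC pipeline; ORB and the ratio test replace our corner features and feat_match, and the file names, thresholds, and parameters are illustrative only:

```python
import cv2
import numpy as np

# Load the two phone images (placeholder file names).
img1 = cv2.imread("phone_left.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("phone_right.jpg", cv2.IMREAD_GRAYSCALE)

# Detect and describe features (ORB stands in for our corner descriptors).
orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matching with Lowe's ratio test to drop weak matches.
bf = cv2.BFMatcher(cv2.NORM_HAMMING)
matches = []
for pair in bf.knnMatch(des1, des2, k=2):
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        matches.append(pair[0])

# RANSAC with a loose threshold (3 px) to throw out ridiculous correspondences.
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0)

# Horizontal position differences of the surviving matches (a proxy for disparity).
inliers = mask.ravel().astype(bool)
dx = pts1[inliers, 0] - pts2[inliers, 0]
print("inlier matches:", int(inliers.sum()), "mean |dx| in pixels:", float(np.abs(dx).mean()))
```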

Then I realized that because of ANMS, we had included too many points with low corner response. So I ran it again without ANMS, and this time the result is a little bit better:

We are still in a trial-and-error phase with this first step. For this step we need the actual depth distance rather than the relative depth, so we plan to incorporate the focal lengths of the two cameras into our calculation.
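A hedged sketch of that calculation, assuming the pair is approximately rectified so that depth follows Z = f · baseline / disparity; the focal length and baseline numbers below are made up, and the function name is just for illustration:

```python
import numpy as np

def feature_depths(x_left, x_right, focal_px, baseline_m):
    """Per-feature metric depth from horizontal pixel disparities.

    x_left, x_right: x-coordinates of matched features in each image (pixels).
    focal_px: focal length in pixels (assumed known from calibration).
    baseline_m: distance between the two camera centers in meters (assumed).
    """
    disparity = np.asarray(x_left) - np.asarray(x_right)
    depth = np.full(disparity.shape, np.inf)
    nonzero = disparity != 0
    depth[nonzero] = focal_px * baseline_m / disparity[nonzero]
    return depth

# Example with made-up numbers: f = 700 px, baseline = 6.5 cm.
print(feature_depths([420.0, 315.0], [400.0, 310.0], 700.0, 0.065))
```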


Monday, November 28, 2011

CIS 581 Computer Vision | Final Project Proposal

1. Problem Description:

We want to implement a 3D reconstruction from motion project, which takes videos from two cameras with fixed relative position and orientation as input, outputs features with depth values, and finally reconstructs the 3D world. If possible, we also want to estimate the camera ego-motion and the motion field in the video from the stereo images we capture.

2. Related Work:

a) Feature detection and description.
b) Depth recognition from two images.
c) 3D Reconstruction from features with depth.
d) Ego-motion estimation from stereo images.
e) Motion field estimation from stereo images.

3. Milestones:

The time for our final project is not as long as we expected, so we created three milestones (stages) for the whole project. Stage I and Stage II are what we must implement during this semester, and Stage III is an extra stage that we might only implement partially. We would definitely like to polish this project over the winter break.
Stage I: (Preparation Stage)
a) Background reading: because we are going to implement Badino and Kanade’s paper “A Head-Wearable Short-Baseline Stereo System for the Simultaneous Estimation of Structure and Motion” [1], which describes a decent method for 3D reconstruction, camera movement estimation, and motion field estimation from stereo images, we need to read it carefully until we fully understand it.
b) Hardware preparation: for this project, we need at least two cameras with a fixed relative position and orientation (like human eyes). We want to use two webcams and duct-tape them to a plate. This might be the simplest way to get the hardware, but we still need to experiment to see whether the two cameras need to have the same focal length, resolution, and exposure.
c) Feature descriptor and feature match: at this stage, we can simply use corner features, run ANMS, extract the 40×40 neighborhood around each corner, blur it (Gaussian or geometric), subsample it to 8×8, and use this as the feature descriptor. We can also use RANSAC to match the features between two images. These are already done in our Project 3. We will keep the feat_desc and feat_match functions abstract so that we can swap in other feature descriptors in the future. (A rough sketch of this descriptor pipeline follows this list.)
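Here is a minimal sketch of that patch descriptor, assuming corner locations are already available; the blur kernel, normalization, and exact function shape are placeholders rather than the real Project 3 code:

```python
import cv2
import numpy as np

def feat_desc(gray, corners, patch=40, out=8):
    """Blurred, subsampled patch descriptor around each corner.

    gray: grayscale image as a 2D array.
    corners: list of (x, y) corner positions (e.g. from a Harris detector).
    Returns an (N, out*out) array of normalized descriptors.
    """
    half = patch // 2
    descriptors = []
    for x, y in corners:
        x, y = int(round(x)), int(round(y))
        if y - half < 0 or x - half < 0 or y + half > gray.shape[0] or x + half > gray.shape[1]:
            continue  # skip corners too close to the image border
        window = gray[y - half:y + half, x - half:x + half].astype(np.float32)
        window = cv2.GaussianBlur(window, (5, 5), 2.0)        # blur before subsampling
        small = cv2.resize(window, (out, out), interpolation=cv2.INTER_AREA)
        vec = small.flatten()
        vec = (vec - vec.mean()) / (vec.std() + 1e-8)         # bias/gain normalization
        descriptors.append(vec)
    return np.array(descriptors)
```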
Stage II: (Working Stage)
Everything we do here revolves around feature correspondence.
a) Calculate the depth map from the two images from the two “eye” cameras: this is the first step of the project. We need to get the feature descriptors from the two images, find the correspondences, and obtain the depth map from the binocular disparity.
b) Pure translational movement test: link all the images with depth maps from a pure translational motion into a 3D scene.
c) Pure rotational movement test: link all the images with depth maps from a pure rotational motion into a 3D scene.
d) Reconstruct the 3D scene from a sequence of images with depth maps.
Stage III: (Challenging Stage)
a) Ego-motion estimation: because we don’t have any inertial sensing devices, we need to estimate the camera movement from the visual input. (a1) We will try to find the position and orientation using the data from Stage II (d); (a2) we can interpolate the camera transformation between the key frames obtained in (a1).
b) Motion-field estimation: (b1) directly apply an optical flow algorithm to see whether we can get the motion field while keeping the camera still (a rough sketch follows this list); (b2) detect the motion field under camera movement, which requires knowing the camera velocity and all the object velocities in 3D.
c) Acceleration: the first step in accelerating this project is migrating the whole framework to C++ with OpenCV. The second step is to apply GPU acceleration to parallelize some key steps.
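For (b1), a minimal dense optical flow sketch with OpenCV; Farnebäck’s method is just one possible stand-in, and the frame file names are placeholders:

```python
import cv2
import numpy as np

# Two consecutive frames from a (for now) stationary camera; file names are placeholders.
prev_frame = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
next_frame = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Dense Farneback optical flow; the numeric arguments are common defaults
# (pyramid scale, levels, window size, iterations, poly_n, poly_sigma, flags).
flow = cv2.calcOpticalFlowFarneback(prev_frame, next_frame, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

# flow[y, x] = (dx, dy): per-pixel motion between the two frames.
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print("mean flow magnitude (pixels):", float(magnitude.mean()))
```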

4. Timeline:
Stage I should be finished within a week, i.e., before Thursday, Dec. 1, 2011.
Stage II will take the majority of our time; it will last for at least two weeks and should be mostly done before Friday, Dec. 16, 2011.
For now, we are still not sure what we can deliver by the final deadline, so we have divided our stages into many sub-stages. Our best hope is to finish all the work in Stages I, II, and III (a) before the due date, and we will likely polish our work over the winter break.

Reference:

[1] Badino, H. and Kanade, T. (2011). A Head-Wearable Short-Baseline Stereo System for the Simultaneous Estimation of Structure and Motion. 12th IAPR Conference on Machine Vision Applications.