Wednesday, December 7, 2011

How we get the 3D coordinates from the depth data.

With the disparity, we can get the distance to each feature based on the formulation we cited in our earlier post:
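As a hedged sketch of that conversion in MATLAB (the variable names are ours: 'fx' is the focal length in pixels from our calibration, 'b' is the baseline between the two cameras in meters that we measured by hand, and 'd' holds the horizontal disparity in pixels of each matched feature):

% Hedged sketch: depth from disparity for two rectified, aligned cameras.
% fx - focal length in pixels (from calibration)
% b  - baseline between the two cameras in meters (measured by hand)
% d  - horizontal disparity in pixels for each matched feature
d  = max(d, eps);        % guard against division by zero
zs = fx * b ./ d;        % depth along the camera z-axis, in meters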


Then we get the distance zs, which is the z-coordinate in the camera-centered real-world coordinate system. Based on this Z value, we can calculate the x- and y-coordinates with the following code:


% calculate x and y values (back-project pixel coordinates using the depth)
xs = -(xis - cx) ./ fx .* zs;
ys = -(yis - cy) ./ fy .* zs;

'fx' and 'fy' are the focal lengths along the x- and y-directions, respectively. 'cx' and 'cy' are the coordinates of the principal point, in pixels. 'xis' and 'yis' are the pixel coordinates of the feature points, and 'xs', 'ys' and 'zs' are the resulting camera-centered world coordinates.

Thus, we get the 3D coordinates of those feature points. As the next step, we would like to use RANSAC to estimate the 3D homography (transformation) between views.

Tuesday, December 6, 2011

Camera Alignment

For simplicity of use, we want our cameras aligned to the same direction.
The approach we tried first is to draw two points on a piece of paper, as shown in the figure:
The distance between the two points is exactly the distance between the two cameras.
Then we can open the Image Acquisition Toolbox in MATLAB and aim the center of the first camera at the first point:
Then aim the center of the second camera at the second point:
After both cameras are aimed at their respective points, they should be aligned (see figure).

Scripts that we wrote to acquire the images.

The first approach we thought about is to use the MATLAB Image Acquisition Toolbox (run imaqtool to open it).

It provides pretty good results for a single camera, but we need to get two images from the two cameras at the same time.
So I wrote a simple script to capture the images synchronously.
% Get the videoinput objects:
cam1 = videoinput('dcam', 1, 'Y422_800x600');
cam2 = videoinput('dcam', 2, 'Y422_800x600');
% Change the camera properties:
src = getselectedsource(cam1);
set(src, 'AutoExposure', 80);
% ... a lot of property adjustments here ...


% Start capturing:
preview(cam1);
preview(cam2);

img_pairs = {};                 % collected image pairs
index = 1;
while true
    key = input('x = exit acquisition, enter = acquire a pair of frames: ', 's');
    if strcmp(key, 'x')
        break;
    else
        im1 = getsnapshot(cam1);
        im2 = getsnapshot(cam2);
        im1 = ycbcr2rgb(im1);   % the Y422 frames come back as YCbCr
        im2 = ycbcr2rgb(im2);
        img_pair = {im1, im2};
        img_pairs{index} = img_pair;
        index = index + 1;
    end
end
stoppreview(cam1);
stoppreview(cam2);
closepreview(cam1);
closepreview(cam2);

This code takes keyboard input: it grabs a pair of frames each time I hit Enter and exits when I type 'x'.
If I want continuous frames as a video, I can simply comment out the keyboard-input line and let it acquire a pair of images on every loop iteration.

Monday, December 5, 2011

Camera Calibration with the help of a MATLAB toolbox

In order to get an accurate disparity-to-depth conversion, we need to know the intrinsic parameters of our cameras, such as the focal length and skew coefficient. So we use the Caltech Camera Calibration Toolbox to obtain those parameters for the two cameras separately.

We take around 200 pictures of the checkerboard and pick 12 of them for the calibration of each camera, and then follow the instructions on the website to obtain the parameters.


Picture 1: calibration images
Picture 2: extracting the grid corners

And here is the result of calibration:

Calibration results (with uncertainties):
Focal Length:          fc = [ 1762.62339   1771.13354 ] ± [ 33.53921   33.64017 ]
Principal point:       cc = [ 657.10654   494.14503 ] ± [ 45.24948   41.64433 ]
Skew:             alpha_c = [ 0.00000 ] ± [ 0.00000  ]   => angle of pixel axes = 90.00000 ± 0.00000 degrees
Distortion:            kc = [ -0.24440   0.25448   0.00050   -0.00360  0.00000 ] ± [ 0.06230   0.19754   0.00330   0.00715  0.00000 ]
Pixel error:          err = [ 0.25417   0.35936 ]
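To feed these numbers into the back-projection code from the Dec. 7 post, a minimal sketch in MATLAB is shown below. It assumes the toolbox saved its usual Calib_Results.mat file with variables fc, cc and alpha_c; adjust the file and variable names if your setup differs.

% Minimal sketch: load the saved intrinsics and assemble the 3x3 camera matrix.
load('Calib_Results.mat', 'fc', 'cc', 'alpha_c');
K = [fc(1)  alpha_c*fc(1)  cc(1);
     0      fc(2)          cc(2);
     0      0              1    ];
fx = fc(1); fy = fc(2);          % focal lengths in pixels
cx = cc(1); cy = cc(2);          % principal point in pixels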

The link to the calibration toolbox website is as follows:
http://www.vision.caltech.edu/bouguetj/calib_doc/htmls/example.html

We would also like to use the extrinsic parameters (such as rotation and translation) from the toolbox to help us adjust the relative pose between our cameras.

Wednesday, November 30, 2011

Formulation for two-view depth estimation

Here is the formulation we use to get the depth estimation from the disparity data.

We now describe the geometry that defines the relationship between two corresponding pixels and the depth of a 3D point. Let us consider a 3D Euclidean point (X, Y, Z)^T captured by two cameras and the two corresponding projected pixels p1 and p2. Using the camera parameters (see Equation (2.18)), the pixel positions p1 = (x1, y1, 1)^T and p2 = (x2, y2, 1)^T can be written as
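In standard pinhole-camera notation (our own sketch, with $K_i$ the intrinsic matrix, $R_i$ the rotation, $C_i$ the center of camera $i$, and $\lambda_i$ the projective depth):

$$\lambda_i \, p_i = K_i R_i \big( (X, Y, Z)^T - C_i \big), \qquad i \in \{1, 2\}.$$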


Figure 3.2: Two aligned cameras capturing rectified images can be employed to estimate depth using triangulation.

The previous relation can be simplified by considering the restricted case of two rectified views, so that the following assumptions can be made. First, without loss of generality, the world coordinate system is selected such that it coincides with the coordinate system of camera 1 (see Figure 3.2). In such a case, it can be deduced that C1 = 0_3 and R1 = I_{3×3}. Second, because the images are rectified, both rotation matrices are equal: R1 = R2 = I_{3×3}. Third, camera 2 is located on the X axis: C2 = (t_x2, 0, 0)^T. Fourth and finally, both cameras are identical, so that the internal camera parameter matrices are equal, leading to
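Writing this out (our sketch of the simplified relation; the source numbers it Equation (3.4)):

$$p_1 = \frac{1}{Z} K \begin{pmatrix} X \\ Y \\ Z \end{pmatrix}, \qquad p_2 = \frac{1}{Z} K \begin{pmatrix} X - t_{x2} \\ Y \\ Z \end{pmatrix} \quad\Longrightarrow\quad x_1 - x_2 = \frac{f \, t_{x2}}{Z}, \quad y_1 = y_2.$$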

Equation (3.4) provides the relationship between two corresponding pixels and the depth Z of the 3D point (for the simplified case of two rectified views). The quantity f·t_x2/Z is typically called the disparity. Practically, the disparity corresponds to the parallax motion of objects. Parallax motion is the motion of objects that are observed from a moving viewpoint. This can be illustrated by the example of a viewer sitting in a moving train: the motion parallax of the foreground grass along the train tracks is higher than that of a tree far away in the background.
It can be noted that the disparity is inversely proportional to the depth, so that a small disparity value corresponds to a large depth distance. To emphasize the difference between both quantities, we indicate that the following two terms will be used distinctively in this thesis.
Disparity image/map: an image that stores the disparity values of all pixels.
Depth image/map: an image that represents the depth of all pixels.
The reason that we emphasize the difference so explicitly is that we are going to exploit this difference in the sequel of this thesis. Typically, a depth image is estimated by first calculating a disparity image using two rectified images and afterwards converting this disparity into depth values. In the second part of this chapter, we show that such a two-stage computation can be circumvented by directly estimating the depth using an alternative technique based on an appropriate geometric formulation of the framework.

The link is as follows:

Basic Approach for Depth Map from Stereo Images.

We have the basic knowledge that there are binocular disparities in stereo images, which means that if an object is closer to the cameras, its position difference between the two images will be larger. (Assume the orientations of the two cameras are the same and they differ only in position.)

I took 2 images from my cell phone (really low quality):

I reused the code from my image mosaic project: it loads two images, runs ANMS, extracts the features, matches the features, and launches RANSAC (with a loose threshold, just to roughly throw out the ridiculous matches) to get the final match. This is the normalized map of the position differences:
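For reference, the reused pipeline roughly looks like the sketch below. The helper names and signatures (corner_detector, anms, feat_desc, feat_match, ransac_est_homography) follow how our project-3 code is organized and are assumptions here, not a fixed API.

% Hedged sketch of the feature-matching pipeline described above.
im1 = rgb2gray(im2double(imread('left.jpg')));
im2 = rgb2gray(im2double(imread('right.jpg')));

cimg1 = corner_detector(im1);          % corner response maps
cimg2 = corner_detector(im2);
[x1, y1, ~] = anms(cimg1, 500);        % keep 500 strong, well-spread corners
[x2, y2, ~] = anms(cimg2, 500);

d1 = feat_desc(im1, x1, y1);           % 8x8 subsampled patch descriptors
d2 = feat_desc(im2, x2, y2);
m  = feat_match(d1, d2);               % index of the best match in image 2, -1 if none

ok = m > 0;
X1 = x1(ok);  Y1 = y1(ok);  X2 = x2(m(ok));  Y2 = y2(m(ok));
[~, inlier_ind] = ransac_est_homography(X1, Y1, X2, Y2, 10);   % loose threshold

dx = X1(inlier_ind) - X2(inlier_ind);  % per-feature horizontal position difference
dmap = (dx - min(dx)) ./ (max(dx) - min(dx) + eps);            % normalized disparity values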

Then I realized that, because of ANMS, we had included too many points with low corner response. So I ran it again without ANMS, and this time the result is a little bit better:

We are still in a trial-and-error phase for this first step. In this step, we need to know the exact depth rather than the relative depth, so we plan to take the focal lengths of the two cameras into account in our calculation.


Monday, November 28, 2011

CIS 581 Computer Vision | Final Project Proposal

1. Problem Description:

We want to implement a 3D-reconstruction-from-motion project, which takes the videos from two cameras with fixed relative position and orientation as input, outputs the features with depth values, and finally reconstructs the 3D world. If possible, we also want to estimate the camera ego-motion and the motion field in the video from the stereo images we create.

2. Related Work:

a) Feature detection and description.
b) Depth recognition from two images.
c) 3D Reconstruction from features with depth.
d) Ego-motion estimation from stereo images.
e) Motion field estimation from stereo images.

3. Milestones:

The time for our final project is not as long as we expected, so we created 3 milestones (stages) for the whole project. Stage I and Stage II are what we must implement during this semester; Stage III is an extra stage which we might only implement partially. We would definitely polish this project during the winter break.
Stage I: (Preparation Stage)
a) Background reading: because we are going to implement Badino and Kanade’s paper “A Head-Wearable Short-Baseline Stereo System for the Simultaneous Estimation of Structure and Motion” [1], which describes a decent method for 3D reconstruction, camera movement estimation, and motion field estimation from stereo images, we need to read it carefully until we fully understand it.
b) Hardware preparation: for this project, we need at least two cameras with fixed relative position and orientation (like human eyes). Thus we want to use two webcams and duct-tape them to a plate. This might be the simplest way to get the hardware, but we still need to experiment to see whether the two cameras need to have the same focal length, resolution, and exposure.
c) Feature descriptor and feature matching: at this stage, we can simply use corner features: run ANMS, extract the 40*40 neighborhood, blur it (Gaussian or geometric), subsample it to 8*8, and use this as the feature descriptor of an image (see the sketch after this list). We can also use a RANSAC model to match the features in two images. These are already done in our Project 3. We will leave the feat_desc and feat_match functions as virtual, so we can swap in other feature descriptors in the future.
Stage II: (Working Stage)
All the stuff we are doing here revolves around feature correspondence.
a) Calculate the depth map from the 2 images from the two “eye” cameras: this is the first step we should take in this project. We need to get the feature descriptors from the two images, find the correspondences, and get the depth map from the binocular disparity.
b) Pure translational movement test: link all the images with depth maps from a pure translational motion into a 3D scene.
c) Pure rotational movement test: link all the images with depth maps from a pure rotational motion into a 3D scene.
d) Reconstruct the 3D scene from a sequence of images with depth maps.
Stage III: (Challenging Stage)
a) Ego-motion estimation: because we don’t have any inertial sensors, we need to estimate the camera movement from the vision input. (a1) We will try to find the position and orientation using the data from Stage II (d); (a2) we can interpolate the camera transformation between the key-frames we got in (a1).
b) Motion-field estimation: (b1) directly apply an optical flow algorithm to see whether we can get the motion field while keeping our camera still; (b2) detect the motion field while the camera is moving, which requires knowing the camera velocity and all the object velocities in 3D.
c) Acceleration: the first step we want to take to accelerate this project is migrating the whole framework to C++ with OpenCV. The second step is trying to apply GPU acceleration to enable parallelized computation in some key steps.
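As referenced in Stage I (c), here is a minimal sketch of how such a descriptor could be written in MATLAB. The function name feat_desc matches our project-3 convention, but the patch handling and normalization details are assumptions, not a finished implementation.

% Hedged sketch of feat_desc: blur, take a 40x40 window around each corner,
% subsample it to 8x8, and normalize. Assumes a grayscale image I and corner
% coordinates xs, ys (in pixels).
function descs = feat_desc(I, xs, ys)
    G = fspecial('gaussian', 7, 2);               % Gaussian blur before subsampling
    Ib = imfilter(im2double(I), G, 'replicate');
    Ip = padarray(Ib, [20 20], 'replicate');      % so windows near the border fit
    descs = zeros(64, numel(xs));
    for k = 1:numel(xs)
        r = round(ys(k)) + 20;                    % row/col in the padded image
        c = round(xs(k)) + 20;
        patch = Ip(r-19:r+20, c-19:c+20);         % 40x40 neighborhood
        sub = patch(3:5:end, 3:5:end);            % subsample to 8x8
        v = sub(:);
        descs(:, k) = (v - mean(v)) / (std(v) + eps);   % zero mean, unit variance
    end
end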

4. Timeline:
Stage I should be finished within a week, ending before Thu. Dec. 1st, 2011.
Stage II will take the majority of our time; it will last at least 2 weeks and should be mostly done before Fri. Dec. 16th, 2011.
For now, we still have doubts about what we can deliver by the final deadline, so we divided our stages into many sub-stages. Our best hope is to finish all the work in Stages I, II, and III (a) before the due date, and we will likely polish our work during the winter break.

Reference:

[1] Badino, H. and Kanade, T. (2011). A Head-Wearable Short-Baseline Stereo System for the Simultaneous Estimation of Structure and Motion. 12th IAPR Conference on Machine Vision Applications.