Tuesday, May 31, 2011

Conjugate Gradient Part 1





Since my project is winding down fast, I thought I'd just expand more on the math behind the scenes.


                The conjugate gradient method is an iterative algorithm that can solve equations of the form Ax = b, or be stopped half-way through to get a good estimate for x. It only works if A is square and symmetric positive-definite. It is only practical if A is sparse, so multiplications by A are cheap to compute.

                Let's start with a quick definition:

Def: u and v are conjugate w.r.t. A if u^T A v = 0. Basically, u and Av are orthogonal.

                The problem is solving Ax = b. x can be broken down into a linear combination of a basis of R^n called {d_k}, or 'directions', that are all conjugate w.r.t. A. Whenever we talk about linear combinations, I like to think of each basis vector as a direction. Bear with me for a second while I gloss over why it is cool if they're conjugate. Now take the equations:

x = Σ_i α_i d_i        (by the linear-combination definition)

b = Ax = Σ_i α_i A d_i        (by the problem definition)

                This isn't terribly interesting; it is just rewriting Ax as A times a completely arbitrary decomposition of x into {d_k}. But imagine we knew all the directions in {d_k} and we just wanted to solve for the weights α_i. Normally, this wouldn't tell us anything: I can break x in R^3 into α_1[1; 0; 0], α_2[0; 1; 0], α_3[0; 0; 1], and I still don't have a good way to find the α's and x. But if my basis vectors are conjugate, I can use the following equations:


b = Ax = Σ_i α_i A d_i        (see above)

d_k^T b = Σ_i α_i d_k^T A d_i        (multiply both sides by d_k^T; note that we know the {d_i}, so the only unknowns are the α_i's)

d_k^T b = Σ_i α_i d_k^T A d_i = α_k d_k^T A d_k        (by the fact that d_i^T A d_k = 0 if i != k, since the directions are conjugate)

α_k = d_k^T b / (d_k^T A d_k)


 

               

So now we have an easy-to-solve equation for all the weights that make Ax = b true. The direct algorithm breaks into steps: find conjugate directions w.r.t. A, efficiently solve for the α's, and add the weighted directions together to get x. In practice, all these steps can be intermixed in an iterative algorithm that updates x after every iteration.
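
As a quick sanity check of that weight formula (this is my own toy example, not part of the algorithm proper): the eigenvectors of a symmetric A are automatically conjugate w.r.t. A, so they can stand in for the directions d_k.

% Verify that alpha_k = d_k^T b / (d_k^T A d_k) recovers x when the d_k are conjugate.
A = [4 1; 1 3];            % small symmetric positive-definite matrix
b = [1; 2];
[V, ~] = eig(A);           % columns of V are orthogonal eigenvectors, hence A-conjugate
x = zeros(2, 1);
for k = 1:2
    d = V(:, k);
    alpha = (d'*b) / (d'*A*d);   % the weight formula derived above
    x = x + alpha*d;
end
disp(norm(A*x - b))        % should print something ~0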

Conjugate Gradient Algorithm:

for k = 1:n

                Find direction d_k

                Find α_k

                x = x + α_k d_k

But how do we find d_k? One can imagine that there are many different ways to compute a set of conjugate directions, but some may combine to form x in fewer iterations than others. This is where different methods appear, such as MINRES, GMRES, and the one I'm concerned with: conjugate gradient. In fact, the manner of selecting d_k is where the conjugate gradient gets its name! Imagine we have an energy function E(x) = ½ x^T A x − x^T b + c. Then the gradient (derivative) is E'(x) = Ax − b, and we have a minimum of E(x) where 0 = b − Ax, or b = Ax (basic calculus). If you were writing an iterative algorithm that finds the minimum of E(x) without solving directly for x, you might just greedily follow the negative gradient until you hit the minimum, which is just b − Ax, where x is your current guess. Take note that b − Ax is both the (negative) gradient and the residual in this special case, and the terms get intermixed frequently. If we just followed this greedy rule, we'd be doing a steepest descent algorithm. Unfortunately, the directions it picks aren't mutually conjugate, and this leads to inefficiency.


We'll borrow the idea behind steepest descent to pick our first direction; that is, just say x = 0, plug it into b − Ax, and get b as our first direction. For the rest of the directions, make each new one conjugate to all the previous ones with the following rules:

d_0 = b

d_k = r_{k-1} − Σ_{i<k} ( (d_i^T A r_{k-1}) / (d_i^T A d_i) ) d_i

Basically, d_k is the previous residual (derivative) orthogonalized (or rather, 'conjugatilized', because we multiply by A) against all d_i for i < k. This should feel a lot like Gram-Schmidt orthogonalization. Here's an updated outline of the algorithm:

Conjugate Gradient Algorithm

for k = 1:n

                Find direction d_k

                                for i = 1:k-1

                                                'conjugatilize' d_k with d_i

                Find α_k

                x = x + α_k d_k

 

So as a recap, now we have a good heuristic way of picking directions (dk) such that we can exploit their conjugate property and solve for their weights (αk).
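
Before moving on to the optimized form, here is a rough MATLAB sketch of the outline above. This is my own naive transcription; the 2x2 test matrix and the variable names are just placeholders.

% Naive conjugate-direction solver: conjugate the current residual against
% every previous direction (Gram-Schmidt style), then take the weighted step.
A = [4 1; 1 3];                        % small symmetric positive-definite test matrix
b = [1; 2];
n = length(b);
x = zeros(n, 1);
D = zeros(n, n);                       % columns hold the directions d_k
for k = 1:n
    r = b - A*x;                       % residual = negative gradient of the energy
    if norm(r) < 1e-12, break; end
    d = r;
    for i = 1:k-1                      % 'conjugatilize' d_k with each previous d_i
        d = d - ((D(:,i)'*A*r) / (D(:,i)'*A*D(:,i))) * D(:,i);
    end
    D(:,k) = d;
    alpha = (d'*b) / (d'*A*d);         % alpha_k = d_k^T b / (d_k^T A d_k), as derived above
    x = x + alpha*d;
end
disp(norm(A*x - b))                    % ~0 after n steps

The inner loop and the stored matrix of directions D are exactly what the common form below optimizes away.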

The algorithm listed above captures all the intuition of conjugate gradient, but doesn’t look anything like what is actually implemented. The reason is that the inner for-loop can be optimized away along with the requirement to store all previous bases. I might expand on
this idea in a later post.

This is the more common form:

Conjugate Gradient Algorithm

r_0 = b − A x_0,    d_0 = r_0

for k = 0:n-1

                α_k = (r_k^T r_k) / (d_k^T A d_k)

                x_{k+1} = x_k + α_k d_k

                r_{k+1} = r_k − α_k A d_k

                β_k = (r_{k+1}^T r_{k+1}) / (r_k^T r_k)

                d_{k+1} = r_{k+1} + β_k d_k
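
Here is that loop as runnable MATLAB (again my own sketch, reusing the same toy A and b as before):

% Standard conjugate gradient in its common form.
A = [4 1; 1 3];
b = [1; 2];
x = zeros(size(b));
r = b - A*x;                     % initial residual
d = r;                           % first direction is just the residual
for k = 1:length(b)
    Ad = A*d;
    alpha = (r'*r) / (d'*Ad);    % step length along d
    x = x + alpha*d;
    r_new = r - alpha*Ad;        % cheap residual update, no extra b - A*x
    beta = (r_new'*r_new) / (r'*r);
    d = r_new + beta*d;          % next direction, conjugate to the previous ones
    r = r_new;
end
disp(norm(A*x - b))              % ~0 after n iterations in exact arithmetic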

 

 

Gradient Descent

                Given an energy function of the form E(x) = ½ x^T A x − x^T b + c, our task is to minimize E(x) without directly solving for x because it would be too expensive.
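
For comparison with conjugate gradient above, here is a minimal steepest-descent sketch for this energy (my own toy example; the exact line-search step length α = r^T r / (r^T A r) is the standard choice for a quadratic, not anything specific to my project):

% Steepest descent on E(x) = 0.5*x'*A*x - x'*b + c: repeatedly step along the
% negative gradient r = b - A*x with an exact line search.
A = [4 1; 1 3];
b = [1; 2];
x = zeros(size(b));
for k = 1:100
    r = b - A*x;                 % negative gradient (also the residual)
    if norm(r) < 1e-10, break; end
    alpha = (r'*r) / (r'*A*r);   % exact line search along r
    x = x + alpha*r;
end
disp(norm(A*x - b))

Unlike conjugate gradient, nothing stops this from re-visiting old directions, which is the inefficiency mentioned earlier.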

 

Sunday, May 22, 2011

Robust Estimation of Multiple Motions


I've made some headway in my effort to understand how Ce Liu's algorithm works before I start rewriting portions of it. The appendix of his thesis cites a paper I read named "The Robust Estimation of Multiple Motions: Parametric and Piecewise-Smooth Flow Fields" by Black and Anandan, and now I understand a lot more of the code. Here's my summary:
In optical flow, we can use brightness constancy to say the following:
Error(u, v) = Σ ( I(x, y, t) − I(x + u·dt, y + v·dt, t + dt) )^2
That is, the error is a function of the displacement components u and v, and we can minimize it to find them. We don't square the term because the square is important to our model; it is largely for convenience's sake. With the square, we can take the derivative of the formula, get a linear equation, and solve for where the derivative, Error'(u, v), equals 0. Since the equation is convex (one local minimum), we know we've minimized the error. This is all pretty straightforward stuff I've worked on before.
However, there's a catch. This method only works if the data fits the model 100%. For instance, if brightness constancy doesn't hold because of a depth discontinuity or multiple motions in the same window, then we'll have an outlying point whose effect gets squared. The problem is similar to means and medians: the mean is sensitive to being skewed by an extreme outlier while the median is not. It would be nice if we had a median-like function that is still easy to solve for u and v.
The paper's fix is to swap the quadratic penalty for a robust error norm. In their notation, ρ is the error norm and Ψ is the influence function (proportional to the derivative of ρ). Obviously, care should be taken when choosing an error norm, after considering issues of convexity and rate of convergence.
                In Ce Liu's formulation (the code I'm trying to understand), he's using

E(u, v) = ∫ [ ψ(|I(p + w) − I(p)|^2) + α φ(|∇u|^2 + |∇v|^2) ] dp

ψ(x) = φ(x) = √(x^2 + ε^2),  ε = 0.001
                Interestingly, the resulting energy is very non-convex, but Liu points out in his thesis that a dense Gaussian pyramid (coarse-to-fine) can overcome bad local minima.
                These extra functions have the effect of weighing the continuous gradient constraint (|∇u|^2 + |∇v|^2 = 0) against the constant brightness constraint (I(p + w) − I(p) = 0). Once all the math is worked out, it looks like each constraint is weighted inversely proportional to the amount of perceived error it has. This makes a lot of sense: suppress a constraint wherever that part of the model isn't holding up.
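
To get a feel for why a penalty like √(x^2 + ε^2) acts more like a median than a mean, here's a toy example (mine, not from the paper or from Liu's code): estimate a single constant from data containing one gross outlier, using iteratively reweighted least squares. The weights that fall out are exactly the "inversely proportional to perceived error" behavior described above.

% Toy robust estimation: fit a constant c to data with an outlier by
% minimizing sum(sqrt((y - c).^2 + eps2)) via iteratively reweighted
% least squares, then compare against the mean and the median.
y = [2.0 2.1 1.9 2.05 1.95 50.0]';    % last point is a gross outlier
eps2 = 0.001^2;
c = mean(y);                          % start from the least-squares (quadratic) answer
for k = 1:50
    w = 1 ./ sqrt((y - c).^2 + eps2); % small residual -> large weight, and vice versa
    c = sum(w .* y) / sum(w);         % weighted least-squares update
end
fprintf('mean = %.3f, median = %.3f, robust = %.3f\n', mean(y), median(y), c);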

Thursday, May 19, 2011

Gauss Seidel






Building on my post 'Iterative Approaches to Solving Large Systems of Equations', I'm going to break the matrix A in Ax = b into a lower-triangular part and a strictly upper-triangular part. So [2 -1; -1 2] will be broken into D = [2 0; -1 2] and R = [0 -1; 0 0]. This technique is a little sneakier than the Jacobi method, in that you never need to take the inverse of D. In other words, we're not decomposing the matrix this way because D is easily invertible. Instead, the equation D x_{k+1} + R x_k = b is easily solved by doing something that looks a lot like back-substitution (forward substitution, really, since D is lower triangular). For instance, in the example above, we could write out [2 0; -1 2] x_{k+1} + [0 -1; 0 0] x_k = b, or [2 0; -1 2] x_{k+1} = [0 1; 0 0] x_k + b. This equation has the nice property that the first entry of x_{k+1}, let's call it x_{k+1}(1) to keep our bastardized Matlab syntax, can be solved instantly, just like in back-substitution: 2*x_{k+1}(1) = x_k(2) + b(1) in our case. Then that result can be used to solve the next equation, and so on and so forth. I've written some code to do this:

% Finds a solution for Ax = b with Gauss-Seidel sweeps.
A = [12 3 -5; 1 5 3; 3 7 13];
b = [1; 28; 76];

syms x1 x2 x3;
initial_x = [1 0 1];
curr_x = initial_x;
next_x = zeros(1,3);
errors = zeros(1,3);
% Iteratively update your estimate of the entries of x.
% Note that curr_x and next_x could be the same array; I separate them
%    just so I can display the error.
for k=1:6,
    next_x(1) = double(solve(A(1,:)*[x1; curr_x(2); curr_x(3)] - b(1)));
    next_x(2) = double(solve(A(2,:)*[next_x(1); x2; curr_x(3)] - b(2)));
    next_x(3) = double(solve(A(3,:)*[next_x(1); next_x(2); x3] - b(3)));
    errors = abs((next_x - curr_x) ./ next_x)*100;
    disp(['Relative errors (%)  '  num2str(errors)]);
    curr_x = next_x;
end
curr_x

 

The important thing is how I solve the first equation with curr_x but then solve the second equation with next_x: I don't have to wait until all of next_x is solved before I can start using its improved estimates. Although I don't understand the proof for it, apparently this method converges about twice as fast as Jacobi, and by sharing one array for curr_x and next_x, it is also more space efficient.

As an aside, I wrote this Matlab code with the Symbolic Toolbox and its 'solve' function rather than just reorganizing the linear equations to solve for the one unknown. This is MUCH slower for what should just be a series of adds and a divide, but it makes the code match the equation. I have a lot of trouble understanding code when algebra and optimizations distort the idea to the point where it doesn't match the paper/book at all. Apparently Mathematica is smarter about this stuff than Matlab is, so I'll add that to my growing list of reasons to switch.
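
For reference, here is what the same sweep looks like with the algebra done by hand (my own version of the "series of adds and a divide", using the same A, b, and initial guess as above):

% Gauss-Seidel without the Symbolic Toolbox: solve row i for x(i) directly,
% reusing the freshest values of the other entries as they become available.
A = [12 3 -5; 1 5 3; 3 7 13];
b = [1; 28; 76];
x = [1; 0; 1];                            % same initial guess as above
for k = 1:6
    for i = 1:3
        others = A(i,:)*x - A(i,i)*x(i);  % contribution of the other entries of x
        x(i) = (b(i) - others) / A(i,i);  % solve row i for x(i), in place
    end
end
x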

As another aside, I think it's kind of interesting that you can iteratively compute the residual, b − A x_k, efficiently and see how far you are from a solution. I think it's kind of mind-bending that you can be cranking away at a problem and know the entire time how far away you are from the real answer. I guess the residual isn't as informative as the error x − x_k, but it's just a matrix multiply away!

Tuesday, May 17, 2011

Current Direction

Last week I experimented with a half-dozen algorithms for doing dense correspondence matching and finally settled on this:

http://people.csail.mit.edu/celiu/OpticalFlow/

Mr. Liu uses an algorithm that combines minimizing local energy functions from Lucas-Kanade and a global energy function from Horn-Schunck to get dense flow that is resilient to noise. I tried similar algorithms earlier on, but I suppose they were just of inferior quality, because this one is doing extremely well.

[Figure: left image / right image warped toward it]


My one problem moving forward is still speed. It will take a few minutes to fully register a set of images, which is OK, but a little annoying with very large databases. If I can constrain the objective function in Liu's algorithm with calibration data, I can get a major performance boost. Liu is using some pretty heavy mathematics that I'll have to spend some time learning if I'm going to be able to make changes confidently.

Iterative Approaches to Solving Large Systems of Equations




Here’s the problem: you have a system of linear equations, Ax = b, that is too large to solve with Gaussian elimination or whatever real linear algebra toolboxes use. Instead, it would be nice to have an algorithm that you can run for a certain amount of time, stop, and get a result that is a good approximation. There’s a class of algorithms called “Iterative Methods” which do just that, and I want to explain the idea behind them for my own sake.


                The key to iterative methods is splitting A into two parts: D and R (or S and T, or P and P − A; nobody uses the same names for these matrices). Now we have the formula D x + R x = b, where D + R = A. As a small leap, separate and subscript the x's in the equation as D x_a + R x_b = b. We could start guessing values for x_b and solving for x_a, tweaking x_b until both x's are equivalent, but that seems like an ass-backwards way of solving linear equations. But hell, let's give it a shot and make our lives easier by solving for x_a:

                D x_a + R x_b = b

                x_a = D^{-1} (b − R x_b)        (by matrix arithmetic)

Now that we have our formula all spelled out, we notice that there's a really computationally expensive inversion in it, so whatever our separation of D and R is, we want D to be easily invertible (e.g. diagonal or triangular).

                The key to this algorithm being practical is that there are choices of D and R with the property that x_a is a better 'guess' than x_b, so if you iteratively chain the updates together, you get x_a's that approach the real x:

                for k = 1:n,    x_{k+1} = D^{-1}(b − R x_k),    end

What we're seeing is convergence in our guesses. The property you're looking for in your choice of matrices is that the largest absolute eigenvalue (the spectral radius) of D^{-1}R is less than 1.

Proof: First I'm going to derive an error function, e, such that e(k) tells us how far away x_k (x at iteration k) is from the real x by returning x − x_k. Then I'll show that the error goes to zero if the spectral radius of D^{-1}R is less than 1.

                Original formula:    D x + R x = b

                Iterative formula:    D x_{k+1} + R x_k = b

                (Original formula) − (Iterative formula):    D(x − x_{k+1}) + R(x − x_k) = 0

                so (x − x_{k+1}) = −D^{-1} R (x − x_k), or e_{k+1} = −D^{-1} R e_k    (by algebra and substitution of terms; the sign doesn't affect the size of the error)

                e(k) = e_k = (−D^{-1} R)^k e_0    (by recursively chaining the updates together)

D^{-1} R = Q Λ Q^{-1}, where Q is a matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues    (by eigendecomposition)

                (D^{-1} R)^k = (Q Λ Q^{-1})^k = Q Λ^k Q^{-1}    (by cancellation of Q^{-1} Q)

In this form, it is pretty clear that we want the entries of Λ (the eigenvalues) to be less than 1 in absolute value, so that Λ^k, and with it the error, goes to zero as k increases.

So to recap, if the spectral radius of (D^{-1}R) is less than 1, we have an efficient way of converging on a correct solution for Ax = b. But how do we select D and R?

Here is where the technique diverges:

                Jacobi's Method: D is the diagonal of A.

                Gauss-Seidel Method: D is the lower-triangular part of A (including the diagonal).

                Successive Overrelaxation (SOR): a weighted blend of the previous iterate and the Gauss-Seidel update, controlled by a relaxation parameter ω.

I’ve included the Matlab code I wrote for Jacobi’s Method:

A = [2 -1; -1 2];
b = [2; 2]; % Solution to this linear equation is [2; 2]
S = diag(diag(A)); % Jacobi splitting: S is just the diagonal of A
T = S - A;
S_inv = (1./S); % easy to calculate inverse, that's the point
S_inv(isinf(S_inv)) = 0;
M = S_inv*T;
curr_x = zeros(2,1); next_x = zeros(2,1);
for k=1:10,
    next_x = M*curr_x + S_inv*b;   % x_{k+1} = S^{-1}(T*x_k + b)
    curr_x = next_x;
end
curr_x

 

I was expecting [2; 2] and after 10 iterations I got [1.998; 1.998]. Good stuff! As an aside, you can look at the iteration matrix (S_inv*T above) and tell from its eigenvalues (if they're all less than 1 in magnitude) whether or not the iteration will converge. Here's a graph of the residuals (the spectral radius here is 1/2, which predicts the 50% decline per iteration).
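
The eigenvalue check itself is a one-liner (my own addition, using the variables from the code above):

% Spectral radius of the Jacobi iteration matrix: < 1 means the iteration
% converges, and its magnitude predicts the per-iteration error decay.
rho = max(abs(eig(S_inv*T)))   % 0.5 for the A above, hence the ~50% decline per step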

I also wrote some code for Gauss-Seidel, but I think you’ve got the idea.

Tuesday, May 10, 2011

Trying different transforms



So I tried different transformations based on the correspondences I got last week. The first I tried was an elastic transformation method:

[Figure: left image / right image / absolute difference]

And I followed it up with a thin plate spline method:

[Figure: left image / right image / absolute difference]

Pretty good, but not good enough for scientific purposes.

Tuesday, May 3, 2011

Correspondence Matching Results


It may be hard to see, but these are the correspondences found by the Okutomi algorithm. They're not 100% accurate, but they're pretty close, and they can be improved by expanding the template-matching window size.

Unfortunately, the algorithm is pretty slow. I can find these 100 correspondences in 5-40 seconds (depending on the locality of the points), so finding a dense mapping (~800x800 points) is infeasible. I've tried all the optimizations short of compiler options (which never give the night-and-day differences you'd hope for). I'm considering shrinking the image.

The biggest challenges ahead are the transformation and the performance. These problems are pretty tightly related, so I'll start with the transformation. A homography does terribly (obviously...), but I picked up some literature on the subject at the library today: http://www.imgfsr.com/ifsr_book.html, and downloaded some sample code: http://biocomp.cnb.csic.es/~iarganda/bUnwarpJ/ (two camera view and in java). I'm currently wrapping my head around non-rigid transformations so I can weigh my options and decide where to go next.