Tuesday, May 31, 2011

Conjugate Gradient Part 1





Since my project is winding down fast, I thought I'd just expand more on the math behind the scenes.


                The conjugate gradient method is an iterative algorithm for solving equations of the form Ax = b; it can also be stopped partway through to get a good estimate of x. It only works when A is square and symmetric positive-definite, and it is only practical when A is sparse, so that multiplications by A are cheap to compute.

                Let's start with a quick definition:

Def: u and v are conjugate w.r.t. A if u^T A v = 0. Basically, u and Av are orthogonal.

                The problem is solving Ax = b. x can be broken down into a linear combination of a basis of R^n called {d_k}, or 'directions', that are all conjugate w.r.t. A. Whenever we talk about linear combinations, I like to think of each basis vector as a direction. Bear with me for a second while I gloss over why it is useful if they're conjugate. Now take the equations:

x = ∑_i α_i d_i        (by linear combination def)

b = Ax = ∑_i α_i A d_i        (by problem definition)

                This isn't terribly interesting; it is just rewriting Ax as A times a completely arbitrary decomposition of x into {d_k}. But imagine we knew all the directions in {d_k} and we just wanted to solve for the weights α. Normally, this wouldn't tell us anything: I can break x in R^3 into α_1[1; 0; 0] + α_2[0; 1; 0] + α_3[0; 0; 1], and I still don't have a good way to find the α's and x. But if my bases are conjugate, I can use the following derivation:


b = Ax = ∑_i α_i A d_i        (see above)

d_k^T b = ∑_i α_i d_k^T A d_i        (multiply both sides by d_k^T; note that I know {d_i}, so the only unknowns are the α_i's)

d_k^T b = ∑_i α_i d_k^T A d_i = α_k d_k^T A d_k        (because d_i^T A d_k = 0 if i != k, by the conjugacy of the bases)

α_k = (d_k^T b) / (d_k^T A d_k)


 

               

So now we have an easy-to-solve equation for all the weights that make Ax = b true. The direct algorithm can be broken into two steps: find conjugate directions w.r.t. A, then efficiently solve for the α's and add the weighted directions together to get x. In practice, all these steps can be intermixed in an iterative algorithm that updates x after every iteration.
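To make that concrete, here is a minimal numpy sketch (my own illustration, with a made-up 3x3 system, not anything from the project): given any set of A-conjugate directions, each weight falls out of the formula above independently, and the weighted sum reproduces x.

import numpy as np

# A small symmetric positive-definite system Ax = b (made-up numbers).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

# The eigenvectors of a symmetric A happen to be A-conjugate, so they make a
# convenient stand-in for the directions: for i != j, v_i^T A v_j = lambda_j * v_i^T v_j = 0.
_, vecs = np.linalg.eigh(A)
directions = [vecs[:, i] for i in range(A.shape[0])]

# Each weight is solved for independently: alpha_k = d_k^T b / (d_k^T A d_k).
x = np.zeros_like(b)
for d in directions:
    alpha = (d @ b) / (d @ A @ d)
    x = x + alpha * d

print(np.allclose(A @ x, b))  # True: the weighted directions reconstruct x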

Conjugate Gradient Algorithm:

for k = 1:n

                Find direction d_k

                Find α_k

                x = x + α_k d_k

But how do we find d_k? One can imagine that there are many different ways to compute a set of conjugate directions, but some may combine to form x in fewer iterations than others. This is where different methods appear, such as MINRES, GMRES, and the one I'm concerned with: conjugate gradient. In fact, the manner of selecting d_k is where the conjugate gradient method gets its name!

Imagine we have an energy function E(x) = ½ x^T A x - x^T b + c. Then the gradient (derivative) is E'(x) = Ax - b, and E(x) has its minimum where 0 = b - Ax, or b = Ax (basic calculus). If you were writing an iterative algorithm that finds the minimum of E(x) without solving directly for x, you might just greedily follow the negative gradient until you hit the minimum, and that direction is just b - Ax, where x is your current guess. Take note that b - Ax is both the (negative) gradient and the residual in this special case, and the terms get intermixed frequently. If we just followed this greedy strategy, we'd be doing a steepest descent algorithm. Unfortunately, the directions it produces aren't orthogonal or conjugate, and this leads to inefficiency.
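For contrast, here is a rough numpy sketch of that greedy steepest-descent strategy; the function name, test matrix, tolerance, and iteration cap are my own illustrative choices.

import numpy as np

def steepest_descent(A, b, iters=100):
    # Greedily minimize E(x) = 1/2 x^T A x - x^T b by stepping along the
    # negative gradient r = b - Ax, with an exact line search for this quadratic.
    x = np.zeros_like(b)
    for _ in range(iters):
        r = b - A @ x                    # negative gradient = residual
        if np.linalg.norm(r) < 1e-12:    # already at the minimum
            break
        alpha = (r @ r) / (r @ A @ r)    # optimal step size along r
        x = x + alpha * r
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # made-up symmetric positive-definite example
b = np.array([1.0, 2.0])
print(steepest_descent(A, b), np.linalg.solve(A, b))   # the two should roughly agree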


We'll borrow the idea behind steepest descent to pick our first direction: just say x = 0, plug it into b - Ax, and get b as our first direction. For the rest of the directions, make each new one conjugate to all the previous ones with the following rules:

d_0 = b

d_k = r_{k-1} - ∑_{i<k} ((d_i^T A r_{k-1}) / (d_i^T A d_i)) d_i

Basically, d_k is the previous derivative (the residual r_{k-1}) orthogonalized (or rather, 'conjugatilized', because we multiply by A) against all d_i for i < k. This should feel a lot like Gram-Schmidt orthogonalization. Here's an updated outline of the algorithm:

Conjugate Gradient Algorithm

for k = 1:n

                Find direction d_k

                                for i = 1:k-1

                                                'conjugatilize' d_k with d_i

                Find α_k

                x = x + α_k d_k

 

So as a recap, we now have a good heuristic way of picking directions (d_k) such that we can exploit their conjugate property and solve for their weights (α_k).
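A direct transcription of that outline might look like the numpy sketch below; the helper name cg_naive and the test system are mine, not from the project. It keeps every previous direction around and 'conjugatilizes' the new residual against all of them, which is exactly the work the optimized form gets rid of.

import numpy as np

def cg_naive(A, b):
    # Naive conjugate gradient: each new direction starts as the current
    # residual and is explicitly made A-conjugate to *all* previous directions.
    x = np.zeros_like(b)
    directions = []
    for _ in range(len(b)):
        r = b - A @ x                          # residual = negative gradient at x
        if np.linalg.norm(r) < 1e-12:          # system already solved
            break
        d = r.copy()
        for d_i in directions:                 # 'conjugatilize' against each old direction
            d = d - ((d_i @ A @ r) / (d_i @ A @ d_i)) * d_i
        alpha = (d @ b) / (d @ A @ d)          # weight from the earlier derivation
        x = x + alpha * d
        directions.append(d)
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
print(np.allclose(cg_naive(A, b), np.linalg.solve(A, b)))  # True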

The algorithm listed above captures all the intuition of conjugate gradient, but doesn’t look anything like what is actually implemented. The reason is that the inner for-loop can be optimized away along with the requirement to store all previous bases. I might expand on
this idea in a later post.

This is the more common form:

Conjugate Gradient Algorithm

r_0 = b - A x_0 = b,   d_0 = r_0        (starting from x_0 = 0, so d_0 = b as before)

for k = 0:n-1

                α_k = (r_k^T r_k) / (d_k^T A d_k)

                x_{k+1} = x_k + α_k d_k

                r_{k+1} = r_k - α_k A d_k

                β_k = (r_{k+1}^T r_{k+1}) / (r_k^T r_k)

                d_{k+1} = r_{k+1} + β_k d_k
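Here is one way that common form might be written in numpy (again with a made-up test system); note how the β update recycles the residual, so no old directions need to be stored.

import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    # Standard conjugate gradient for a symmetric positive-definite A.
    x = np.zeros_like(b)
    r = b - A @ x                    # initial residual (= b when starting from x = 0)
    d = r.copy()                     # first direction is the residual, as before
    for _ in range(len(b)):
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        Ad = A @ d
        alpha = rr / (d @ Ad)        # step length along the current direction
        x = x + alpha * d
        r = r - alpha * Ad           # cheap residual update, no fresh b - Ax needed
        beta = (r @ r) / rr          # how much of the old direction to keep
        d = r + beta * d             # new direction: residual plus memory of old d
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
print(np.allclose(conjugate_gradient(A, b), np.linalg.solve(A, b)))  # True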

 

 

Gradient Descent

                Given an energy function of the form E(x) = ½ x^T A x - x^T b + c, our task is to minimize E(x) without directly solving for x, because that would be too expensive.
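One small step worth spelling out (and the reason A needs to be symmetric for this connection to hold): differentiating the quadratic term gives

∇E(x) = ½ (A + A^T) x - b = Ax - b        (using A = A^T)

so setting the gradient to zero is exactly the condition Ax = b; minimizing E(x) and solving the linear system are the same problem.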

 
