Last-in, first-out (LIFO): evict the block that has been in the cache the shortest time.

Least Recently Used (LRU): evict the block whose last use is furthest in the past.

Least Frequently Used (LFU): evict the block that has been accessed the fewest times, breaking ties by choosing the block that has been in the cache the longest.
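The three eviction rules can be sketched as tiny victim-selection functions. The following is an illustrative Python sketch, not part of the text; the bookkeeping structures (insertion order, last-use times, use counts) are assumed to be maintained by the surrounding cache code.

```python
def lifo_victim(insertion_order):
    """LIFO: evict the block that entered the cache most recently
    (the block that has been in the cache the shortest time)."""
    return insertion_order[-1]

def lru_victim(last_use):
    """LRU: evict the block whose last use is furthest in the past.
    last_use maps each cached block to the time of its last access."""
    return min(last_use, key=last_use.get)

def lfu_victim(use_count, insertion_time):
    """LFU: evict the block accessed the fewest times, breaking ties
    in favor of the block that has been in the cache the longest."""
    return min(use_count, key=lambda b: (use_count[b], insertion_time[b]))
```

Each function only chooses the victim; placing the requested block and updating the bookkeeping are left to the caller.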

To analyze these algorithms, we assume that the cache starts out

empty, so that no evictions occur during the first k requests. We wish to

compare the performance of an online algorithm to an optimal offline

algorithm that knows the future requests. As we will soon see, all these

deterministic online algorithms have a lower bound of Ω( k) for their

competitive ratio. Some deterministic algorithms also have a

competitive ratio with an O( k) upper bound, but some other

deterministic algorithms are considerably worse, having a competitive

ratio of Θ( n/ k).

We now proceed to analyze the LIFO and LRU policies. In addition

to assuming that n > k, we will assume that at least k distinct blocks are requested. Otherwise, the cache never fills up and no blocks are evicted,

so that all algorithms exhibit the same behavior. We begin by showing that LIFO has a large competitive ratio.

Theorem 27.2

LIFO has a competitive ratio of Θ( n/ k) for the online caching problem

with n requests and a cache of size k.

Proof We first show a lower bound of Ω(n/k). Suppose that the input consists of k + 1 blocks, numbered 1, 2, …, k + 1, and the request sequence is

1, 2, 3, …, k, k + 1, k, k + 1, k, k + 1, …,

where after the initial 1, 2, …, k, k + 1, the remainder of the sequence alternates between k and k + 1, with a total of n requests. The sequence ends on block k if n and k are either both even or both odd, and otherwise, the sequence ends on block k + 1. That is, b_i = i for i = 1, 2, …, k − 1, b_i = k for i = k, k + 2, k + 4, …, and b_i = k + 1 for i = k + 1, k + 3, k + 5, ….

How many blocks does LIFO evict? After the first k requests (which are

considered to be cache misses), the cache is filled with blocks 1, 2, … , k.

The ( k + 1)st request, which is for block k + 1, causes block k to be evicted. The ( k + 2)nd request, which is for block k, forces block k + 1

to be evicted, since that block was just placed into the cache. This

behavior continues, alternately evicting blocks k and k+1 for the remaining requests. LIFO, therefore, suffers a cache miss on every one

of the n requests.

The optimal offline algorithm knows the entire sequence of requests

in advance. Upon the first request of block k + 1, it just evicts any block

except block k, and then it never evicts another block. Thus, the optimal

offline algorithm evicts only once. Since the first k requests are

considered cache misses, the total number of cache misses is k + 1. The

competitive ratio, therefore, is n/( k + 1), or Ω( n/ k).

For the upper bound, observe that on any input of size n, any

caching algorithm incurs at most n cache misses. Because the input

contains at least k distinct blocks, any caching algorithm, including the

optimal offline algorithm, must incur at least k cache misses. Therefore,

LIFO has a competitive ratio of O( n/ k).
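The lower-bound construction in the proof is easy to check by simulation. The sketch below is illustrative Python (the helper name is mine, not from the text); it builds the alternating request sequence and counts LIFO's misses, which total n, while the offline strategy described above misses only k + 1 times.

```python
def lifo_misses(requests, k):
    """Count cache misses for LIFO with cache size k, starting empty.
    On a miss with a full cache, evict the most recently inserted block."""
    cache, misses = [], 0          # cache[-1] is the newest block
    for b in requests:
        if b in cache:
            continue
        misses += 1
        if len(cache) == k:
            cache.pop()            # LIFO: evict the newest block
        cache.append(b)
    return misses

# Adversarial sequence from the proof: 1, 2, ..., k+1, then alternate
# between k and k+1 until there are n requests in total.
k, n = 10, 1000
seq = list(range(1, k + 2))
while len(seq) < n:
    seq.append(k if seq[-1] == k + 1 else k + 1)
```

Running `lifo_misses(seq, k)` returns n = 1000: every request is a miss, giving the n/(k + 1) = Ω(n/k) ratio.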


We call such a competitive ratio unbounded, because it grows with

the input size. Exercise 27.3-2 asks you to show that LFU also has an

unbounded competitive ratio.

FIFO and LRU have a much better competitive ratio of Θ( k). There

is a big difference between competitive ratios of Θ( n/ k) and Θ( k). The cache size k is independent of the input sequence and does not grow as

more requests arrive over time. A competitive ratio that depends on n,

on the other hand, does grow with the size of the input sequence and

thus can get quite large. It is preferable to use an algorithm with a

competitive ratio that does not grow with the input sequence’s size,

when possible.

We now show that LRU has a competitive ratio of Θ( k), first

showing the upper bound.

Theorem 27.3

LRU has a competitive ratio of O( k) for the online caching problem with n requests and a cache of size k.

Proof To analyze LRU, we will divide the sequence of requests into epochs. Epoch 1 begins with the first request. Epoch i, for i > 1, begins upon encountering the ( k + 1)st distinct request since the beginning of

epoch i − 1. Consider the following example of requests with k = 3:

1, 2, 1, 5, 4, 4, 1, 2, 4, 2, 3, 4, 5, 2, 2, 1, 2, 2.    (27.10)

The first k = 3 distinct requests are for blocks 1, 2 and 5, so epoch 2

begins with the first request for block 4. In epoch 2, the first 3 distinct

requests are for blocks 4, 1, and 2. Requests for these blocks recur until

the request for block 3, and with this request epoch 3 begins. Thus, this

example has four epochs:

1, 2, 1, 5 | 4, 4, 1, 2, 4, 2 | 3, 4, 5 | 2, 2, 1, 2, 2.    (27.11)
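The epoch boundaries can be computed mechanically. The following is an illustrative Python sketch (not from the text) that partitions a request sequence into epochs for a cache of size k:

```python
def epochs(requests, k):
    """Partition requests into epochs: a new epoch begins upon the
    (k+1)st distinct block requested since the current epoch began."""
    result, current, distinct = [], [], set()
    for b in requests:
        if b not in distinct and len(distinct) == k:
            result.append(current)          # close the current epoch
            current, distinct = [], set()
        current.append(b)
        distinct.add(b)
    if current:
        result.append(current)
    return result

# the example with k = 3 from the text
example = [1, 2, 1, 5, 4, 4, 1, 2, 4, 2, 3, 4, 5, 2, 2, 1, 2, 2]
```

Here `epochs(example, 3)` returns the four epochs [1, 2, 1, 5], [4, 4, 1, 2, 4, 2], [3, 4, 5], and [2, 2, 1, 2, 2].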

Now we consider the behavior of LRU. In each epoch, the first time

a request for a particular block appears, it may cause a cache miss, but

subsequent requests for that block within the epoch cannot cause a

cache miss, since the block is now one of the k most recently used. For

example, in epoch 2, the first request for block 4 causes a cache miss, but the subsequent requests for block 4 do not. (Exercise 27.3-1 asks

you to show the contents of the cache after each request.) In epoch 3,

requests for blocks 3 and 5 cause cache misses, but the request for block

4 does not, because it was recently accessed in epoch 2. Since only the

first request for a block within an epoch can cause a cache miss and the

cache holds k blocks, each epoch incurs at most k cache misses.

Now consider the behavior of the optimal algorithm. The first

request in each epoch must cause a cache miss, even for an optimal

algorithm. The miss occurs because, by the definition of an epoch, there

must have been k other blocks accessed since the last access to this block.

Since, for each epoch, the optimal algorithm incurs at least one miss

and LRU incurs at most k, the competitive ratio is at most k/1 = O( k).
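The per-epoch bound is easy to observe concretely. Here is an illustrative Python sketch of an LRU miss counter (not from the text); on the example sequence with k = 3, the four epochs incur 3, 2, 2, and 2 misses (9 in total), each epoch at most k, consistent with the proof.

```python
from collections import OrderedDict

def lru_misses(requests, k):
    """Count cache misses for LRU with cache size k, starting empty."""
    cache, misses = OrderedDict(), 0   # ordered least- to most-recently used
    for b in requests:
        if b in cache:
            cache.move_to_end(b)               # b is now most recently used
        else:
            misses += 1
            if len(cache) == k:
                cache.popitem(last=False)      # evict least recently used
            cache[b] = True
    return misses
```

The OrderedDict keeps the recency order implicitly: hits move a block to the end, and the front is always the least recently used block.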

Exercise 27.3-3 asks you to show that FIFO also has a competitive

ratio of O( k).

We could show lower bounds of Ω( k) on LRU and FIFO, but in fact,

we can make a much stronger statement: any deterministic online

caching algorithm must have a competitive ratio of Ω( k). The proof

relies on an adversary who knows the online algorithm being used and

can tailor the future requests to cause the online algorithm to incur

more cache misses than the optimal offline algorithm.

Consider a scenario in which the cache has size k and the set of

possible blocks to request is {1, 2, … , k + 1}. The first k requests are for blocks 1, 2, … , k, so that both the adversary and the deterministic

online algorithm place these blocks into the cache. The next request is

for block k + 1. In order to make room in the cache for block k + 1, the online algorithm evicts some block b 1 from the cache. The adversary,

knowing that the online algorithm has just evicted block b 1, makes the

next request be for b 1, so that the online algorithm must evict some other block b 2 to clear room in the cache for b 1. As you might have guessed, the adversary makes the next request be for block b 2, so that

the online algorithm evicts some other block b 3 to make room for b 2.


The online algorithm and the adversary continue in this manner. The

online algorithm incurs a cache miss on every request and therefore

incurs n cache misses over the n requests.

Now let’s consider an optimal offline algorithm, which knows the

future. As discussed in Section 15.4, this algorithm is known as furthest-in-future, and it always evicts the block whose next request is furthest in

the future. Since there are only k + 1 unique blocks, when furthest-in-

future evicts a block, we know that it will not be accessed during at least

the next k requests. Thus, after the first k cache misses, the optimal algorithm incurs a cache miss at most once every k requests. Therefore,

the number of cache misses over n requests is at most k + n/ k.
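The furthest-in-future rule is simple to simulate. The sketch below is illustrative Python (not from the text); on a round-robin sequence over k + 1 blocks, it misses only once every k requests after the initial warm-up, within the k + n/k bound.

```python
def fif_misses(requests, k):
    """Count cache misses for the offline furthest-in-future policy:
    on a miss with a full cache, evict the cached block whose next
    request lies furthest in the future (or never occurs again)."""
    n, cache, misses = len(requests), set(), 0
    for i, b in enumerate(requests):
        if b in cache:
            continue
        misses += 1
        if len(cache) == k:
            def next_use(blk):
                for j in range(i + 1, n):
                    if requests[j] == blk:
                        return j
                return n                  # never requested again
            cache.remove(max(cache, key=next_use))
        cache.add(b)
    return misses
```

The linear scan in `next_use` keeps the sketch short; a practical implementation would precompute each block's next-use indices.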

Since the deterministic online algorithm incurs n cache misses and

the optimal offline algorithm incurs at most k + n/k cache misses, the competitive ratio is at least

n / (k + n/k).

For n ≥ k², we have k ≤ n/k, and so the above expression is at least

n / (n/k + n/k) = k/2.

Thus, for sufficiently long request sequences, we have shown the

following:

Theorem 27.4

Any deterministic online algorithm for caching with a cache size of k

has competitive ratio Ω( k).

Although we can analyze the common caching strategies from the

point of view of competitive analysis, the results are somewhat

unsatisfying. Yes, we can distinguish between algorithms with a

competitive ratio of Θ( k) and those with unbounded competitive ratios.

In the end, however, all of these competitive ratios are rather high. The

online algorithms we have seen so far are deterministic, and it is this

property that the adversary is able to exploit.

27.3.2 Randomized caching algorithms

If we don’t limit ourselves to deterministic online algorithms, we can use

randomization to develop an online caching algorithm with a

significantly smaller competitive ratio. Before describing the algorithm,

let’s discuss randomization in online algorithms in general. Recall that

we analyze online algorithms with respect to an adversary who knows

the online algorithm and can design requests knowing the decisions

made by the online algorithm. With randomization, we must ask

whether the adversary also knows the random choices made by the

online algorithm. An adversary who does not know the random choices

is oblivious, and an adversary who knows the random choices is

nonoblivious. Ideally, we prefer to design algorithms against a

nonoblivious adversary, as this adversary is stronger than an oblivious

one. Unfortunately, a nonoblivious adversary mitigates much of the

power of randomness, as an adversary who knows the outcome of

random choices typically can act as if the online algorithm is

deterministic. The oblivious adversary, on the other hand, does not

know the random choices of the online algorithm, and that is the

adversary we typically use.

As a simple illustration of the difference between an oblivious and

nonoblivious adversary, imagine that you are flipping a fair coin n times,

and the adversary wants to know how many heads you flipped. A

nonoblivious adversary knows, after each flip, whether the coin came up

heads or tails, and hence knows how many heads you flipped. An

oblivious adversary, on the other hand, knows only that you are flipping

a fair coin n times. The oblivious adversary, therefore, can reason that

the number of heads follows a binomial distribution, so that the

expected number of heads is n/2 (by equation (C.41) on page 1199) and

the variance is n/4 (by equation (C.44) on page 1200). But the oblivious

adversary has no way of knowing exactly how many heads you actually

flipped.

Let’s return to caching. We’ll start with a deterministic algorithm

and then randomize it. The algorithm we’ll use is an approximation of

LRU called MARKING. Rather than “least recently used,” think of

MARKING as simply “recently used.” MARKING maintains a 1-bit

attribute mark for each block in the cache. Initially, all blocks in the cache are unmarked. When a block is requested, if it is already in the

cache, it is marked. If the request is a cache miss, MARKING checks to

see whether there are any unmarked blocks in the cache. If all blocks are

marked, then they are all changed to unmarked. Now, regardless of

whether all blocks in the cache were marked when the request occurred,

there is at least one unmarked block in the cache, and so an arbitrary

unmarked block is evicted, and the requested block is placed into the

cache and marked.

How should the block to evict from among the unmarked blocks in

the cache be chosen? The procedure RANDOMIZED-MARKING on

the next page shows the process when the block is chosen randomly. The

procedure takes as input a block b being requested.

RANDOMIZED-MARKING(b)

1  if block b resides in the cache
2      b.mark = 1
3  else
4      if all blocks b′ in the cache have b′.mark = 1
5          unmark all blocks b′ in the cache, setting b′.mark = 0
6      select an unmarked block u (one with u.mark = 0) uniformly at random
7      evict block u
8      place block b into the cache
9      b.mark = 1
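The procedure translates directly into code. The following is an illustrative Python sketch (the names are mine, not from the text); it evicts only once the cache is full, matching the assumption that no evictions occur during the first k requests.

```python
import random

def randomized_marking_misses(requests, k, rng=None):
    """Simulate RANDOMIZED-MARKING with cache size k, starting empty,
    and return the number of cache misses."""
    rng = rng or random.Random(0)
    cache, marked, misses = set(), set(), 0
    for b in requests:
        if b in cache:
            marked.add(b)                            # line 2: mark on a hit
            continue
        misses += 1                                  # cache miss
        if len(cache) == k:                          # evict only when full
            if marked == cache:                      # line 4: all marked
                marked.clear()                       # line 5: unmark all
            u = rng.choice(sorted(cache - marked))   # line 6
            cache.remove(u)                          # line 7: evict u
        cache.add(b)                                 # line 8: bring b in
        marked.add(b)                                # line 9: mark b
    return misses
```

Whatever the random choices, each epoch incurs at most k misses and at least one miss per new request, so on the example sequence with k = 3 the miss count always lies between 8 and 12.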

For the purpose of analysis, we say that a new epoch begins

immediately after each time line 5 executes. An epoch starts with no

marked blocks in the cache. The first time a block is requested during an

epoch, the number of marked blocks increases by 1, and any subsequent

requests to that block do not change the number of marked blocks.

Therefore, the number of marked blocks monotonically increases within

an epoch. Under this view, epochs are the same as in the proof of

Theorem 27.3: with a cache that holds k blocks, an epoch comprises

requests for k distinct blocks (possibly fewer for the final epoch), and the next epoch begins upon a request for a block not in those k.


Because we are going to analyze a randomized algorithm, we will

compute the expected competitive ratio. Recall that for an input I, we

denote the solution value of an online algorithm A by A(I) and the solution value of an optimal algorithm F by F(I). Online algorithm A has an expected competitive ratio c if for all inputs I, we have

E[A(I)] ≤ c · F(I),

where the expectation is taken over the random choices made by A.

Although the deterministic MARKING algorithm has a competitive

ratio of Θ( k) (Theorem 27.4 provides the lower bound and see Exercise

27.3-4 for the upper bound), RANDOMIZED-MARKING has a much

smaller expected competitive ratio, namely O(lg k). The key to the improved competitive ratio is that the adversary cannot always make a

request for a block that is not in the cache, since an oblivious adversary

does not know which blocks are in the cache.

Theorem 27.5

RANDOMIZED-MARKING has an expected competitive ratio of

O(lg k) for the online caching problem with n requests and a cache of size k, against an oblivious adversary.

Before proving Theorem 27.5, we prove a basic probabilistic fact.

Lemma 27.6

Suppose that a bag contains x + y balls: x − 1 blue balls, y white balls, and 1 red ball. You repeatedly choose a ball at random and remove it

from the bag until you have chosen a total of m balls that are either blue

or red, where m ≤ x. You set aside each white ball you choose. Then, one of the balls chosen is the red ball with probability m/x.

Proof Choosing a white ball does not affect how many blue or red balls

are chosen in any way. Therefore, we can continue the analysis as if

there were no white balls and the bag contains just x − 1 blue balls and

1 red ball.

Let A be the event that the red ball is not chosen, and let A_i be the event that the ith draw does not choose the red ball. By equation (C.22) on page 1190, we have

Pr{A} = Pr{A_1} · Pr{A_2 | A_1} ⋯ Pr{A_m | A_1 ∩ A_2 ∩ ⋯ ∩ A_{m−1}}.    (27.13)

The probability Pr{A_1} that the first ball is blue equals (x − 1)/x, since initially there are x − 1 blue balls and 1 red ball. More generally, we have

Pr{A_i | A_1 ∩ A_2 ∩ ⋯ ∩ A_{i−1}} = (x − i)/(x − i + 1),    (27.14)

since the ith draw is from among x − i blue balls and 1 red ball. Equations (27.13) and (27.14) give

Pr{A} = ((x − 1)/x) · ((x − 2)/(x − 1)) ⋯ ((x − m)/(x − m + 1)).    (27.15)

The right-hand side of equation (27.15) is a telescoping product, similar

to the telescoping series in equation (A.12) on page 1143. The

numerator of one term equals the denominator of the next, so that

everything except the first denominator and last numerator cancel, and

we obtain Pr{A} = (x − m)/x. Since we actually want to compute Pr{Ā} = 1 − Pr{A}, that is, the probability that the red ball is chosen, we get Pr{Ā} = 1 − (x − m)/x = m/x.
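Because white balls are irrelevant and the product telescopes, the m/x probability can be checked exactly with rational arithmetic. An illustrative Python sketch (not from the text):

```python
from fractions import Fraction

def red_ball_probability(x, m):
    """Exact probability that the red ball is among the first m
    blue-or-red balls drawn, given x - 1 blue balls and 1 red ball
    (white balls set aside, as in the lemma). Multiplies the chain of
    conditional probabilities from equations (27.13) and (27.14)."""
    assert 1 <= m <= x
    pr_never_red = Fraction(1)
    for i in range(1, m + 1):
        # before the i-th relevant draw: x - i blue balls, 1 red ball
        pr_never_red *= Fraction(x - i, x - i + 1)
    return 1 - pr_never_red
```

The telescoping product collapses to (x − m)/x, so the result always equals m/x.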

Now we can prove Theorem 27.5.

Proof We’ll analyze RANDOMIZED-MARKING one epoch at a

time. Within epoch i, any request for a block b that is not the first request for block b in epoch i must result in a cache hit, since after the first request in epoch i, block b resides in the cache and is marked, so that it cannot be evicted during the epoch. Therefore, since we are

counting cache misses, we’ll consider only the first request for each

block within each epoch, disregarding all other requests.

We can classify the requests in an epoch as either old or new. If block

b resides in the cache at the start of epoch i, each request for block b during epoch i is an old request. Old requests in epoch i are for blocks requested in epoch i − 1. If a request in epoch i is not old, it is a new

request, and it is for a block not requested in epoch i − 1. All requests in epoch 1 are new. For example, let’s look again at the request sequence in

example (27.11):

1, 2, 1, 5

4, 4, 1, 2, 4, 2

3, 4, 5

2, 2, 1, 2, 2.

Since we can disregard all requests for a block within an epoch other

than the first request, to analyze the cache behavior, we can view this

request sequence as just

1, 2, 5

4, 1, 2

3, 4, 5

2, 1.

All three requests in epoch 1 are new. In epoch 2, the requests for blocks

1 and 2 are old, but the request for block 4 is new. In epoch 3, the

request for block 4 is old, and the requests for blocks 3 and 5 are new.

Both requests in epoch 4 are new.

Within an epoch, each new request must cause a cache miss since, by

definition, the block is not already in the cache. An old request, on the

other hand, may or may not cause a cache miss. The old block is in the

cache at the beginning of the epoch, but other requests might cause it to

be evicted. Returning to our example, in epoch 2, the request for block 4

must cause a cache miss, as this request is new. The request for block 1,

which is old, may or may not cause a cache miss. If block 1 was evicted

when block 4 was requested, then a cache miss occurs and block 1 must

be brought back into the cache. If instead block 1 was not evicted when

block 4 was requested, then the request for block 1 results in a cache hit.

The request for block 2 could incur a cache miss under two scenarios.

One is if block 2 was evicted when block 4 was requested. The other is if

block 1 was evicted when block 4 was requested, and then block 2 was

evicted when block 1 was requested. We see that, within an epoch, each

ensuing old request has an increasing chance of causing a cache miss.

Because we consider only the first request for each block within an

epoch, we assume that each epoch contains exactly k requests, and each

request within an epoch is for a unique block. (The last epoch might

contain fewer than k requests. If it does, just add dummy requests to fill

it out to k requests.) In epoch i, denote the number of new requests by ri

≥ 1 (an epoch must contain at least one new request), so that the

number of old requests is k − r_i. As mentioned above, a new request always incurs a cache miss.

Let us now focus on an arbitrary epoch i to obtain a bound on the

expected number of cache misses within that epoch. In particular, let’s

think about the j th old request within the epoch, where 1 ≤ j < k.

Denote by b_ij the block requested in the jth old request of epoch i, and denote by n_ij and o_ij the number of new and old requests, respectively, that occur within epoch i but before the jth old request. Because j − 1 old requests occur before the jth old request, we have o_ij = j − 1. We will show that the probability of a cache miss upon the jth old request is n_ij/(k − o_ij), or n_ij/(k − j + 1).

Start by considering the first old request, for block bi,1. What is the

probability that this request causes a cache miss? It causes a cache miss

precisely when one of the ni,1 previous requests resulted in bi,1 being evicted. We can determine the probability that bi,1 was chosen for

eviction by using Lemma 27.6: consider the k blocks in the cache to be k

balls, with block bi,1 as the red ball, the other k − 1 blocks as the k − 1

blue balls, and no white balls. Each of the ni,1 requests chooses a block

to evict with equal probability, corresponding to drawing balls ni,1

times. Thus, we can apply Lemma 27.6 with x = k, y = 0, and m = ni,1, deriving the probability of a cache miss upon the first old request as

n_{i,1}/k, which equals n_ij/(k − j + 1) since j = 1.

In order to determine the probability of a cache miss for subsequent

old requests, we’ll need an additional observation. Let’s consider the

second old request, which is for block bi,2. This request causes a cache

miss precisely when one of the previous requests evicts bi,2. Let’s

consider two cases, based on the request for bi,1. In the first case, suppose that the request for bi,1 did not cause an eviction, because bi,1

was already in the cache. Then, the only way that bi,2 could have been

evicted is by one of the ni,2 new requests that precedes it. What is the

probability that this eviction happens? There are ni,2 chances for bi,2 to

be evicted, but we also know that there is one block in the cache, namely bi,1, that is not evicted. Thus, we can again apply Lemma 27.6, but with

bi,1 as the white ball, bi,2 as the red ball, the remaining blocks as the blue balls, and drawing balls ni,2 times. Applying Lemma 27.6, with x =

k − 1, y = 1, and m = ni,2, we find that the probability of a cache miss is ni,2/( k − 1). In the second case, the request for bi,1 does cause an eviction, which can happen only if one of the new requests preceding the

request for bi,1 evicts bi,1. Then, the request for bi,1 brings bi,1 back into the cache and evicts some other block. In this case, we know that of

the new requests, one of them did not result in bi,2 being evicted, since

bi,1 was evicted. Therefore, ni,2 − 1 new requests could evict bi,2, as could the request for bi,1, so that the number of requests that could evict bi,2 is ni,2. Each such request evicts a block chosen from among k

− 1 blocks, since the request that resulted in evicting bi,1 did not also

cause bi,2 to be evicted. Therefore, we can apply Lemma 27.6, with x =

k − 1, y = 1, and m = n_{i,2}, and get that the probability of a miss is n_{i,2}/(k − 1). In both cases the probability is the same, and it equals n_ij/(k − j + 1) since j = 2.

More generally, o_ij old requests occur before the jth old request. Each of these prior old requests either caused an eviction or did not. For those that caused an eviction, it is because the requested block had been evicted by a previous request, and for those that did not cause an eviction, it is because the requested block was not evicted by any previous request. In either case, we can decrease the number of blocks that the random process is choosing from by 1 for each old request, and thus o_ij requests cannot cause b_ij to be evicted. Therefore, we can use Lemma 27.6 to determine the probability that b_ij was evicted by a previous request, with x = k − o_ij, y = o_ij, and m = n_ij. Thus, we have proven our claim that the probability of a cache miss on the jth request for an old block is n_ij/(k − o_ij), or n_ij/(k − j + 1). Since n_ij ≤ r_i (recall that r_i is the number of new requests during

epoch i), we have an upper bound of r_i/(k − j + 1) on the probability that the jth old request incurs a cache miss.

We can now compute the expected number of misses during epoch i

using indicator random variables, as introduced in Section 5.2. We define indicator random variables

Yij = I{the j th old request in epoch i incurs a cache miss},

Zij = I{the j th new request in epoch i incurs a cache miss}.

We have Z_ij = 1 for j = 1, 2, …, r_i, since every new request results in a cache miss. Let X_i be the random variable denoting the number of cache misses during epoch i, so that

X_i = Σ_{j=1}^{r_i} Z_ij + Σ_{j=1}^{k−r_i} Y_ij,

and so

E[X_i] = r_i + Σ_{j=1}^{k−r_i} E[Y_ij]
       ≤ r_i + Σ_{j=1}^{k−r_i} r_i/(k − j + 1)
       = r_i + r_i(H_k − H_{r_i})
       ≤ r_i H_k    (since r_i ≥ 1 implies H_{r_i} ≥ 1),

where H_k is the kth harmonic number.

To compute the expected total number of cache misses, we sum over

all epochs. Let p denote the number of epochs and X be the random variable denoting the total number of cache misses. Then, we have

X = Σ_{i=1}^{p} X_i,

so that

E[X] ≤ H_k Σ_{i=1}^{p} r_i.    (27.17)

To complete the analysis, we need to understand the behavior of the

optimal offline algorithm. It could make a completely different set of

decisions from those made by RANDOMIZED-MARKING, and at

any point its cache may look nothing like the cache of the randomized

algorithm. Yet, we want to relate the number of cache misses of the

optimal offline algorithm to the value in inequality (27.17), in order to

have a competitive ratio that does not depend on Σ_{i=1}^{p} r_i. Focusing on

individual epochs won’t suffice. At the beginning of any epoch, the

offline algorithm might have loaded the cache with exactly the blocks

that will be requested in that epoch. Therefore, we cannot take any one

epoch in isolation and claim that an offline algorithm must suffer any

cache misses during that epoch.

If we consider two consecutive epochs, however, we can better

analyze the optimal offline algorithm. Consider two consecutive epochs,

i −1 and i. Each contains k requests for k different blocks. (Recall our assumption that all requests are first requests in an epoch.) Epoch i

contains ri requests for new blocks, that is, blocks that were not

requested during epoch i − 1. Therefore, the number of distinct requests

during epochs i − 1 and i is exactly k + r_i. No matter what the cache contents were at the beginning of epoch i − 1, after k + r_i distinct requests, there must be at least r_i cache misses. There could be more, but

there is no way to have fewer. Letting mi denote the number of cache

misses of the offline algorithm during epoch i, we have just argued that

m_{i−1} + m_i ≥ r_i

for i = 2, 3, …, p. The total number of cache misses of the offline algorithm is

Σ_{i=1}^{p} m_i ≥ (1/2) (m_1 + Σ_{i=2}^{p} (m_{i−1} + m_i))
              ≥ (1/2) (m_1 + Σ_{i=2}^{p} r_i)
              = (1/2) Σ_{i=1}^{p} r_i.

The justification m_1 = r_1 for the last equality follows because, by our assumptions, the cache starts out empty and every request incurs a cache miss in the first epoch, even for the optimal offline algorithm.

To conclude the analysis, because we have an upper bound of H_k Σ_{i=1}^{p} r_i on the expected number of cache misses for RANDOMIZED-MARKING and a lower bound of (1/2) Σ_{i=1}^{p} r_i on the number of cache misses for the optimal offline algorithm, the expected competitive ratio is at most

(H_k Σ_{i=1}^{p} r_i) / ((1/2) Σ_{i=1}^{p} r_i) = 2H_k = O(lg k).

Exercises

27.3-1

For the cache sequence (27.10), show the contents of the cache after

each request and count the number of cache misses. How many misses

does each epoch incur?

27.3-2

Show that LFU has a competitive ratio of Θ( n/ k) for the online caching

problem with n requests and a cache of size k.

27.3-3

Show that FIFO has a competitive ratio of O( k) for the online caching problem with n requests and a cache of size k.

27.3-4

Show that the deterministic MARKING algorithm has a competitive

ratio of O( k) for the online caching problem with n requests and a cache of size k.

27.3-5

Theorem 27.4 shows that any deterministic online algorithm for caching

has a competitive ratio of Ω( k), where k is the cache size. One way in

which an algorithm might be able to perform better is to have some

ability to know what the next few requests will be. We say that an

algorithm is l-lookahead if it has the ability to look ahead at the next l

requests. Prove that for every constant l ≥ 0 and every cache size k ≥1, every deterministic l-lookahead algorithm has competitive ratio Ω( k).

Problems

27-1 Cow-path problem

The Appalachian Trail (AT) is a marked hiking trail in the eastern

United States extending between Springer Mountain in Georgia and

Mount Katahdin in Maine. The trail is about 2,190 miles long. You

decide that you are going to hike the AT from Georgia to Maine and

back. You plan to learn more about algorithms while on the trail, and

so you bring along your copy of Introduction to Algorithms in your

backpack. 2 You have already read through this chapter before starting out. Because the beauty of the trail distracts you, you forget about

reading this book until you have reached Maine and hiked halfway back

to Georgia. At that point, you decide that you have already seen the

trail and want to continue reading the rest of the book, starting with

Chapter 28. Unfortunately, you find that the book is no longer in your pack. You must have left it somewhere along the trail, but you don’t

know where. It could be anywhere between Georgia and Maine. You

want to find the book, but now that you have learned something about

Image 946

online algorithms, you want your algorithm for finding it to have a good

competitive ratio. That is, no matter where the book is, if its distance

from you is x miles away, you would like to be sure that you do not walk

more than cx miles to find it, for some constant c. You do not know x, though you may assume that x ≥ 1. 3

What algorithm should you use, and what constant c can you prove

bounds the total distance cx that you would have to walk? Your

algorithm should work for a trail of any length, not just the 2,190-mile-

long AT.

27-2 Online scheduling to minimize average completion time

Problem 15-2 discusses scheduling to minimize average completion time

on one machine, without release times and preemption and with release

times and preemption. Now you will develop an online algorithm for

nonpreemptively scheduling a set of tasks with release times. Suppose

you are given a set S = { a 1, a 2, … , an} of tasks, where task ai has release time ri, before which it cannot start, and requires pi units of processing time to complete once it has started. You have one computer

on which to run the tasks. Tasks cannot be preempted, which is to say

that once started, a task must run to completion without interruption.

(See Problem 15-2 on page 446 for a more detailed description of this

problem.) Given a schedule, let Ci be the completion time of task ai, that is, the time at which task ai completes processing. Your goal is to find a

schedule that minimizes the average completion time, that is, to

minimize

(1/n) Σ_{i=1}^{n} C_i.

In the online version of this problem, you learn about task i only

when it arrives at its release time ri, and at that point, you know its processing time pi. The offline version of this problem is NP-hard (see

Chapter 34), but you will develop a 2-competitive online algorithm.

a. Show that, if there are release times, scheduling by shortest processing

time (when the machine becomes idle, start the already released task

with the smallest processing time that has not yet run) is not d-

competitive for any constant d.


In order to develop an online algorithm, consider the preemptive

version of this problem, which is discussed in Problem 15-2(b). One way

to schedule is to run the tasks according to the shortest remaining

processing time (SRPT) order. That is, at any point, the machine is

running the available task with the smallest amount of remaining

processing time.

b. Explain how to run SRPT as an online algorithm.

c. Suppose that you run SRPT and obtain completion times C̄_1, C̄_2, …, C̄_n. Show that

Σ_{i=1}^{n} C̄_i ≤ Σ_{i=1}^{n} C*_i,

where the C*_i are the completion times in an optimal nonpreemptive schedule.

Consider the (offline) algorithm COMPLETION-TIME-SCHEDULE.

COMPLETION-TIME-SCHEDULE(S)

1  compute an optimal schedule for the preemptive version of the problem
2  renumber the tasks so that a_1, …, a_n are ordered by their completion times in the optimal preemptive (SRPT) schedule
3  greedily schedule the tasks nonpreemptively in the renumbered order a_1, …, a_n
4  let C_1, …, C_n be the completion times of renumbered tasks a_1, …, a_n in this nonpreemptive schedule
5  return C_1, …, C_n

d. Prove that

for i = 1, … , n.

e. Prove that

for i = 1, … , n.

f. Algorithm COMPLETION-TIME-SCHEDULE is an offline

algorithm. Explain how to modify it to produce an online algorithm.

g. Combine parts (c)–(f) to show that the online version of

COMPLETION-TIME-SCHEDULE is 2-competitive.

Chapter notes

Online algorithms are widely used in many domains. Some good

overviews include the textbook by Borodin and El-Yaniv [68], the collection of surveys edited by Fiat and Woeginger [142], and the survey by Albers [14].

The move-to-front heuristic from Section 27.2 was analyzed by Sleator and Tarjan [416, 417] as part of their early work on amortized analysis. This rule works quite well in practice.

Competitive analysis of online caching also originated with Sleator

and Tarjan [417]. The randomized marking algorithm was proposed and analyzed by Fiat et al. [141]. Young [464] surveys online caching and paging algorithms, and Buchbinder and Naor [76] survey primal-dual online algorithms.

Specific types of online algorithms are described using other names.

Dynamic graph algorithms are online algorithms on graphs, where at

each step a vertex or edge undergoes modification. Typically a vertex or

edge is either inserted or deleted, or some associated property, such as

edge weight, changes. Some graph problems need to be solved again

after each change to the graph, and a good dynamic graph algorithm

will not need to solve the problem from scratch. For example, as edges are inserted and deleted, the minimum spanning tree must be recomputed after each change to the graph. Exercise 21.2-8 asks such a question. Similar questions

can be asked for other graph algorithms, such as shortest paths,

connectivity, or matching. The first paper in this field is credited to Even

and Shiloach [138], who study how to maintain a shortest-path tree as edges are being deleted from a graph. Since then hundreds of papers

have been published. Demetrescu et al. [110] survey early developments in dynamic graph algorithms.

For massive data sets, the input data might be too large to store.

Streaming algorithms model this situation by requiring the memory

used by an algorithm to be significantly smaller than the input size. For


example, you may have a graph with n vertices and m edges with m ≫ n, but the memory allowed may be only O( n). Or you may have n numbers, but the memory allowed may be only O(lg n). A streaming

algorithm is measured by the number of passes made over the data in

addition to the running time of the algorithm. McGregor [322] surveys streaming algorithms for graphs and Muthukrishnan [341] surveys general streaming algorithms.

1 The path-compression heuristic in Section 19.3 resembles MOVE-TO-FRONT, although it would be more accurately expressed as “move-to-next-to-front.” Unlike MOVE-TO-FRONT in

a doubly linked list, path compression can relocate multiple elements to become “next-to-front.”

2 This book is heavy. We do not recommend that you carry it on a long hike.

3 In case you’re wondering what this problem has to do with cows, some papers about it frame the problem as a cow looking for a field in which to graze.

28 Matrix Operations

Because operations on matrices lie at the heart of scientific computing,

efficient algorithms for working with matrices have many practical

applications. This chapter focuses on how to multiply matrices and solve

sets of simultaneous linear equations. Appendix D reviews the basics of matrices.

Section 28.1 shows how to solve a set of linear equations using LUP

decompositions. Then, Section 28.2 explores the close relationship between multiplying and inverting matrices. Finally, Section 28.3

discusses the important class of symmetric positive-definite matrices

and shows how to use them to find a least-squares solution to an

overdetermined set of linear equations.

One important issue that arises in practice is numerical stability.

Because actual computers have limits to how precisely they can

represent floating-point numbers, round-off errors in numerical

computations may become amplified over the course of a computation,

leading to incorrect results. Such computations are called numerically

unstable. Although we’ll briefly consider numerical stability on

occasion, we won’t focus on it in this chapter. We refer you to the

excellent book by Higham [216] for a thorough discussion of stability issues.

28.1 Solving systems of linear equations


Numerous applications need to solve sets of simultaneous linear

equations. A linear system can be cast as a matrix equation in which

each matrix or vector element belongs to a field, typically the real

numbers ℝ. This section discusses how to solve a system of linear

equations using a method called LUP decomposition.

The process starts with a set of linear equations in n unknowns x1, x2, … , xn:

a11x1 + a12x2 + ⋯ + a1nxn = b1,
a21x1 + a22x2 + ⋯ + a2nxn = b2,
  ⋮
an1x1 + an2x2 + ⋯ + annxn = bn.    (28.1)

A solution to the equations (28.1) is a set of values for x 1, x 2, … , xn that satisfy all of the equations simultaneously. In this section, we treat

only the case in which there are exactly n equations in n unknowns.

Next, rewrite equations (28.1) as the matrix-vector equation

| a11  a12  ⋯  a1n | | x1 |   | b1 |
| a21  a22  ⋯  a2n | | x2 | = | b2 |
|  ⋮    ⋮        ⋮  | |  ⋮ |   |  ⋮ |
| an1  an2  ⋯  ann | | xn |   | bn |

or, equivalently, letting A = (aij), x = (xi), and b = (bi), as

Ax = b.    (28.2)

If A is nonsingular, it possesses an inverse A−1, and

x = A−1b    (28.3)

is the solution vector. We can prove that x is the unique solution to equation (28.2) as follows. If there are two solutions, x and x′, then Ax = Ax′ = b and, letting I denote an identity matrix,

x = Ix
  = (A−1A)x
  = A−1(Ax)
  = A−1(Ax′)
  = (A−1A)x′
  = Ix′
  = x′.

This section focuses on the case in which A is nonsingular or,

equivalently (by Theorem D.1 on page 1220), the rank of A equals the

number n of unknowns. There are other possibilities, however, which

merit a brief discussion. If the number of equations is less than the

number n of unknowns—or, more generally, if the rank of A is less than

n—then the system is underdetermined. An underdetermined system

typically has infinitely many solutions, although it may have no

solutions at all if the equations are inconsistent. If the number of

equations exceeds the number n of unknowns, the system is

overdetermined, and there may not exist any solutions. Section 28.3

addresses the important problem of finding good approximate solutions

to overdetermined systems of linear equations.

Let’s return to the problem of solving the system Ax = b of n equations in n unknowns. One option is to compute A−1 and then, using equation (28.3), multiply b by A−1, yielding x = A−1 b. This approach suffers in practice from numerical instability. Fortunately,

another approach—LUP decomposition—is numerically stable and has

the further advantage of being faster in practice.

Overview of LUP decomposition

The idea behind LUP decomposition is to find three n × n matrices L, U, and P such that

PA = LU,    (28.4)

where

L is a unit lower-triangular matrix,

U is an upper-triangular matrix, and

P is a permutation matrix.


We call matrices L, U, and P satisfying equation (28.4) an LUP

decomposition of the matrix A. We’ll show that every nonsingular matrix A possesses such a decomposition.

Computing an LUP decomposition for the matrix A has the

advantage that linear systems can be efficiently solved when they are

triangular, as is the case for both matrices L and U. If you have an LUP

decomposition for A, you can solve equation (28.2), Ax = b, by solving only triangular linear systems, as follows. Multiply both sides of Ax = b

by P, yielding the equivalent equation PAx = Pb. By Exercise D.1-4 on page 1219, multiplying both sides by a permutation matrix amounts to

permuting the equations (28.1). By the decomposition (28.4),

substituting LU for PA gives

LUx = Pb.

You can now solve this equation by solving two triangular linear systems. Define y = Ux, where x is the desired solution vector. First, solve the lower-triangular system

Ly = Pb    (28.5)

for the unknown vector y by a method called “forward substitution.” Having solved for y, solve the upper-triangular system

Ux = y    (28.6)

for the unknown x by a method called “back substitution.” Why does

this process solve Ax = b? Because the permutation matrix P is invertible (see Exercise D.2-3 on page 1223), multiplying both sides of equation (28.4) by P−1 gives P−1PA = P−1LU, so that

A = P−1LU.    (28.7)

Hence, the vector x that satisfies Ux = y is the solution to Ax = b:

Ax = P−1LUx (by equation (28.7))
   = P−1Ly  (by equation (28.6))
   = P−1Pb  (by equation (28.5))
   = b.


The next step is to show how forward and back substitution work

and then attack the problem of computing the LUP decomposition

itself.

Forward and back substitution

Forward substitution can solve the lower-triangular system (28.5) in Θ(n2) time, given L, P, and b. An array π[1 : n] provides a more compact format to represent the permutation P than an n × n matrix that is mostly 0s. For i = 1, 2, … , n, the entry π[i] indicates that Pi,π[i] = 1 and Pij = 0 for j ≠ π[i]. Thus, PA has aπ[i]j in row i and column j, and Pb has bπ[i] as its ith element. Since L is unit lower-triangular, the matrix equation Ly = Pb is equivalent to the n equations

y1 = bπ[1],
l21y1 + y2 = bπ[2],
l31y1 + l32y2 + y3 = bπ[3],
  ⋮
ln1y1 + ln2y2 + ln3y3 + ⋯ + yn = bπ[n].

The first equation gives y1 = bπ[1] directly. Knowing the value of y1, you can substitute it into the second equation, yielding

y2 = bπ[2] − l21y1.

Next, you can substitute both y1 and y2 into the third equation, obtaining

y3 = bπ[3] − (l31y1 + l32y2).

In general, you substitute y1, y2, … , yi−1 “forward” into the ith equation to solve for yi:

yi = bπ[i] − (li1y1 + li2y2 + ⋯ + li,i−1yi−1).

Once you’ve solved for y, you can solve for x in equation (28.6) using

back substitution, which is similar to forward substitution. This time, you solve the n th equation first and work backward to the first

equation. Like forward substitution, this process runs in Θ( n 2) time.

Since U is upper-triangular, the matrix equation Ux = y is equivalent to the n equations

u11x1 + u12x2 + ⋯ + u1,n−2xn−2 + u1,n−1xn−1 + u1nxn = y1,
        u22x2 + ⋯ + u2,n−2xn−2 + u2,n−1xn−1 + u2nxn = y2,
          ⋮
                    un−2,n−2xn−2 + un−2,n−1xn−1 + un−2,nxn = yn−2,
                                   un−1,n−1xn−1 + un−1,nxn = yn−1,
                                                    un,nxn = yn.

Thus, you can solve for xn, xn−1, … , x1 successively as follows:

xn = yn/un,n,
xn−1 = (yn−1 − un−1,nxn)/un−1,n−1,
xn−2 = (yn−2 − (un−2,n−1xn−1 + un−2,nxn))/un−2,n−2,

or, in general,

xi = (yi − (ui,i+1xi+1 + ui,i+2xi+2 + ⋯ + uinxn))/uii.

Given P, L, U, and b, the procedure LUP-SOLVE on the next page solves for x by combining forward and back substitution. The

permutation matrix P is represented by the array π. The procedure first solves for y using forward substitution in lines 2–3, and then it solves for

x using backward substitution in lines 4–5. Since the summation within


each of the for loops includes an implicit loop, the running time is

Θ( n 2).

As an example of these methods, consider the system of linear

equations defined by Ax = b, where

LUP-SOLVE(L, U, π, b, n)
1 let x and y be new vectors of length n
2 for i = 1 to n
3     yi = bπ[i] − Σ_{j=1}^{i−1} lij yj
4 for i = n downto 1
5     xi = (yi − Σ_{j=i+1}^{n} uij xj) / uii
6 return x

and we want to solve for the unknown x. The LUP decomposition is

(You might want to verify that PA = LU.) Using forward substitution,

solve Ly = Pb for y:

obtaining

by computing first y 1, then y 2, and finally y 3. Then, using back substitution, solve Ux = y for x:


thereby obtaining the desired answer

by computing first x 3, then x 2, and finally x 1.
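The procedure above translates directly into code. The following is a minimal Python sketch of LUP-SOLVE (the function name `lup_solve` and the list-of-lists matrix representation are our own, and indices are 0-based rather than the book's 1-based):

```python
def lup_solve(L, U, pi, b):
    """Solve Ax = b given an LUP decomposition PA = LU.
    L, U are n x n lists of lists; pi[i] is the column of the single
    1 in row i of P (0-indexed); b is a length-n list."""
    n = len(L)
    # Forward substitution: solve Ly = Pb.
    y = [0.0] * n
    for i in range(n):
        y[i] = b[pi[i]] - sum(L[i][j] * y[j] for j in range(i))
    # Back substitution: solve Ux = y.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x
```

Each of the two loops does Θ(i) work on iteration i, giving the Θ(n2) total claimed in the text.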

Computing an LU decomposition

Given an LUP decomposition for a nonsingular matrix A, you can use

forward and back substitution to solve the system Ax = b of linear equations. Now let’s see how to efficiently compute an LUP

decomposition for A. We start with the simpler case in which A is an n ×

n nonsingular matrix and P is absent (or, equivalently, P = In, the n × n identity matrix), so that A = LU. We call the two matrices L and U an LU decomposition of A.

To create an LU decomposition, we’ll use a process known as

Gaussian elimination. Start by subtracting multiples of the first equation

from the other equations in order to remove the first variable from those

equations. Then subtract multiples of the second equation from the

third and subsequent equations so that now the first and second

variables are removed from them. Continue this process until the system

that remains has an upper-triangular form—this is the matrix U. The

matrix L comprises the row multipliers that cause variables to be

eliminated.

To implement this strategy, let’s start with a recursive formulation.

The input is an n × n nonsingular matrix A. If n = 1, then nothing needs to be done: just choose L = I 1 and U = A. For n > 1, break A into four parts:

A = | a11   wT |,    (28.8)
    | v     A′ |

where v = (a21, a31, … , an1) is a column (n − 1)-vector, wT = (a12, a13, … , a1n) is a row (n − 1)-vector, and A′ is an (n − 1) × (n − 1) matrix.

Then, using matrix algebra (verify the equations by simply multiplying

through), factor A as

A = | 1       0    | | a11   wT           |.    (28.9)
    | v/a11   In−1 | | 0     A′ − vwT/a11 |

The 0s in the first and second matrices of equation (28.9) are row and

column ( n − 1)-vectors, respectively. The term vw T/ a 11 is an ( n − 1) × ( n

− 1) matrix formed by taking the outer product of v and w and dividing

each element of the result by a 11. Thus it conforms in size to the matrix

A′ from which it is subtracted. The resulting ( n − 1) × ( n − 1) matrix is called the Schur complement of A with respect to a 11.

We claim that if A is nonsingular, then the Schur complement is

nonsingular, too. Why? Suppose that the Schur complement, which is ( n

− 1) × ( n − 1), is singular. Then by Theorem D.1, it has row rank strictly

less than n − 1. Because the bottom n − 1 entries in the first column of the matrix

are all 0, the bottom n − 1 rows of this matrix must have row rank strictly less than n − 1. The row rank of the entire matrix, therefore, is

strictly less than n. Applying Exercise D.2-8 on page 1223 to equation

(28.9), A has rank strictly less than n, and from Theorem D.1, we derive the contradiction that A is singular.

Because the Schur complement is nonsingular, it, too, has an LU

decomposition, which we can find recursively. Let’s say that

A′ − vwT/a11 = L′U′,


where L′ is unit lower-triangular and U′ is upper-triangular. The LU decomposition of A is then A = LU, with

L = | 1       0  |        U = | a11   wT |,
    | v/a11   L′ |,           | 0     U′ |

as shown by

LU = | a11   wT             | = | a11   wT | = A,
     | v     vwT/a11 + L′U′ |   | v     A′ |

since L′U′ = A′ − vwT/a11.

Because L′ is unit lower-triangular, so is L, and because U′ is upper-triangular, so is U.

Of course, if a 11 = 0, this method doesn’t work, because it divides by

0. It also doesn’t work if the upper leftmost entry of the Schur

complement A′ − vw T/ a 11 is 0, since the next step of the recursion will divide by it. The denominators in each step of LU decomposition are

called pivots, and they occupy the diagonal elements of the matrix U.

The permutation matrix P included in LUP decomposition provides a

way to avoid dividing by 0, as we’ll see below. Using permutations to

avoid division by 0 (or by small numbers, which can contribute to

numerical instability), is called pivoting.

An important class of matrices for which LU decomposition always

works correctly is the class of symmetric positive-definite matrices. Such

matrices require no pivoting to avoid dividing by 0 in the recursive

strategy outlined above. We will prove this result, as well as several

others, in Section 28.3.

The pseudocode in the procedure LU-DECOMPOSITION follows

the recursive strategy, except that an iteration loop replaces the

recursion. (This transformation is a standard optimization for a “tail-

recursive” procedure—one whose last operation is a recursive call to

itself. See Problem 7-5 on page 202.) The procedure initializes the

matrix U with 0s below the diagonal and matrix L with 1s on its diagonal and 0s above the diagonal. Each iteration works on a square

submatrix, using its upper leftmost element as the pivot to compute the

v and w vectors and the Schur complement, which becomes the square

submatrix worked on by the next iteration.

LU-DECOMPOSITION(A, n)
1  let L and U be new n × n matrices
2  initialize U with 0s below the diagonal
3  initialize L with 1s on the diagonal and 0s above the diagonal
4  for k = 1 to n
5      ukk = akk
6      for i = k + 1 to n
7          lik = aik/akk        // aik holds vi
8          uki = aki            // aki holds wi
9      for i = k + 1 to n       // compute the Schur complement …
10         for j = k + 1 to n
11             aij = aij − likukj    // … and store it back into A
12 return L and U

Each recursive step in the description above takes place in one

iteration of the outer for loop of lines 4–11. Within this loop, line 5

determines the pivot to be ukk = akk. The for loop in lines 6–8 (which

does not execute when k = n) uses the v and w vectors to update L and U. Line 7 determines the below-diagonal elements of L, storing vi/ akk in lik, and line 8 computes the above-diagonal elements of U, storing wi in uki. Finally, lines 9–11 compute the elements of the Schur

complement and store them back into the matrix A. (There is no need

to divide by akk in line 11 because that already happened when line 7

computed lik.) Because line 11 is triply nested, LU-DECOMPOSITION

runs in Θ( n 3) time.
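The procedure carries over to code almost line for line. Here is a minimal Python sketch (the name `lu_decomposition` and the list-of-lists representation are ours; like the pseudocode, it assumes no zero pivots arise):

```python
def lu_decomposition(A):
    """Gaussian elimination without pivoting on a square matrix A
    (list of lists). Returns (L, U) with A = LU.
    Assumes every pivot encountered is nonzero."""
    n = len(A)
    A = [row[:] for row in A]   # work on a copy; Schur complements go here
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):
        U[k][k] = A[k][k]                 # the pivot
        for i in range(k + 1, n):
            L[i][k] = A[i][k] / A[k][k]   # A[i][k] holds v_i
            U[k][i] = A[k][i]             # A[k][i] holds w_i
        for i in range(k + 1, n):         # Schur complement, stored back in A
            for j in range(k + 1, n):
                A[i][j] -= L[i][k] * U[k][j]
    return L, U
```

The triply nested loop mirrors lines 9–11 of the pseudocode and dominates the Θ(n3) running time.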

Figure 28.1 illustrates the operation of LU-DECOMPOSITION. It

shows a standard optimization of the procedure that stores the


significant elements of L and U in place in the matrix A. Each element aij corresponds to either lij (if i > j) or uij (if ij), so that the matrix A holds both L and U when the procedure terminates. To obtain the pseudocode for this optimization from the pseudocode for the LU-DECOMPOSITION procedure, just replace each reference to l or u by

a. You can verify that this transformation preserves correctness.

Figure 28.1 The operation of LU-DECOMPOSITION. (a) The matrix A. (b) The result of the first iteration of the outer for loop of lines 4–11. The element a 11 = 2 highlighted in blue is the pivot, the tan column is v/ a 11, and the tan row is w T. The elements of U computed thus far are above the horizontal line, and the elements of L are to the left of the vertical line. The Schur complement matrix A′ − vw T/ a 11 occupies the lower right. (c) The result of the next iteration of the outer for loop, on the Schur complement matrix from part (b). The element a 22 = 4

highlighted in blue is the pivot, and the tan column and row are v/ a 22 and w T (in the partitioning of the Schur complement), respectively. Lines divide the matrix into the elements of U computed so far (above), the elements of L computed so far (left), and the new Schur complement (lower right). (d) After the next iteration, the matrix A is factored. The element 3 in the new Schur complement becomes part of U when the recursion terminates. (e) The factorization A = LU.

Computing an LUP decomposition

If the diagonal of the matrix given to LU-DECOMPOSITION contains

any 0s, then the procedure will attempt to divide by 0, which would

cause disaster. Even if the diagonal contains no 0s, but does have

numbers with small absolute values, dividing by such numbers can cause


numerical instabilities. Therefore, LUP decomposition pivots on entries

with the largest absolute values that it can find.

In LUP decomposition, the input is an n × n nonsingular matrix A, with a goal of finding a permutation matrix P, a unit lower-triangular

matrix L, and an upper-triangular matrix U such that PA = LU. Before partitioning the matrix A, as LU decomposition does, LUP

decomposition moves a nonzero element, say ak 1, from somewhere in

the first column to the (1, 1) position of the matrix. For the greatest

numerical stability, LUP decomposition chooses the element in the first

column with the greatest absolute value as ak 1. (The first column

cannot contain only 0s, for then A would be singular, because its

determinant would be 0, by Theorems D.4 and D.5 on page 1221.) In

order to preserve the set of equations, LUP decomposition exchanges

row 1 with row k, which is equivalent to multiplying A by a permutation matrix Q on the left (Exercise D.1-4 on page 1219). Thus, the analog to

equation (28.8) expresses QA as

QA = | ak1   wT |,
     | v     A′ |

where v = (a21, a31, … , an1), except that a11 replaces ak1; wT = (ak2, ak3, … , akn); and A′ is an (n − 1) × (n − 1) matrix. Since ak1 ≠ 0, the analog to equation (28.9) guarantees no division by 0:

QA = | 1       0    | | ak1   wT           |.
     | v/ak1   In−1 | | 0     A′ − vwT/ak1 |

Just as in LU decomposition, if A is nonsingular, then the Schur

complement A′ − vw T/ ak 1 is nonsingular, too. Therefore, you can recursively find an LUP decomposition for it, with unit lower-triangular

matrix L′, upper-triangular matrix U′, and permutation matrix P′, such that

P′(A′ − vwT/ak1) = L′U′.


Define

P = | 1   0  | Q,
    | 0   P′ |

which is a permutation matrix, since it is the product of two permutation matrices (Exercise D.1-4 on page 1219). This definition of P gives

PA = | 1   0  | QA = | 1          0  | | ak1   wT |,
     | 0   P′ |      | P′v/ak1   L′ | | 0     U′ |

which yields the LUP decomposition PA = LU, where L is the first matrix and U the second in the rightmost product. Because L′ is unit lower-triangular, so is L, and because U′ is upper-triangular, so is U.

Notice that in this derivation, unlike the one for LU decomposition,

both the column vector v/ ak 1 and the Schur complement A′ − vw T/ ak 1

are multiplied by the permutation matrix P′. The procedure LUP-

DECOMPOSITION gives the pseudocode for LUP decomposition.

LUP-DECOMPOSITION(A, n)
1  let π[1 : n] be a new array
2  for i = 1 to n
3      π[i] = i                 // initialize π to the identity permutation
4  for k = 1 to n
5      p = 0
6      for i = k to n           // find largest absolute value in column k
7          if |aik| > p
8              p = |aik|
9              k′ = i           // row number of the largest found so far
10     if p == 0
11         error “singular matrix”
12     exchange π[k] with π[k′]
13     for i = 1 to n           // exchange rows k and k′
14         exchange aki with ak′i
15     for i = k + 1 to n
16         aik = aik/akk
17         for j = k + 1 to n
18             aij = aij − aikakj    // compute L and U in place in A

Like LU-DECOMPOSITION, the LUP-DECOMPOSITION

procedure replaces the recursion with an iteration loop. As an

improvement over a direct implementation of the recursion, the

procedure dynamically maintains the permutation matrix P as an array

π, where π[ i] = j means that the i th row of P contains a 1 in column j.

The LUP-DECOMPOSITION procedure also implements the

improvement mentioned earlier, computing L and U in place in the matrix A. Thus, when the procedure terminates, aij = lij if i > j, and aij = uij if i ≤ j.

Figure 28.2 illustrates how LUP-DECOMPOSITION factors a

matrix. Lines 2–3 initialize the array π to represent the identity

permutation. The outer for loop of lines 4–18 implements the recursion,

finding an LUP decomposition of the ( nk + 1) × ( nk + 1) submatrix whose upper left is in row k and column k. Each time through the outer loop, lines 5–9 determine the element ak′k with the

largest absolute value of those in the current first column (column k) of

the ( nk + 1) × ( nk + 1) submatrix that the procedure is currently working on. If all elements in the current first column are 0, lines 10–11


report that the matrix is singular. To pivot, line 12 exchanges π[ k′] with π[ k], and lines 13–14 exchange the k th and k′th rows of A, thereby making the pivot element akk. (The entire rows are swapped because in

the derivation of the method above, not only is A′ − vw T/ ak 1 multiplied by P′, but so is v/ ak 1.) Finally, the Schur complement is computed by lines 15–18 in much the same way as it is computed by lines 6–11 of LU-DECOMPOSITION, except that here the operation is written to work

in place.
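In code, partial pivoting adds only the row-selection and swap steps to the elimination loop. The following is an illustrative Python sketch of the in-place procedure (the name `lup_decomposition` and the 0-based list-of-lists representation are our own):

```python
def lup_decomposition(A):
    """LU decomposition with partial pivoting, in place.
    Returns (M, pi): M holds L strictly below the diagonal and U on
    and above it; pi represents the permutation matrix P (0-indexed)."""
    n = len(A)
    M = [row[:] for row in A]       # work on a copy
    pi = list(range(n))
    for k in range(n):
        # choose the row with the largest |M[i][k]| among i >= k
        k2 = max(range(k, n), key=lambda i: abs(M[i][k]))
        if M[k2][k] == 0:
            raise ValueError("singular matrix")
        pi[k], pi[k2] = pi[k2], pi[k]   # record the pivot in pi …
        M[k], M[k2] = M[k2], M[k]       # … and swap entire rows
        for i in range(k + 1, n):
            M[i][k] /= M[k][k]          # below-diagonal entry of L
            for j in range(k + 1, n):   # Schur complement update
                M[i][j] -= M[i][k] * M[k][j]
    return M, pi
```

As in the text, the whole rows are swapped (not just the trailing submatrix), so the already-computed columns of L are permuted along with the Schur complement.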

Figure 28.2 The operation of LUP-DECOMPOSITION. (a) The input matrix A with the identity permutation of the rows in yellow on the left. The first step of the algorithm determines that the element 5 highlighted in blue in the third row is the pivot for the first column. (b) Rows 1 and 3 are swapped and the permutation is updated. The tan column and row represent v and w T. (c) The vector v is replaced by v/5, and the lower right of the matrix is updated with the Schur complement. Lines divide the matrix into three regions: elements of U (above), elements of L (left), and elements of the Schur complement (lower right). (d)–(f) The second step. (g)–(i) The third step. No further changes occur on the fourth (final) step. (j) The LUP decomposition PA = LU.


Because of its triply nested loop structure, LUP-

DECOMPOSITION has a running time of Θ( n 3), which is the same as

that of LU-DECOMPOSITION. Thus, pivoting costs at most a

constant factor in time.

Exercises

28.1-1

Solve the equation

by using forward substitution.

28.1-2

Find an LU decomposition of the matrix

28.1-3

Solve the equation

by using an LUP decomposition.

28.1-4

Describe the LUP decomposition of a diagonal matrix.

28.1-5

Describe the LUP decomposition of a permutation matrix, and prove

that it is unique.

28.1-6

Show that for all n ≥ 1, there exists a singular n × n matrix that has an LU decomposition.


28.1-7

In LU-DECOMPOSITION, is it necessary to perform the outermost

for loop iteration when k = n? How about in LUP-

DECOMPOSITION?

28.2 Inverting matrices

Although you can use equation (28.3) to solve a system of linear

equations by computing a matrix inverse, in practice you are better off

using more numerically stable techniques, such as LUP decomposition.

Sometimes, however, you really do need to compute a matrix inverse.

This section shows how to use LUP decomposition to compute a matrix

inverse. It also proves that matrix multiplication and computing the

inverse of a matrix are equivalently hard problems, in that (subject to

technical conditions) an algorithm for one can solve the other in the

same asymptotic running time. Thus, you can use Strassen’s algorithm

(see Section 4.2) for matrix multiplication to invert a matrix. Indeed, Strassen’s original paper was motivated by the idea that a set of linear equations could be solved more quickly than by the usual method.

Computing a matrix inverse from an LUP decomposition

Suppose that you have an LUP decomposition of a matrix A in the form

of three matrices L, U, and P such that PA = LU. Using LUP-SOLVE, you can solve an equation of the form Ax = b in Θ( n 2) time. Since the LUP decomposition depends on A but not b, you can run LUP-SOLVE

on a second set of equations of the form Ax = b′ in Θ( n 2) additional time. In general, once you have the LUP decomposition of A, you can

solve, in Θ( kn 2) time, k versions of the equation Ax = b that differ only in the vector b.

Let’s think of the equation

AX = In,    (28.11)

which defines the matrix X, the inverse of A, as a set of n distinct equations of the form Ax = b. To be precise, let Xi denote the ith


column of X, and recall that the unit vector ei is the i th column of In.

You can then solve equation (28.11) for X by using the LUP

decomposition for A to solve each equation

AXi = ei

separately for Xi. Once you have the LUP decomposition, you can

compute each of the n columns Xi in Θ( n 2) time, and so you can compute X from the LUP decomposition of A in Θ( n 3) time. Since you find the LUP decomposition of A in Θ( n 3) time, you can compute the

inverse A−1 of a matrix A in Θ( n 3) time.
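The scheme — one LUP decomposition, then one pair of triangular solves per column of In — can be sketched as follows (a self-contained illustrative Python version; the name `inverse` and the list-of-lists representation are ours, and nonsingularity of A is assumed):

```python
def inverse(A):
    """Invert nonsingular A: one in-place LUP decomposition (Theta(n^3)),
    then n forward/back substitutions (Theta(n^2) each)."""
    n = len(A)
    M = [row[:] for row in A]
    pi = list(range(n))
    for k in range(n):                      # LUP decomposition in place
        k2 = max(range(k, n), key=lambda i: abs(M[i][k]))
        pi[k], pi[k2] = pi[k2], pi[k]
        M[k], M[k2] = M[k2], M[k]
        for i in range(k + 1, n):
            M[i][k] /= M[k][k]
            for j in range(k + 1, n):
                M[i][j] -= M[i][k] * M[k][j]
    X = [[0.0] * n for _ in range(n)]
    for c in range(n):                      # solve A X_c = e_c per column
        e = [1.0 if r == c else 0.0 for r in range(n)]
        y = [0.0] * n
        for i in range(n):                  # forward substitution
            y[i] = e[pi[i]] - sum(M[i][j] * y[j] for j in range(i))
        for i in range(n - 1, -1, -1):      # back substitution
            X[i][c] = (y[i] - sum(M[i][j] * X[j][c]
                                  for j in range(i + 1, n))) / M[i][i]
    return X
```

The decomposition is computed once and reused for all n right-hand sides, which is exactly why the total cost stays Θ(n3) rather than Θ(n4).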

Matrix multiplication and matrix inversion

Now let’s see how the theoretical speedups obtained for matrix

multiplication translate to speedups for matrix inversion. In fact, we’ll

prove something stronger: matrix inversion is equivalent to matrix

multiplication, in the following sense. If M( n) denotes the time to multiply two n × n matrices, then a nonsingular n × n matrix can be inverted in O( M( n)) time. Moreover, if I( n) denotes the time to invert a nonsingular n × n matrix, then two n × n matrices can be multiplied in O( I( n)) time. We prove these results as two separate theorems.

Theorem 28.1 (Multiplication is no harder than inversion)

If an n × n matrix can be inverted in I( n) time, where I( n) = Ω( n 2) and I( n) satisfies the regularity condition I(3 n) = O( I( n)), then two n × n matrices can be multiplied in O( I( n)) time.

Proof Let A and B be n × n matrices. To compute their product C =

AB, define the 3n × 3n matrix D by

D = | In   A    0  |.
    | 0    In   B  |
    | 0    0    In |

The inverse of D is

D−1 = | In   −A   AB |,
      | 0    In   −B |
      | 0    0    In |

and thus to compute the product AB, just take the upper right n × n submatrix of D−1.

Constructing matrix D takes Θ( n 2) time, which is O( I( n)) from the assumption that I( n) = Ω( n 2), and inverting D takes O( I(3 n)) = O( I( n)) time, by the regularity condition on I( n). We thus have M( n) = O( I( n)).

Note that I( n) satisfies the regularity condition whenever I( n) = Θ( nc lg dn) for any constants c > 0 and d ≥ 0.
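The construction in the proof can be exercised concretely. Below is a toy Python demonstration: `multiply_by_inverting` builds the 3n × 3n gadget D and reads AB off the upper-right block of D−1 (the helper `gauss_jordan_inverse` is our own stand-in for the hypothetical I(n)-time inverter; both function names are ours):

```python
def gauss_jordan_inverse(M):
    """Stand-in inverter: Gauss-Jordan with partial pivoting."""
    n = len(M)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[p] = aug[p], aug[k]
        piv = aug[k][k]
        aug[k] = [v / piv for v in aug[k]]
        for i in range(n):
            if i != k and aug[i][k]:
                f = aug[i][k]
                aug[i] = [a - f * b for a, b in zip(aug[i], aug[k])]
    return [row[n:] for row in aug]

def multiply_by_inverting(A, B):
    """Theorem 28.1: embed A and B in D = [[I,A,0],[0,I,B],[0,0,I]]
    and read AB off the upper-right n x n block of D^-1."""
    n = len(A)
    D = [[0.0] * (3 * n) for _ in range(3 * n)]
    def block(i, j, M):          # place M at block position (i, j) of D
        for r in range(n):
            for c in range(n):
                D[i * n + r][j * n + c] = M[r][c]
    I = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    block(0, 0, I); block(1, 1, I); block(2, 2, I)
    block(0, 1, A); block(1, 2, B)
    Dinv = gauss_jordan_inverse(D)
    return [[Dinv[r][2 * n + c] for c in range(n)] for r in range(n)]
```

Of course this particular inverter runs in Θ(n3) time, so nothing is gained here; the point of the theorem is that a *faster* inverter would transfer its speed to multiplication.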

The proof that matrix inversion is no harder than matrix

multiplication relies on some properties of symmetric positive-definite

matrices proved in Section 28.3.

Theorem 28.2 (Inversion is no harder than multiplication)

Suppose that two n × n real matrices can be multiplied in M( n) time, where M( n) = Ω( n 2) and M( n) satisfies the following two regularity conditions:

1. M(n + k) = O(M(n)) for any k in the range 0 ≤ k < n, and
2. M(n/2) ≤ cM(n) for some constant c < 1/2.

Then the inverse of any real nonsingular n× n matrix can be computed in

O( M( n)) time.

Proof Let A be an n × n matrix with real-valued entries that is nonsingular. Assume that n is an exact power of 2 (i.e., n = 2^l for some integer l); we’ll see at the end of the proof what to do if n is not an exact power of 2.

For the moment, assume that the n × n matrix A is symmetric and positive-definite. Partition each of A and its inverse A−1 into four n/2 ×

n/2 submatrices:

A = | B   CT |    and    A−1 = | R   T |.
    | C   D  |                 | U   V |

Then, if we let

S = D − CB−1CT

be the Schur complement of A with respect to B (we’ll see more about this form of Schur complement in Section 28.3), we have

R = B−1 + B−1CTS−1CB−1,
T = −B−1CTS−1,
U = −S−1CB−1,
V = S−1,

since AA−1 = In, as you can verify by performing the matrix

multiplication. Because A is symmetric and positive-definite, Lemmas

28.4 and 28.5 in Section 28.3 imply that B and S are both symmetric and positive-definite. By Lemma 28.3 in Section 28.3, therefore, the inverses B−1 and S−1 exist, and by Exercise D.2-6 on page 1223, B−1

and S−1 are symmetric, so that ( B−1)T = B−1 and ( S−1)T = S−1.

Therefore, to compute the submatrices

R = B−1 + B−1 C T S−1 CB−1,

T = − B−1 C T S−1,

U = − S−1 CB−1, and

V = S−1

of A−1, do the following, where all matrices mentioned are n/2 × n/2:

1. Form the submatrices B, C, CT, and D of A.

2. Recursively compute the inverse B−1 of B.

3. Compute the matrix product W = CB−1, and then compute its

transpose W T, which equals B−1 C T (by Exercise D.1-2 on page

1219 and ( B−1)T = B−1).


4. Compute the matrix product X = WCT, which equals CB−1CT, and then compute the matrix S = D − X = D − CB−1CT.

5. Recursively compute the inverse S−1 of S.

6. Compute the matrix product Y = S−1 W, which equals

S−1 CB−1, and then compute its transpose Y T, which equals B−1 C T S−1 (by Exercise D.1-2, ( B−1)T = B−1, and ( S−1)T =

S−1).

7. Compute the matrix product Z = W T Y, which equals

B−1 C T S−1 CB−1.

8. Set R = B−1 + Z.

9. Set T = − Y T.

10. Set U = − Y.

11. Set V = S−1.
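The eleven steps translate into a short recursive routine. The following is a minimal pure-Python sketch (the name `spd_inverse`, the `mm` helper, and the list-of-lists representation are ours; as in the proof, the matrix is assumed symmetric positive-definite with size a power of 2):

```python
def spd_inverse(A):
    """Divide-and-conquer inverse of a symmetric positive-definite
    matrix A whose size is a power of 2, following steps 1-11."""
    n = len(A)
    if n == 1:
        return [[1.0 / A[0][0]]]
    h = n // 2
    B  = [row[:h] for row in A[:h]]             # step 1: the four blocks
    CT = [row[h:] for row in A[:h]]
    C  = [row[:h] for row in A[h:]]
    D  = [row[h:] for row in A[h:]]
    def mm(X, Y):                               # square matrix product
        return [[sum(x * y for x, y in zip(r, c))
                 for c in zip(*Y)] for r in X]
    Binv = spd_inverse(B)                       # step 2
    W = mm(C, Binv)                             # step 3: W = C B^-1
    WT = [list(r) for r in zip(*W)]             #         W^T = B^-1 C^T
    X = mm(W, CT)                               # step 4: X = C B^-1 C^T
    S = [[d - x for d, x in zip(dr, xr)] for dr, xr in zip(D, X)]
    Sinv = spd_inverse(S)                       # step 5
    Y = mm(Sinv, W)                             # step 6: Y = S^-1 C B^-1
    YT = [list(r) for r in zip(*Y)]
    Z = mm(WT, Y)                               # step 7: B^-1 C^T S^-1 C B^-1
    R = [[b + z for b, z in zip(br, zr)]        # step 8
         for br, zr in zip(Binv, Z)]
    T = [[-v for v in row] for row in YT]       # step 9
    U = [[-v for v in row] for row in Y]        # step 10; step 11: V = Sinv
    return ([r + t for r, t in zip(R, T)] +     # assemble A^-1 from R,T,U,V
            [u + v for u, v in zip(U, Sinv)])
```

With the naive `mm` helper this runs in Θ(n3) time, but substituting a subcubic multiplication routine gives the O(M(n)) bound of the theorem.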

Thus, to invert an n× n symmetric positive-definite matrix, invert two

n/2× n/2 matrices in steps 2 and 5; perform four multiplications of n/2 ×

n/2 matrices in steps 3, 4, 6, and 7; plus incur an additional cost of O( n 2) for extracting submatrices from A, inserting submatrices into A−1, and performing a constant number of additions, subtractions, and

transposes on n/2 × n/2 matrices. The running time is given by the recurrence

I(n) ≤ 2I(n/2) + 4M(n/2) + O(n2)
     = 2I(n/2) + O(M(n)).    (28.15)

The second line follows from the assumption that M( n) = Ω( n 2) and from the second regularity condition in the statement of the theorem,

which implies that 4 M( n/2) < 2 M( n). Because M( n) = Ω( n 2), case 3 of


the master theorem (Theorem 4.1) applies to the recurrence (28.15),

giving the O( M( n)) result.

It remains to show how to obtain the same asymptotic running time for matrix inversion as for matrix multiplication when A is invertible but not symmetric and positive-definite. The basic idea is that for any nonsingular matrix A, the matrix AᵀA is symmetric (by Exercise D.1-2) and positive-definite (by Theorem D.6 on page 1222). The trick, then, is to reduce the problem of inverting A to the problem of inverting AᵀA.

The reduction is based on the observation that when A is an n × n nonsingular matrix, we have

A⁻¹ = (AᵀA)⁻¹Aᵀ,

since ((AᵀA)⁻¹Aᵀ)A = (AᵀA)⁻¹(AᵀA) = Iₙ and a matrix inverse is unique. Therefore, to compute A⁻¹, first multiply Aᵀ by A to obtain AᵀA, then invert the symmetric positive-definite matrix AᵀA using the above divide-and-conquer algorithm, and finally multiply the result by Aᵀ. Each of these three steps takes O(M(n)) time, and thus any nonsingular matrix with real entries can be inverted in O(M(n)) time.
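This identity is easy to check numerically. A quick NumPy sketch, using NumPy's general-purpose inverse in place of the divide-and-conquer routine:

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [2.0, 3.0]])                 # nonsingular, but not symmetric
A_inv = np.linalg.inv(A.T @ A) @ A.T       # A^-1 = (A^T A)^-1 A^T
assert np.allclose(A_inv @ A, np.eye(2))   # it really is the inverse of A
```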

The above proof assumed that A is an n × n matrix, where n is an exact power of 2. If n is not an exact power of 2, then let k < n be such that n + k is an exact power of 2, and define the (n + k) × (n + k) matrix

A′ = ( A   0  )
     ( 0   Iₖ ),

where Iₖ denotes the k × k identity matrix. Then the inverse of A′ is

A′⁻¹ = ( A⁻¹   0  )
       ( 0     Iₖ ).

Apply the method of the proof to A′ to compute the inverse of A′, and take the first n rows and n columns of the result as the desired answer A⁻¹. The first regularity condition on M(n) ensures that enlarging the matrix in this way increases the running time by at most a constant factor.
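The padding trick itself is a few lines of code. A NumPy sketch (the name pad_to_pow2 is invented here for illustration):

```python
import numpy as np

def pad_to_pow2(A):
    """Embed A in A' = [[A, 0], [0, I_k]], where n + k is the next
    exact power of 2, so the power-of-2 algorithm applies to A'."""
    n = A.shape[0]
    size = 1 << (n - 1).bit_length()   # smallest power of 2 that is >= n
    A_pad = np.eye(size)               # identity supplies the I_k block
    A_pad[:n, :n] = A                  # off-diagonal blocks remain 0
    return A_pad

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])        # 3x3, gets padded to 4x4
A_pad = pad_to_pow2(A)
# The leading 3x3 block of inv(A_pad) equals inv(A).
```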

The proof of Theorem 28.2 suggests how to solve the equation Ax = b by using LU decomposition without pivoting, so long as A is nonsingular. Multiply both sides of the equation Ax = b by Aᵀ, yielding (AᵀA)x = Aᵀb, and let y = Aᵀb. This transformation doesn't affect the solution x, since Aᵀ is invertible. Because AᵀA is symmetric positive-definite, it can be factored by computing an LU decomposition. Then, use forward and back substitution to solve for x in the equation (AᵀA)x = y. Although this method is theoretically correct, in practice the procedure LUP-DECOMPOSITION works much better: LUP decomposition requires fewer arithmetic operations by a constant factor, and it has somewhat better numerical properties.
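A sketch of this solution method in NumPy, using a Cholesky factorization (a symmetric variant of LU decomposition, which likewise needs no pivoting on a symmetric positive-definite matrix):

```python
import numpy as np

def solve_via_normal_form(A, b):
    """Solve Ax = b for nonsingular A by solving (A^T A) x = A^T b."""
    y = A.T @ b
    L = np.linalg.cholesky(A.T @ A)    # A^T A = L L^T, no pivoting needed
    z = np.linalg.solve(L, y)          # forward substitution: L z = y
    return np.linalg.solve(L.T, z)     # back substitution: L^T x = z

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = solve_via_normal_form(A, b)
# x ≈ [0.8, 1.4], since A @ [0.8, 1.4] = [3, 5]
```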

Exercises

28.2-1

Let M(n) be the time to multiply two n × n matrices, and let S(n) denote the time required to square an n × n matrix. Show that multiplying and squaring matrices have essentially the same difficulty: an M(n)-time matrix-multiplication algorithm implies an O(M(n))-time squaring algorithm, and an S(n)-time squaring algorithm implies an O(S(n))-time matrix-multiplication algorithm.

28.2-2

Let M(n) be the time to multiply two n × n matrices. Show that an M(n)-time matrix-multiplication algorithm implies an O(M(n))-time LUP-decomposition algorithm. (The LUP decomposition your method produces need not be the same as the result produced by the LUP-DECOMPOSITION procedure.)

28.2-3

Let M(n) be the time to multiply two n × n boolean matrices, and let T(n) be the time to find the transitive closure of an n × n boolean matrix. (See Section 23.2.) Show that an M(n)-time boolean matrix-multiplication algorithm implies an O(M(n) lg n)-time transitive-closure algorithm, and a T(n)-time transitive-closure algorithm implies an O(T(n))-time boolean matrix-multiplication algorithm.

28.2-4

Does the matrix-inversion algorithm based on Theorem 28.2 work when

matrix elements are drawn from the field of integers modulo 2? Explain.

28.2-5

Generalize the matrix-inversion algorithm of Theorem 28.2 to handle matrices of complex numbers, and prove that your generalization works correctly. (Hint: Instead of the transpose of A, use the conjugate transpose A*, which you obtain from the transpose of A by replacing every entry with its complex conjugate. Instead of symmetric matrices, consider Hermitian matrices, which are matrices A such that A = A*.)

28.3 Symmetric positive-definite matrices and least-squares

approximation

Symmetric positive-definite matrices have many interesting and desirable properties. An n × n matrix A is symmetric positive-definite if A = Aᵀ (A is symmetric) and xᵀAx > 0 for all n-vectors x ≠ 0 (A is positive-definite). Symmetric positive-definite matrices are nonsingular, and an LU decomposition on them will not divide by 0. This section proves these and several other important properties of symmetric positive-definite matrices. We'll also see an interesting application to curve fitting by a least-squares approximation.
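In code, the definition can be checked with a Cholesky factorization, which succeeds on a symmetric matrix exactly when the matrix is positive-definite (a NumPy sketch; the function name is ours, not the book's):

```python
import numpy as np

def is_symmetric_positive_definite(A):
    """Check A = A^T and x^T A x > 0 for all x != 0 (via Cholesky)."""
    if not np.allclose(A, A.T):
        return False
    try:
        np.linalg.cholesky(A)      # succeeds iff A is positive-definite
        return True
    except np.linalg.LinAlgError:
        return False

assert is_symmetric_positive_definite(np.array([[2.0, 1.0], [1.0, 2.0]]))
assert not is_symmetric_positive_definite(np.array([[1.0, 2.0], [2.0, 1.0]]))
```

The second matrix is symmetric but has eigenvalues 3 and −1, so it is not positive-definite.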

The first property we prove is perhaps the most basic.

Lemma 28.3

Any positive-definite matrix is nonsingular.

Proof Suppose that a matrix A is singular. Then by Corollary D.3 on page 1221, there exists a nonzero vector x such that Ax = 0. Hence, xᵀAx = 0, and A cannot be positive-definite.


The proof that an LU decomposition on a symmetric positive-definite matrix A won't divide by 0 is more involved. We begin by proving properties about certain submatrices of A. Define the kth leading submatrix of A to be the matrix Aₖ consisting of the intersection of the first k rows and first k columns of A.

Lemma 28.4

If A is a symmetric positive-definite matrix, then every leading

submatrix of A is symmetric and positive-definite.

Proof Since A is symmetric, each leading submatrix Aₖ is also symmetric. We'll prove that Aₖ is positive-definite by contradiction. If Aₖ is not positive-definite, then there exists a k-vector xₖ ≠ 0 such that xₖᵀAₖxₖ ≤ 0. Let A be n × n, and partition it as

A = ( Aₖ  Bᵀ )
    ( B   C  )    (28.16)

for submatrices B (which is (n − k) × k) and C (which is (n − k) × (n − k)). Define the n-vector x = ( xₖᵀ  0 )ᵀ, where n − k 0s follow xₖ. Then we have

xᵀAx = xₖᵀAₖxₖ ≤ 0,

which contradicts A being positive-definite.

We now turn to some essential properties of the Schur complement.

Let A be a symmetric positive-definite matrix, and let Aₖ be a leading k × k submatrix of A. Partition A once again according to equation (28.16). Equation (28.10) generalizes to define the Schur complement S of A with respect to Aₖ as

S = C − BAₖ⁻¹Bᵀ.    (28.17)

(By Lemma 28.4, Aₖ is symmetric and positive-definite, and therefore, Aₖ⁻¹ exists by Lemma 28.3, and S is well defined.) The earlier definition (28.10) of the Schur complement is consistent with equation (28.17) by letting k = 1.

The next lemma shows that the Schur-complement matrices of

symmetric positive-definite matrices are themselves symmetric and

positive-definite. We used this result in Theorem 28.2, and its corollary

will help prove that LU decomposition works for symmetric positive-

definite matrices.

Lemma 28.5 (Schur complement lemma)

If A is a symmetric positive-definite matrix and Aₖ is a leading k × k submatrix of A, then the Schur complement S of A with respect to Aₖ is symmetric and positive-definite.

Proof Because A is symmetric, so is the submatrix C. By Exercise D.2-6 on page 1223, the product BAₖ⁻¹Bᵀ is symmetric. Since C and BAₖ⁻¹Bᵀ are symmetric, then by Exercise D.1-1 on page 1219, so is S.

It remains to show that S is positive-definite. Consider the partition of A given in equation (28.16). For any nonzero vector x, we have xᵀAx > 0 by the assumption that A is positive-definite. Let the subvectors y and z consist of the first k and last n − k elements of x, respectively, so that they are compatible with Aₖ and C, respectively. Because Aₖ⁻¹ exists, we have

xᵀAx = yᵀAₖy + 2zᵀBy + zᵀCz
     = (y + Aₖ⁻¹Bᵀz)ᵀ Aₖ (y + Aₖ⁻¹Bᵀz) + zᵀ(C − BAₖ⁻¹Bᵀ)z.    (28.18)

This last equation, which you can verify by multiplying through, amounts to “completing the square” of the quadratic form. (See Exercise 28.3-2.)

Since xᵀAx > 0 holds for any nonzero x, pick any nonzero z and then choose y = −Aₖ⁻¹Bᵀz, which causes the first term in equation (28.18) to vanish, leaving

zᵀ(C − BAₖ⁻¹Bᵀ)z = zᵀSz

as the value of the expression. For any z ≠ 0, we therefore have zᵀSz = xᵀAx > 0, and thus S is positive-definite.
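Lemma 28.5 is easy to spot-check numerically on an example. A NumPy sketch, using an arbitrary symmetric positive-definite matrix of our own choosing:

```python
import numpy as np

A = np.array([[4.0, 1.0, 1.0, 0.0],
              [1.0, 3.0, 0.0, 1.0],
              [1.0, 0.0, 3.0, 1.0],
              [0.0, 1.0, 1.0, 3.0]])       # symmetric positive-definite
k = 2
Ak, B, C = A[:k, :k], A[k:, :k], A[k:, k:]  # partition as in (28.16)
S = C - B @ np.linalg.inv(Ak) @ B.T          # Schur complement (28.17)
assert np.allclose(S, S.T)                   # S is symmetric
assert np.all(np.linalg.eigvalsh(S) > 0)     # S is positive-definite
```

The eigenvalue check is a standard equivalent characterization: a symmetric matrix is positive-definite exactly when all of its eigenvalues are positive.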

Corollary 28.6

LU decomposition of a symmetric positive-definite matrix never causes

a division by 0.

Proof Let A be an n × n symmetric positive-definite matrix. In fact, we'll prove a stronger result than the statement of the corollary: every pivot is strictly positive. The first pivot is a₁₁. Let e₁ be the length-n unit vector ( 1 0 0 ⋯ 0 )ᵀ, so that a₁₁ = e₁ᵀAe₁, which is positive because e₁ is nonzero and A is positive-definite. Since the first step of LU decomposition produces the Schur complement of A with respect to A₁ = (a₁₁), Lemma 28.5 implies by induction that all pivots are positive.
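A sketch of LU decomposition without pivoting in its outer-product (Schur-complement) form, asserting Corollary 28.6's guarantee as it goes; NumPy is assumed, and the function name is ours:

```python
import numpy as np

def lu_no_pivot(A):
    """LU decomposition without pivoting. Each iteration peels off one
    pivot and replaces the trailing block by its Schur complement."""
    U = A.astype(float).copy()
    n = U.shape[0]
    L = np.eye(n)
    for j in range(n):
        assert U[j, j] > 0, "pivot not positive"   # Corollary 28.6 for SPD A
        L[j+1:, j] = U[j+1:, j] / U[j, j]
        U[j+1:, j:] -= np.outer(L[j+1:, j], U[j, j:])  # Schur-complement step
    return L, np.triu(U)

A = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])       # symmetric positive-definite
L, U = lu_no_pivot(A)
assert np.allclose(L @ U, A)
```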

Least-squares approximation

One important application of symmetric positive-definite matrices arises in fitting curves to given sets of data points. You are given a set of m data points

(x₁, y₁), (x₂, y₂), … , (xₘ, yₘ),

where you know that the yᵢ are subject to measurement errors. You wish to determine a function F(x) such that the approximation errors

ηᵢ = F(xᵢ) − yᵢ

are small for i = 1, 2, … , m. The form of the function F depends on the problem at hand. Here, let's assume that it has the form of a linearly weighted sum

F(x) = Σⱼ₌₁ⁿ cⱼ fⱼ(x),    (28.19)

where the number n of summands and the specific basis functions fⱼ are chosen based on knowledge of the problem at hand. A common choice is fⱼ(x) = xʲ⁻¹, which means that

F(x) = c₁ + c₂x + c₃x² + ⋯ + cₙxⁿ⁻¹

is a polynomial of degree n − 1 in x. Thus, if you are given m data points (x₁, y₁), (x₂, y₂), … , (xₘ, yₘ), you need to calculate n coefficients c₁, c₂, … , cₙ that minimize the approximation errors η₁, η₂, … , ηₘ.

By choosing n = m, you can calculate each yᵢ exactly in equation (28.19). Such a high-degree polynomial F “fits the noise” as well as the data, however, and generally gives poor results when used to predict y for previously unseen values of x. It is usually better to choose n significantly smaller than m and hope that by choosing the coefficients cⱼ well, you can obtain a function F that finds the significant patterns in the data points without paying undue attention to the noise. Some theoretical principles exist for choosing n, but they are beyond the scope of this text. In any case, once you choose a value of n that is less than m, you end up with an overdetermined set of equations whose solution you wish to approximate. Let's see how to do so.

Let

A = ( f₁(x₁)  f₂(x₁)  ⋯  fₙ(x₁) )
    ( f₁(x₂)  f₂(x₂)  ⋯  fₙ(x₂) )
    (   ⋮       ⋮      ⋱    ⋮    )
    ( f₁(xₘ)  f₂(xₘ)  ⋯  fₙ(xₘ) )

denote the matrix of values of the basis functions at the given points, that is, aᵢⱼ = fⱼ(xᵢ). Let c = (cₖ) denote the desired n-vector of coefficients. Then

Ac = ( F(x₁)  F(x₂)  ⋯  F(xₘ) )ᵀ

is the m-vector of “predicted values” for y. Thus,

η = Ac − y

is the m-vector of approximation errors.

To minimize approximation errors, let's minimize the norm of the error vector η, which gives a least-squares solution, since

∥η∥ = ( Σᵢ₌₁ᵐ ηᵢ² )^(1/2).

Because

∥η∥² = ∥Ac − y∥² = Σᵢ₌₁ᵐ ( Σⱼ₌₁ⁿ aᵢⱼcⱼ − yᵢ )²,

to minimize ∥η∥, differentiate ∥η∥² with respect to each cₖ and then set the result to 0:

Σᵢ₌₁ᵐ 2 ( Σⱼ₌₁ⁿ aᵢⱼcⱼ − yᵢ ) aᵢₖ = 0.    (28.20)

The n equations (28.20) for k = 1, 2, … , n are equivalent to the single matrix equation

(Ac − y)ᵀA = 0

or, equivalently (using Exercise D.1-2 on page 1219), to

Aᵀ(Ac − y) = 0,

which implies

AᵀAc = Aᵀy.    (28.21)

In statistics, equation (28.21) is called the normal equation. The matrix AᵀA is symmetric by Exercise D.1-2, and if A has full column rank, then by Theorem D.6 on page 1222, AᵀA is positive-definite as well. Hence, (AᵀA)⁻¹ exists, and the solution to equation (28.21) is

c = ((AᵀA)⁻¹Aᵀ) y = A⁺y,    (28.22)

where the matrix A⁺ = (AᵀA)⁻¹Aᵀ is the pseudoinverse of the matrix A. The pseudoinverse naturally generalizes the notion of a matrix inverse to the case in which A is not square. (Compare equation (28.22) as the approximate solution to Ac = y with the exact solution A⁻¹b to Ax = b.)

As an example of producing a least-squares fit, suppose that you have five data points

(x₁, y₁) = (−1, 2),
(x₂, y₂) = (1, 1),
(x₃, y₃) = (2, 1),
(x₄, y₄) = (3, 0),
(x₅, y₅) = (5, 3),

shown as orange dots in Figure 28.3, and you want to fit these points with a quadratic polynomial

F(x) = c₁ + c₂x + c₃x².

Start with the matrix of basis-function values

A = ( 1  −1   1 )
    ( 1   1   1 )
    ( 1   2   4 )
    ( 1   3   9 )
    ( 1   5  25 ),

whose entries are aᵢⱼ = fⱼ(xᵢ) = xᵢʲ⁻¹, and compute its pseudoinverse A⁺ = (AᵀA)⁻¹Aᵀ.


Figure 28.3 The least-squares fit of a quadratic polynomial to the set of five data points {(−1, 2), (1, 1), (2, 1), (3, 0), (5, 3)}. The orange dots are the data points, and the blue dots are their estimated values predicted by the polynomial F(x) = 1.2 − 0.757x + 0.214x², the quadratic polynomial that minimizes the sum of the squared errors, plotted in blue. Each orange line shows the error for one data point.

Multiplying y by A⁺ gives the coefficient vector

c = ( 1.200  −0.757  0.214 )ᵀ,

which corresponds to the quadratic polynomial

F(x) = 1.200 − 0.757x + 0.214x²

as the closest-fitting quadratic to the given data, in a least-squares sense.
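The example can be reproduced in a few lines of NumPy, solving the normal equation (28.21) directly rather than forming A⁺ explicitly:

```python
import numpy as np

x = np.array([-1.0, 1.0, 2.0, 3.0, 5.0])
y = np.array([2.0, 1.0, 1.0, 0.0, 3.0])
A = np.vander(x, 3, increasing=True)        # columns: 1, x, x^2
c = np.linalg.solve(A.T @ A, A.T @ y)       # normal equation: A^T A c = A^T y
# c ≈ [1.200, -0.757, 0.214]
```

In exact arithmetic, c = (6/5, −53/70, 3/14)ᵀ, matching the rounded coefficients above.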

As a practical matter, you would typically solve the normal equation (28.21) by multiplying y by Aᵀ and then finding an LU decomposition of AᵀA. If A has full rank, the matrix AᵀA is guaranteed to be nonsingular, because it is symmetric and positive-definite. (See Exercise D.1-2 and Theorem D.6.)


Figure 28.4 A least-squares fit of a curve of the form c₁ + c₂x + c₃x² + c₄ sin(2πx) + c₅ cos(2πx) for the carbon-dioxide concentrations measured at Mauna Loa, Hawaii from 1990 to 2019, where x is the number of years elapsed since 1990. This curve is the famous “Keeling curve,” illustrating curve-fitting to nonpolynomial formulas. The sine and cosine terms allow modeling of seasonal variations in CO₂ concentrations. The red curve shows the measured CO₂ concentrations. The best fit, shown in black, has the form 352.83 + 1.39x + 0.02x² + 2.83 sin(2πx) − 0.94 cos(2πx).

We close this section with an example in Figure 28.4, illustrating that a least-squares fit can also use nonpolynomial basis functions. The curve confirms one aspect of climate change: carbon-dioxide (CO₂) concentrations steadily increased over a period of 29 years. Linear and quadratic terms model the annual increase, and sine and cosine terms model seasonal variations.

Exercises

28.3-1


Prove that every diagonal element of a symmetric positive-definite

matrix is positive.

28.3-2

Let

A = ( a  b )
    ( b  c )

be a 2 × 2 symmetric positive-definite matrix. Prove that its determinant ac − b² is positive by “completing the square” in a manner similar to that used in the proof of Lemma 28.5.

28.3-3

Prove that the maximum element in a symmetric positive-definite matrix

lies on the diagonal.

28.3-4

Prove that the determinant of each leading submatrix of a symmetric

positive-definite matrix is positive.

28.3-5

Let Aₖ denote the kth leading submatrix of a symmetric positive-definite matrix A. Prove that det(Aₖ)/det(Aₖ₋₁) is the kth pivot during LU decomposition, where, by convention, det(A₀) = 1.

28.3-6

Find the function of the form

F(x) = c₁ + c₂ x lg x + c₃ eˣ

that is the best least-squares fit to the data points (1, 1), (2, 1), (3, 3), (4, 8).

28.3-7

Show that the pseudoinverse A⁺ satisfies the following four equations:

AA⁺A = A,
A⁺AA⁺ = A⁺,
(AA⁺)ᵀ = AA⁺,
(A⁺A)ᵀ = A⁺A.

Problems

28-1 Tridiagonal systems of linear equations

Consider the tridiagonal matrix

a. Find an LU decomposition of A.

b. Solve the equation Ax = ( 1 1 1 1 1 )ᵀ by using forward and back substitution.

c. Find the inverse of A.

d. Show how to solve the equation Ax = b for any n × n symmetric positive-definite, tridiagonal matrix A and any n-vector b in O(n) time by performing an LU decomposition. Argue that any method based on forming A⁻¹ is asymptotically more expensive in the worst case.

e. Show how to solve the equation Ax = b for any n × n nonsingular, tridiagonal matrix A and any n-vector b in O(n) time by performing an LUP decomposition.

28-2 Splines

A practical method for interpolating a set of points with a curve is to

use cubic splines. You are given a set {(xᵢ, yᵢ) : i = 0, 1, … , n} of n + 1 point-value pairs, where x₀ < x₁ < ⋯ < xₙ. Your goal is to fit a piecewise-cubic curve (spline) f(x) to the points. That is, the curve f(x) is made up of n cubic polynomials fᵢ(x) = aᵢ + bᵢx + cᵢx² + dᵢx³ for i = 0, 1, … , n − 1, where if x falls in the range xᵢ ≤ x ≤ xᵢ₊₁, then the value of the curve is given by f(x) = fᵢ(x − xᵢ). The points xᵢ at which the cubic


polynomials are “pasted” together are called knots. For simplicity, assume that xᵢ = i for i = 0, 1, … , n.

To ensure continuity of f(x), require that

f(xᵢ) = fᵢ(0) = yᵢ,
f(xᵢ₊₁) = fᵢ(1) = yᵢ₊₁

for i = 0, 1, … , n − 1. To ensure that f(x) is sufficiently smooth, also require the first derivative to be continuous at each knot:

f′(xᵢ₊₁) = fᵢ′(1) = fᵢ₊₁′(0)

for i = 0, 1, … , n − 2.

a. Suppose that for i = 0, 1, … , n, in addition to the point-value pairs {(xᵢ, yᵢ)}, you are also given the first derivative Dᵢ = f′(xᵢ) at each knot. Express each coefficient aᵢ, bᵢ, cᵢ, and dᵢ in terms of the values yᵢ, yᵢ₊₁, Dᵢ, and Dᵢ₊₁. (Remember that xᵢ = i.) How quickly can you compute the 4n coefficients from the point-value pairs and first derivatives?