Last-in, first-out (LIFO): evict the block that has been in the
cache the shortest time.
Least Recently Used (LRU): evict the block whose last use is
furthest in the past.
Least Frequently Used (LFU): evict the block that has been
accessed the fewest times, breaking ties by choosing the block that
has been in the cache the longest.
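To make the three policies concrete, here is a small simulator sketch in Python (the `simulate` helper and its per-block state layout are ours, not part of the text); it counts cache misses for a size-k cache under each policy:

```python
def simulate(policy, requests, k):
    """Count cache misses for a size-k cache under "LIFO", "LRU", or "LFU".
    Per cached block we track [inserted_at, last_used_at, use_count]."""
    cache = {}          # block -> [inserted_at, last_used_at, use_count]
    misses = 0
    for t, b in enumerate(requests):
        if b in cache:                      # cache hit: update usage state
            cache[b][1] = t
            cache[b][2] += 1
            continue
        misses += 1
        if len(cache) == k:                 # cache full: pick a victim
            if policy == "LIFO":            # newest insertion is evicted
                victim = max(cache, key=lambda x: cache[x][0])
            elif policy == "LRU":           # least recently used is evicted
                victim = min(cache, key=lambda x: cache[x][1])
            else:                           # LFU: fewest uses, oldest breaks ties
                victim = min(cache, key=lambda x: (cache[x][2], cache[x][0]))
            del cache[victim]
        cache[b] = [t, t, 1]
    return misses
```

For example, `simulate("LRU", [1, 2, 1, 5, 4], 3)` counts 4 misses (the repeated request for block 1 hits).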
To analyze these algorithms, we assume that the cache starts out
empty, so that no evictions occur during the first k requests. We wish to
compare the performance of an online algorithm to an optimal offline
algorithm that knows the future requests. As we will soon see, all these
deterministic online algorithms have a lower bound of Ω(k) for their
competitive ratio. Some deterministic algorithms also have a
competitive ratio with a matching O(k) upper bound, but some other
deterministic algorithms are considerably worse, having a competitive
ratio of Θ(n/k).
We now proceed to analyze the LIFO and LRU policies. In addition
to assuming that n > k, we will assume that at least k distinct blocks are requested. Otherwise, the cache never fills up and no blocks are evicted,
so that all algorithms exhibit the same behavior. We begin by showing that LIFO has a large competitive ratio.
Theorem 27.2
LIFO has a competitive ratio of Θ(n/k) for the online caching problem
with n requests and a cache of size k.
Proof We first show a lower bound of Ω(n/k). Suppose that the input consists of k + 1 blocks, numbered 1, 2, …, k + 1, and that the request sequence is 1, 2, …, k, k + 1, k, k + 1, k, k + 1, …, with a total of n requests: after the initial 1, 2, …, k, k + 1, the sequence alternates between k and k + 1. That is, b_i = i for i = 1, 2, …, k − 1; b_i = k for i = k, k + 2, k + 4, …; and b_i = k + 1 for i = k + 1, k + 3, k + 5, …. The sequence ends on block k if n and k have the same parity, and on block k + 1 otherwise.
How many blocks does LIFO evict? After the first k requests (which are counted as cache misses), the cache is filled with blocks 1, 2, …, k.
The (k + 1)st request, which is for block k + 1, causes block k to be evicted. The (k + 2)nd request, which is for block k, forces block k + 1 to be evicted, since that block was just placed into the cache. This behavior continues, alternately evicting blocks k and k + 1 for the remaining requests. LIFO, therefore, suffers a cache miss on every one of the n requests.
The optimal offline algorithm knows the entire sequence of requests in advance. Upon the first request for block k + 1, it evicts any block other than block k, and thereafter it never evicts another block. Thus, the optimal offline algorithm evicts only once. Since the first k requests are counted as cache misses, its total number of cache misses is k + 1. The competitive ratio, therefore, is n/(k + 1) = Ω(n/k).
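The adversarial sequence and LIFO's behavior on it can be checked directly; this sketch (helper names ours) builds the sequence 1, 2, …, k, k + 1, k, k + 1, … and counts LIFO's misses:

```python
def lifo_misses(requests, k):
    """LIFO: on a miss with a full cache, evict the most recently inserted block."""
    cache, stack, misses = set(), [], 0
    for b in requests:
        if b in cache:
            continue
        misses += 1
        if len(cache) == k:
            victim = stack.pop()          # last in, first out
            cache.remove(victim)
        cache.add(b)
        stack.append(b)
    return misses

def adversarial_sequence(n, k):
    """1, 2, ..., k, k+1, then alternate k, k+1, for a total of n requests."""
    seq, nxt = list(range(1, k + 2)), k
    while len(seq) < n:
        seq.append(nxt)
        nxt = k + 1 if nxt == k else k
    return seq[:n]
```

On this input LIFO misses on every one of the n requests, while an offline algorithm that evicts, say, block 1 upon the first request for block k + 1 misses only k + 1 times.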
For the upper bound, observe that on any input of size n, any caching algorithm incurs at most n cache misses. Because the input contains at least k distinct blocks, any caching algorithm, including the optimal offline algorithm, must incur at least k cache misses. Therefore, LIFO has a competitive ratio of O(n/k).

▪
We call such a competitive ratio unbounded, because it grows with
the input size. Exercise 27.3-2 asks you to show that LFU also has an
unbounded competitive ratio.
FIFO and LRU have a much better competitive ratio of Θ(k). There is a big difference between competitive ratios of Θ(n/k) and Θ(k). The cache size k is independent of the input sequence and does not grow as
more requests arrive over time. A competitive ratio that depends on n,
on the other hand, does grow with the size of the input sequence and
thus can get quite large. It is preferable to use an algorithm with a
competitive ratio that does not grow with the input sequence’s size,
when possible.
We now show that LRU has a competitive ratio of Θ(k), first showing the upper bound.
Theorem 27.3
LRU has a competitive ratio of O(k) for the online caching problem with n requests and a cache of size k.
Proof To analyze LRU, we divide the sequence of requests into epochs. Epoch 1 begins with the first request. Epoch i, for i > 1, begins upon encountering the (k + 1)st distinct request since the beginning of epoch i − 1. Consider the following example of requests with k = 3:

1, 2, 1, 5, 4, 4, 1, 2, 4, 2, 3, 4, 5, 2, 2, 1, 2, 2.   (27.10)

The first k = 3 distinct requests are for blocks 1, 2, and 5, so epoch 2 begins with the first request for block 4. In epoch 2, the first 3 distinct requests are for blocks 4, 1, and 2. Requests for these blocks recur until the request for block 3, and with this request epoch 3 begins. Thus, this example has four epochs:

1, 2, 1, 5
4, 4, 1, 2, 4, 2
3, 4, 5
2, 2, 1, 2, 2.   (27.11)
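Epoch boundaries can be computed mechanically: scan the requests and start a new epoch at the (k + 1)st distinct block seen since the current epoch began. A sketch (function name ours), applied to the chapter's running example with k = 3:

```python
def epochs(requests, k):
    """Split a request sequence into epochs: a new epoch starts upon the
    (k+1)st distinct block requested since the current epoch began."""
    result, current, distinct = [], [], set()
    for b in requests:
        if b not in distinct and len(distinct) == k:
            result.append(current)          # close the current epoch
            current, distinct = [], set()
        current.append(b)
        distinct.add(b)
    if current:
        result.append(current)              # final, possibly short, epoch
    return result

example = [1, 2, 1, 5, 4, 4, 1, 2, 4, 2, 3, 4, 5, 2, 2, 1, 2, 2]
```

Here `epochs(example, 3)` yields the four epochs described in the text.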
Now we consider the behavior of LRU. In each epoch, the first time
a request for a particular block appears, it may cause a cache miss, but
subsequent requests for that block within the epoch cannot cause a
cache miss, since the block is now one of the k most recently used. For
example, in epoch 2, the first request for block 4 causes a cache miss, but the subsequent requests for block 4 do not. (Exercise 27.3-1 asks
you to show the contents of the cache after each request.) In epoch 3,
requests for blocks 3 and 5 cause cache misses, but the request for block
4 does not, because it was recently accessed in epoch 2. Since only the
first request for a block within an epoch can cause a cache miss and the
cache holds k blocks, each epoch incurs at most k cache misses.
Now consider the behavior of the optimal algorithm. The first
request in each epoch must cause a cache miss, even for an optimal
algorithm. The miss occurs because, by the definition of an epoch, there
must have been k other blocks accessed since the last access to this block.
Since, for each epoch, the optimal algorithm incurs at least one miss and LRU incurs at most k, the competitive ratio is at most k/1 = O(k).
▪
Exercise 27.3-3 asks you to show that FIFO also has a competitive ratio of O(k).
We could show lower bounds of Ω(k) on LRU and FIFO, but in fact, we can make a much stronger statement: any deterministic online caching algorithm must have a competitive ratio of Ω(k). The proof relies on an adversary who knows the online algorithm being used and can tailor the future requests to cause the online algorithm to incur more cache misses than the optimal offline algorithm.
Consider a scenario in which the cache has size k and the set of possible blocks to request is {1, 2, …, k + 1}. The first k requests are for blocks 1, 2, …, k, so that both the adversary and the deterministic online algorithm place these blocks into the cache. The next request is for block k + 1. In order to make room in the cache for block k + 1, the online algorithm evicts some block b_1 from the cache. The adversary, knowing that the online algorithm has just evicted block b_1, makes the next request be for b_1, so that the online algorithm must evict some other block b_2 to clear room in the cache for b_1. As you might have guessed, the adversary makes the next request be for block b_2, so that the online algorithm evicts some other block b_3 to make room for b_2.

The online algorithm and the adversary continue in this manner. The
online algorithm incurs a cache miss on every request and therefore
incurs n cache misses over the n requests.
Now let’s consider an optimal offline algorithm, which knows the future. As discussed in Section 15.4, this algorithm is known as furthest-in-future, and it always evicts the block whose next request is furthest in the future. Since there are only k + 1 distinct blocks, when furthest-in-future evicts a block, that block will not be accessed during at least the next k requests. Thus, after the first k cache misses, the optimal algorithm incurs a cache miss at most once every k requests. Therefore, the number of cache misses over n requests is at most k + n/k.
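The furthest-in-future rule is easy to simulate offline. The sketch below (function name ours) precomputes, for each position, the index of the next request for the same block, then evicts the cached block whose next request is furthest away:

```python
def furthest_in_future_misses(requests, k):
    """Furthest-in-future: on a miss with a full cache, evict the cached
    block whose next request is furthest in the future (or never recurs)."""
    n = len(requests)
    next_use, last = [n] * n, {}
    # next_use[i] = index of the next request for requests[i] after i (n if none)
    for i in range(n - 1, -1, -1):
        next_use[i] = last.get(requests[i], n)
        last[requests[i]] = i
    cache, misses = {}, 0        # block -> index of its next request
    for i, b in enumerate(requests):
        if b in cache:
            cache[b] = next_use[i]
            continue
        misses += 1
        if len(cache) == k:
            victim = max(cache, key=cache.get)   # furthest next request
            del cache[victim]
        cache[b] = next_use[i]
    return misses
```

On the alternating sequence from Theorem 27.2 with k = 3, for example, it incurs only k + 1 = 4 misses.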
Since the deterministic online algorithm incurs n cache misses and the optimal offline algorithm incurs at most k + n/k cache misses, the competitive ratio is at least

n / (k + n/k).

For n ≥ k², we have k ≤ n/k, and so the above expression is at least

n / (2n/k) = k/2 = Ω(k).

Thus, for sufficiently long request sequences, we have shown the following:
Theorem 27.4
Any deterministic online algorithm for caching with a cache of size k has a competitive ratio of Ω(k).
▪
Although we can analyze the common caching strategies from the
point of view of competitive analysis, the results are somewhat
unsatisfying. Yes, we can distinguish between algorithms with a competitive ratio of Θ(k) and those with unbounded competitive ratios.
In the end, however, all of these competitive ratios are rather high. The
online algorithms we have seen so far are deterministic, and it is this
property that the adversary is able to exploit.
27.3.2 Randomized caching algorithms
If we don’t limit ourselves to deterministic online algorithms, we can use
randomization to develop an online caching algorithm with a
significantly smaller competitive ratio. Before describing the algorithm,
let’s discuss randomization in online algorithms in general. Recall that
we analyze online algorithms with respect to an adversary who knows
the online algorithm and can design requests knowing the decisions
made by the online algorithm. With randomization, we must ask
whether the adversary also knows the random choices made by the
online algorithm. An adversary who does not know the random choices
is oblivious, and an adversary who knows the random choices is
nonoblivious. Ideally, we would prefer to design algorithms that perform well against a nonoblivious adversary, since this adversary is stronger than an oblivious one. Unfortunately, a nonoblivious adversary negates much of the power of randomness: an adversary who knows the outcomes of the random choices typically can act as if the online algorithm were deterministic. The oblivious adversary, on the other hand, does not
know the random choices of the online algorithm, and that is the
adversary we typically use.
As a simple illustration of the difference between an oblivious and
nonoblivious adversary, imagine that you are flipping a fair coin n times,
and the adversary wants to know how many heads you flipped. A
nonoblivious adversary knows, after each flip, whether the coin came up
heads or tails, and hence knows how many heads you flipped. An
oblivious adversary, on the other hand, knows only that you are flipping
a fair coin n times. The oblivious adversary, therefore, can reason that
the number of heads follows a binomial distribution, so that the
expected number of heads is n/2 (by equation (C.41) on page 1199) and
the variance is n/4 (by equation (C.44) on page 1200). But the oblivious
adversary has no way of knowing exactly how many heads you actually
flipped.
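The oblivious adversary's reasoning can be checked numerically by computing the mean and variance of the number of heads directly from the binomial pmf (a small illustrative calculation, not part of the text):

```python
from math import comb

def binomial_mean_var(n):
    """Exact mean and variance of the number of heads in n fair coin flips,
    computed from the binomial pmf Pr{H = h} = C(n, h) / 2^n."""
    pmf = [comb(n, h) / 2 ** n for h in range(n + 1)]
    mean = sum(h * p for h, p in enumerate(pmf))
    var = sum((h - mean) ** 2 * p for h, p in enumerate(pmf))
    return mean, var
```

For n = 10, this returns a mean of about n/2 = 5 and a variance of about n/4 = 2.5, as the adversary reasons.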
Let’s return to caching. We’ll start with a deterministic algorithm
and then randomize it. The algorithm we’ll use is an approximation of
LRU called MARKING. Rather than “least recently used,” think of
MARKING as simply “recently used.” MARKING maintains a 1-bit
attribute mark for each block in the cache. Initially, all blocks in the cache are unmarked. When a block is requested, if it is already in the
cache, it is marked. If the request is a cache miss, MARKING checks to
see whether there are any unmarked blocks in the cache. If all blocks are
marked, then they are all changed to unmarked. Now, regardless of
whether all blocks in the cache were marked when the request occurred,
there is at least one unmarked block in the cache, and so an arbitrary
unmarked block is evicted, and the requested block is placed into the
cache and marked.
How should the block to evict from among the unmarked blocks in
the cache be chosen? The procedure RANDOMIZED-MARKING on
the next page shows the process when the block is chosen randomly. The
procedure takes as input a block b being requested.
RANDOMIZED-MARKING(b)
1  if block b resides in the cache
2      b.mark = 1
3  else
4      if all blocks b′ in the cache have b′.mark == 1
5          unmark all blocks b′ in the cache, setting b′.mark = 0
6      select an unmarked block u with u.mark == 0 uniformly at random
7      evict block u
8      place block b into the cache
9      b.mark = 1
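The pseudocode translates line for line into Python. The sketch below (names ours) additionally skips eviction while the cache still has free space, and counts misses:

```python
import random

def randomized_marking_misses(requests, k, seed=0):
    """RANDOMIZED-MARKING: evict a uniformly random unmarked block on a miss."""
    rng = random.Random(seed)
    cache, misses = {}, 0                        # block -> mark bit
    for b in requests:
        if b in cache:
            cache[b] = 1                         # line 2: mark on a hit
        else:
            misses += 1
            if len(cache) == k:                  # evict only when full
                if all(cache.values()):          # lines 4-5: unmark everything
                    cache = {blk: 0 for blk in cache}
                unmarked = [blk for blk, m in cache.items() if m == 0]
                u = rng.choice(unmarked)         # line 6: random unmarked victim
                del cache[u]                     # line 7: evict u
            cache[b] = 1                         # lines 8-9: insert b, marked
    return misses
```

A real cache would of course store the blocks themselves; here the dictionary keys stand in for the cached blocks and the values for their mark bits.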
For the purpose of analysis, we say that a new epoch begins
immediately after each time line 5 executes. An epoch starts with no
marked blocks in the cache. The first time a block is requested during an
epoch, the number of marked blocks increases by 1, and any subsequent
requests to that block do not change the number of marked blocks.
Therefore, the number of marked blocks monotonically increases within
an epoch. Under this view, epochs are the same as in the proof of
Theorem 27.3: with a cache that holds k blocks, an epoch comprises
requests for k distinct blocks (possibly fewer for the final epoch), and the next epoch begins upon a request for a block not in those k.
Because we are going to analyze a randomized algorithm, we will
compute the expected competitive ratio. Recall that for an input I, we
denote the solution value of an online algorithm A by A(I) and the solution value of an optimal algorithm F by F(I). Online algorithm A has an expected competitive ratio c if for all inputs I, we have

E[A(I)] ≤ c · F(I),

where the expectation is taken over the random choices made by A.
Although the deterministic MARKING algorithm has a competitive ratio of Θ(k) (Theorem 27.4 provides the lower bound, and Exercise 27.3-4 asks you to show the upper bound), RANDOMIZED-MARKING has a much smaller expected competitive ratio, namely O(lg k). The key to the improved competitive ratio is that the adversary cannot always make a
request for a block that is not in the cache, since an oblivious adversary
does not know which blocks are in the cache.
Theorem 27.5
RANDOMIZED-MARKING has an expected competitive ratio of O(lg k) for the online caching problem with n requests and a cache of size k, against an oblivious adversary.
Before proving Theorem 27.5, we prove a basic probabilistic fact.
Lemma 27.6
Suppose that a bag contains x + y balls: x − 1 blue balls, y white balls, and 1 red ball. You repeatedly choose a ball at random and remove it from the bag until you have chosen a total of m balls that are either blue or red, where m ≤ x. You set aside each white ball you choose. Then the probability that one of the chosen balls is the red ball is m/x.
Proof Choosing a white ball does not affect how many blue or red balls
are chosen in any way. Therefore, we can continue the analysis as if
there were no white balls and the bag contains just x − 1 blue balls and
1 red ball.
Let A be the event that the red ball is not chosen, and let A_i be the event that the i-th draw does not choose the red ball. By equation (C.22) on page 1190, we have

Pr{A} = Pr{A_1 ∩ A_2 ∩ ⋯ ∩ A_m}
      = Pr{A_1} · Pr{A_2 | A_1} ⋯ Pr{A_m | A_1 ∩ A_2 ∩ ⋯ ∩ A_{m−1}}.   (27.13)

The probability Pr{A_1} that the first ball chosen is blue equals (x − 1)/x, since initially there are x − 1 blue balls and 1 red ball. More generally, we have

Pr{A_i | A_1 ∩ A_2 ∩ ⋯ ∩ A_{i−1}} = (x − i)/(x − i + 1),   (27.14)

since the i-th draw is from x − i blue balls and 1 red ball. Equations (27.13) and (27.14) give

Pr{A} = ((x − 1)/x) · ((x − 2)/(x − 1)) · ((x − 3)/(x − 2)) ⋯ ((x − m)/(x − m + 1)).   (27.15)

The right-hand side of equation (27.15) is a telescoping product, similar to the telescoping series in equation (A.12) on page 1143. The numerator of each factor equals the denominator of the next, so that everything except the first denominator and the last numerator cancels, and we obtain Pr{A} = (x − m)/x. Since we actually want Pr{Ā} = 1 − Pr{A}, that is, the probability that the red ball is chosen, we get Pr{Ā} = 1 − (x − m)/x = m/x.
▪
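As a sanity check on Lemma 27.6, a Monte Carlo simulation of the ball-drawing process (an illustrative experiment, not part of the proof) should return probabilities close to m/x:

```python
import random

def red_ball_probability(x, y, m, trials=20000, seed=42):
    """Estimate the probability that the red ball is among the first m
    blue-or-red balls drawn from a bag of x-1 blue, y white, and 1 red ball."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        bag = ["blue"] * (x - 1) + ["white"] * y + ["red"]
        drawn_br = 0
        while drawn_br < m:
            ball = bag.pop(rng.randrange(len(bag)))
            if ball != "white":             # white balls are set aside
                drawn_br += 1
            if ball == "red":
                hits += 1
                break
    return hits / trials
```

With x = 5, y = 3, and m = 2, the estimate comes out near 2/5, as the lemma predicts; the white balls indeed have no effect.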
Now we can prove Theorem 27.5.
Proof We’ll analyze RANDOMIZED-MARKING one epoch at a
time. Within epoch i, any request for a block b that is not the first request for block b in epoch i must result in a cache hit, since after the first request in epoch i, block b resides in the cache and is marked, so that it cannot be evicted during the epoch. Therefore, since we are
counting cache misses, we’ll consider only the first request for each
block within each epoch, disregarding all other requests.
We can classify the requests in an epoch as either old or new. If block
b resides in the cache at the start of epoch i, each request for block b during epoch i is an old request. Old requests in epoch i are for blocks requested in epoch i − 1. If a request in epoch i is not old, it is a new
request, and it is for a block not requested in epoch i − 1. All requests in epoch 1 are new. For example, let’s look again at the request sequence in
example (27.11):
1, 2, 1, 5
4, 4, 1, 2, 4, 2
3, 4, 5
2, 2, 1, 2, 2.
Since we can disregard all requests for a block within an epoch other
than the first request, to analyze the cache behavior, we can view this
request sequence as just
1, 2, 5
4, 1, 2
3, 4, 5
2, 1.
All three requests in epoch 1 are new. In epoch 2, the requests for blocks
1 and 2 are old, but the request for block 4 is new. In epoch 3, the
request for block 4 is old, and the requests for blocks 3 and 5 are new.
Both requests in epoch 4 are new.
Within an epoch, each new request must cause a cache miss since, by
definition, the block is not already in the cache. An old request, on the
other hand, may or may not cause a cache miss. The old block is in the
cache at the beginning of the epoch, but other requests might cause it to
be evicted. Returning to our example, in epoch 2, the request for block 4
must cause a cache miss, as this request is new. The request for block 1,
which is old, may or may not cause a cache miss. If block 1 was evicted
when block 4 was requested, then a cache miss occurs and block 1 must
be brought back into the cache. If instead block 1 was not evicted when
block 4 was requested, then the request for block 1 results in a cache hit.
The request for block 2 could incur a cache miss under two scenarios.
One is if block 2 was evicted when block 4 was requested. The other is if
block 1 was evicted when block 4 was requested, and then block 2 was
evicted when block 1 was requested. We see that, within an epoch, each
ensuing old request has an increasing chance of causing a cache miss.
Because we consider only the first request for each block within an
epoch, we assume that each epoch contains exactly k requests, and each
request within an epoch is for a unique block. (The last epoch might
contain fewer than k requests. If it does, just add dummy requests to fill
it out to k requests.) In epoch i, denote the number of new requests by ri
≥ 1 (an epoch must contain at least one new request), so that the
number of old requests is k − ri. As mentioned above, a new request always incurs a cache miss.
Let us now focus on an arbitrary epoch i to obtain a bound on the
expected number of cache misses within that epoch. In particular, let’s
think about the j-th old request within the epoch, where 1 ≤ j ≤ k − r_i. Denote by b_{i,j} the block requested in the j-th old request of epoch i, and denote by n_{i,j} and o_{i,j} the number of new and old requests, respectively, that occur within epoch i but before the j-th old request. Because j − 1 old requests occur before the j-th old request, we have o_{i,j} = j − 1. We will show that the probability of a cache miss upon the j-th old request is n_{i,j}/(k − o_{i,j}), or n_{i,j}/(k − j + 1).
Start by considering the first old request, which is for block b_{i,1}. What is the probability that this request causes a cache miss? It causes a cache miss precisely when one of the n_{i,1} new requests preceding it resulted in b_{i,1} being evicted. We can determine the probability that b_{i,1} was chosen for eviction by using Lemma 27.6: consider the k blocks in the cache to be k balls, with block b_{i,1} as the red ball, the other k − 1 blocks as the k − 1 blue balls, and no white balls. Each of the n_{i,1} requests chooses a block to evict uniformly at random, corresponding to drawing balls n_{i,1} times. Thus, we can apply Lemma 27.6 with x = k, y = 0, and m = n_{i,1}, deriving the probability of a cache miss upon the first old request as n_{i,1}/k, which equals n_{i,j}/(k − j + 1) since j = 1.
In order to determine the probability of a cache miss for subsequent old requests, we'll need an additional observation. Consider the second old request, which is for block b_{i,2}. This request causes a cache miss precisely when one of the previous requests evicts b_{i,2}. There are two cases, based on the request for b_{i,1}. In the first case, suppose that the request for b_{i,1} did not cause an eviction, because b_{i,1} was already in the cache. Then the only way that b_{i,2} could have been evicted is by one of the n_{i,2} new requests that precede it. What is the probability that this eviction happens? There are n_{i,2} chances for b_{i,2} to be evicted, but we also know that one block in the cache, namely b_{i,1}, is not evicted. Thus, we can again apply Lemma 27.6, but with b_{i,1} as the white ball, b_{i,2} as the red ball, the remaining blocks as the blue balls, and n_{i,2} draws. Applying Lemma 27.6 with x = k − 1, y = 1, and m = n_{i,2}, we find that the probability of a cache miss is n_{i,2}/(k − 1). In the second case, the request for b_{i,1} does cause an eviction, which can happen only if one of the new requests preceding the request for b_{i,1} evicted b_{i,1}. The request for b_{i,1} then brings b_{i,1} back into the cache and evicts some other block. In this case, we know that one of the new requests did not result in b_{i,2} being evicted, since it evicted b_{i,1} instead. Therefore, n_{i,2} − 1 new requests could evict b_{i,2}, as could the request for b_{i,1}, so that the number of requests that could evict b_{i,2} is again n_{i,2}. Each such request evicts a block chosen from among k − 1 blocks, since the request that evicted b_{i,1} did not also cause b_{i,2} to be evicted. Therefore, we can apply Lemma 27.6 with x = k − 1, y = 1, and m = n_{i,2}, and again the probability of a miss is n_{i,2}/(k − 1). In both cases the probability is the same, and it equals n_{i,j}/(k − j + 1) since j = 2.
More generally, o_{i,j} old requests occur before the j-th old request, and each of them either caused an eviction or did not. An old request that caused an eviction did so because its block had been evicted by an earlier request; an old request that caused no eviction corresponds to a block that no earlier request evicted. In either case, as in the analysis above, each prior old request removes one block from the pool that the random eviction process chooses from, and those o_{i,j} requests cannot cause b_{i,j} to be evicted. Therefore, we can use Lemma 27.6 to determine the probability that b_{i,j} was evicted by a previous request, with x = k − o_{i,j}, y = o_{i,j}, and m = n_{i,j}. Thus, we have proven our claim that the probability of a cache miss on the j-th old request is n_{i,j}/(k − o_{i,j}), or n_{i,j}/(k − j + 1). Since n_{i,j} ≤ r_i (recall that r_i is the number of new requests during epoch i), we have an upper bound of r_i/(k − j + 1) on the probability that the j-th old request incurs a cache miss.
We can now compute the expected number of misses during epoch i using indicator random variables, as introduced in Section 5.2. We define the indicator random variables

Y_{i,j} = I{the j-th old request in epoch i incurs a cache miss},
Z_{i,j} = I{the j-th new request in epoch i incurs a cache miss}.

We have Z_{i,j} = 1 for j = 1, 2, …, r_i, since every new request results in a cache miss. Let X_i be the random variable denoting the number of cache misses during epoch i, so that

X_i = Σ_{j=1}^{r_i} Z_{i,j} + Σ_{j=1}^{k−r_i} Y_{i,j},

and so

E[X_i] = r_i + Σ_{j=1}^{k−r_i} E[Y_{i,j}]
       ≤ r_i + Σ_{j=1}^{k−r_i} r_i/(k − j + 1)
       = r_i + r_i (1/k + 1/(k − 1) + ⋯ + 1/(r_i + 1))
       = r_i (1 + H_k − H_{r_i})
       ≤ r_i H_k,

where H_k is the k-th harmonic number (and the last line uses H_{r_i} ≥ 1).
To compute the expected total number of cache misses, we sum over all epochs. Let p denote the number of epochs and X be the random variable denoting the total number of cache misses. Then we have

X = Σ_{i=1}^{p} X_i,

so that

E[X] = Σ_{i=1}^{p} E[X_i] ≤ H_k Σ_{i=1}^{p} r_i.   (27.17)
To complete the analysis, we need to understand the behavior of the
optimal offline algorithm. It could make a completely different set of
decisions from those made by RANDOMIZED-MARKING, and at
any point its cache may look nothing like the cache of the randomized
algorithm. Yet we want to relate the number of cache misses of the optimal offline algorithm to the value in inequality (27.17), in order to obtain a competitive ratio that does not depend on Σ_{i=1}^{p} r_i. Focusing on individual epochs won't suffice. At the beginning of any epoch, the
offline algorithm might have loaded the cache with exactly the blocks
that will be requested in that epoch. Therefore, we cannot take any one
epoch in isolation and claim that an offline algorithm must suffer any
cache misses during that epoch.
If we consider two consecutive epochs, however, we can better analyze the optimal offline algorithm. Consider two consecutive epochs i − 1 and i. Each contains k requests for k different blocks. (Recall our assumption that all requests are first requests in an epoch.) Epoch i contains r_i requests for new blocks, that is, blocks that were not requested during epoch i − 1. Therefore, the number of distinct blocks requested during epochs i − 1 and i together is exactly k + r_i. No matter what the cache contents were at the beginning of epoch i − 1, after k + r_i distinct requests, there must be at least r_i cache misses. There could be more, but there is no way to have fewer. Letting m_i denote the number of cache misses of the offline algorithm during epoch i, we have just argued that

m_{i−1} + m_i ≥ r_i   for i = 2, 3, …, p.

The total number of cache misses of the offline algorithm is Σ_{i=1}^{p} m_i. Since each m_i appears in at most two of the sums m_{i−1} + m_i, we have

2 Σ_{i=1}^{p} m_i ≥ m_1 + Σ_{i=2}^{p} (m_{i−1} + m_i)
                 ≥ m_1 + Σ_{i=2}^{p} r_i
                 = Σ_{i=1}^{p} r_i,

so that Σ_{i=1}^{p} m_i ≥ (1/2) Σ_{i=1}^{p} r_i. The justification m_1 = r_1 for the last equality follows because, by our assumptions, the cache starts out empty and every request incurs a cache miss in the first epoch, even for the optimal offline algorithm.
To conclude the analysis: we have an upper bound of H_k Σ_{i=1}^{p} r_i on the expected number of cache misses for RANDOMIZED-MARKING and a lower bound of (1/2) Σ_{i=1}^{p} r_i on the number of cache misses for the optimal offline algorithm, and so the expected competitive ratio is at most

(H_k Σ_{i=1}^{p} r_i) / ((1/2) Σ_{i=1}^{p} r_i) = 2 H_k = O(lg k).
▪
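The O(lg k) versus Θ(k) gap can also be observed empirically. The sketch below (all names ours) pits RANDOMIZED-MARKING against furthest-in-future on uniformly random requests over k + 1 blocks and computes the ratio of their miss counts, which comes out far below k:

```python
import random

def marking_misses(requests, k, rng):
    """RANDOMIZED-MARKING: evict a uniformly random unmarked block on a miss."""
    cache, misses = {}, 0              # block -> mark bit
    for b in requests:
        if b in cache:
            cache[b] = 1
        else:
            misses += 1
            if len(cache) == k:
                if all(cache.values()):
                    cache = {x: 0 for x in cache}
                unmarked = [x for x, m in cache.items() if m == 0]
                del cache[rng.choice(unmarked)]
            cache[b] = 1
    return misses

def offline_misses(requests, k):
    """Furthest-in-future: the optimal offline eviction rule."""
    n, last = len(requests), {}
    nxt = [n] * n                      # next occurrence index of requests[i]
    for i in range(n - 1, -1, -1):
        nxt[i] = last.get(requests[i], n)
        last[requests[i]] = i
    cache, misses = {}, 0              # block -> index of its next request
    for i, b in enumerate(requests):
        if b in cache:
            cache[b] = nxt[i]
            continue
        misses += 1
        if len(cache) == k:
            del cache[max(cache, key=cache.get)]
        cache[b] = nxt[i]
    return misses

rng = random.Random(7)
k, n = 16, 20000
requests = [rng.randrange(k + 1) for _ in range(n)]
ratio = marking_misses(requests, k, rng) / offline_misses(requests, k)
```

Random requests are of course not the adversary's worst case, but the experiment illustrates the point: the observed ratio sits near H_k, well under the Ω(k) that every deterministic algorithm must suffer.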
Exercises
27.3-1
For the cache sequence (27.10), show the contents of the cache after
each request and count the number of cache misses. How many misses
does each epoch incur?
27.3-2
Show that LFU has a competitive ratio of Θ(n/k) for the online caching problem with n requests and a cache of size k.
27.3-3
Show that FIFO has a competitive ratio of O(k) for the online caching problem with n requests and a cache of size k.
27.3-4
Show that the deterministic MARKING algorithm has a competitive ratio of O(k) for the online caching problem with n requests and a cache of size k.
27.3-5
Theorem 27.4 shows that any deterministic online algorithm for caching
has a competitive ratio of Ω( k), where k is the cache size. One way in
which an algorithm might be able to perform better is to have some
ability to know what the next few requests will be. We say that an
algorithm is l-lookahead if it has the ability to look ahead at the next l
requests. Prove that for every constant l ≥ 0 and every cache size k ≥ 1, every deterministic l-lookahead algorithm has a competitive ratio of Ω(k).
Problems
27-1 Cow-path problem
The Appalachian Trail (AT) is a marked hiking trail in the eastern
United States extending between Springer Mountain in Georgia and
Mount Katahdin in Maine. The trail is about 2,190 miles long. You
decide that you are going to hike the AT from Georgia to Maine and
back. You plan to learn more about algorithms while on the trail, and
so you bring along your copy of Introduction to Algorithms in your
backpack. 2 You have already read through this chapter before starting out. Because the beauty of the trail distracts you, you forget about
reading this book until you have reached Maine and hiked halfway back
to Georgia. At that point, you decide that you have already seen the
trail and want to continue reading the rest of the book, starting with
Chapter 28. Unfortunately, you find that the book is no longer in your pack. You must have left it somewhere along the trail, but you don’t
know where. It could be anywhere between Georgia and Maine. You
want to find the book, but now that you have learned something about
online algorithms, you want your algorithm for finding it to have a good
competitive ratio. That is, no matter where the book is, if its distance
from you is x miles away, you would like to be sure that you do not walk
more than cx miles to find it, for some constant c. You do not know x, though you may assume that x ≥ 1. 3
What algorithm should you use, and what constant c can you prove
bounds the total distance cx that you would have to walk? Your
algorithm should work for a trail of any length, not just the 2,190-mile-
long AT.
27-2 Online scheduling to minimize average completion time
Problem 15-2 discusses scheduling to minimize average completion time
on one machine, without release times and preemption and with release
times and preemption. Now you will develop an online algorithm for
nonpreemptively scheduling a set of tasks with release times. Suppose
you are given a set S = {a_1, a_2, …, a_n} of tasks, where task a_i has release time r_i, before which it cannot start, and requires p_i units of processing time to complete once it has started. You have one computer
on which to run the tasks. Tasks cannot be preempted, which is to say
that once started, a task must run to completion without interruption.
(See Problem 15-2 on page 446 for a more detailed description of this
problem.) Given a schedule, let C_i be the completion time of task a_i, that is, the time at which task a_i completes processing. Your goal is to find a schedule that minimizes the average completion time, that is, to minimize

(1/n) Σ_{i=1}^{n} C_i.
In the online version of this problem, you learn about task a_i only when it arrives at its release time r_i, and at that point, you know its processing time p_i. The offline version of this problem is NP-hard (see
Chapter 34), but you will develop a 2-competitive online algorithm.
a. Show that, if there are release times, scheduling by shortest processing
time (when the machine becomes idle, start the already released task
with the smallest processing time that has not yet run) is not d-
competitive for any constant d.





In order to develop an online algorithm, consider the preemptive
version of this problem, which is discussed in Problem 15-2(b). One way
to schedule is to run the tasks according to the shortest remaining
processing time (SRPT) order. That is, at any point, the machine is
running the available task with the smallest amount of remaining
processing time.
b. Explain how to run SRPT as an online algorithm.
c. Suppose that you run SRPT and obtain completion times C′_1, …, C′_n. Show that

Σ_{i=1}^{n} C′_i ≤ Σ_{i=1}^{n} C*_i,

where the C*_i are the completion times in an optimal nonpreemptive schedule.
Consider the (offline) algorithm COMPLETION-TIME-SCHEDULE.
COMPLETION-TIME-SCHEDULE(S)
1  compute an optimal schedule for the preemptive version of the problem
2  renumber the tasks so that a_1, …, a_n are ordered by their completion times in the optimal preemptive (SRPT) schedule
3  greedily schedule the tasks nonpreemptively in the renumbered order a_1, …, a_n
4  let C_1, …, C_n be the completion times of renumbered tasks a_1, …, a_n in this nonpreemptive schedule
5  return C_1, …, C_n
d. Prove that C_i ≤ C′_i + Σ_{j=1}^{i} p_j for i = 1, …, n, where C′_i is the completion time of task a_i in the optimal preemptive (SRPT) schedule.
e. Prove that C_i ≤ 2C′_i for i = 1, …, n, where C′_i is the completion time of task a_i in the optimal preemptive (SRPT) schedule.
f. Algorithm COMPLETION-TIME-SCHEDULE is an offline
algorithm. Explain how to modify it to produce an online algorithm.
g. Combine parts (c)–(f) to show that the online version of
COMPLETION-TIME-SCHEDULE is 2-competitive.
Chapter notes
Online algorithms are widely used in many domains. Some good
overviews include the textbook by Borodin and El-Yaniv [68], the collection of surveys edited by Fiat and Woeginger [142], and the survey by Albers [14].
The move-to-front heuristic from Section 27.2 was analyzed by Sleator and Tarjan [416, 417] as part of their early work on amortized analysis. This rule works quite well in practice.
Competitive analysis of online caching also originated with Sleator
and Tarjan [417]. The randomized marking algorithm was proposed and analyzed by Fiat et al. [141]. Young [464] surveys online caching and paging algorithms, and Buchbinder and Naor [76] survey primal-dual online algorithms.
Specific types of online algorithms are described using other names.
Dynamic graph algorithms are online algorithms on graphs, where at
each step a vertex or edge undergoes modification. Typically a vertex or
edge is either inserted or deleted, or some associated property, such as
edge weight, changes. Some graph problems need to be solved again after each change to the graph, and a good dynamic graph algorithm will not need to solve the problem from scratch each time. For example, as edges are inserted and deleted, the minimum spanning tree must be updated after each change to the graph. Exercise 21.2-8 asks such a question. Similar questions
can be asked for other graph algorithms, such as shortest paths,
connectivity, or matching. The first paper in this field is credited to Even
and Shiloach [138], who study how to maintain a shortest-path tree as edges are being deleted from a graph. Since then hundreds of papers
have been published. Demetrescu et al. [110] survey early developments in dynamic graph algorithms.
For massive data sets, the input data might be too large to store.
Streaming algorithms model this situation by requiring the memory
used by an algorithm to be significantly smaller than the input size. For
example, you may have a graph with n vertices and m edges with m ≫ n, but the memory allowed may be only O( n). Or you may have n numbers, but the memory allowed may be only O(lg n) or even less. A streaming
algorithm is measured by the number of passes made over the data in
addition to the running time of the algorithm. McGregor [322] surveys streaming algorithms for graphs and Muthukrishnan [341] surveys general streaming algorithms.
1 The path-compression heuristic in Section 19.3 resembles MOVE-TO-FRONT, although it would be more accurately expressed as “move-to-next-to-front.” Unlike MOVE-TO-FRONT in
a doubly linked list, path compression can relocate multiple elements to become “next-to-front.”
2 This book is heavy. We do not recommend that you carry it on a long hike.
3 In case you’re wondering what this problem has to do with cows, some papers about it frame the problem as a cow looking for a field in which to graze.
Because operations on matrices lie at the heart of scientific computing,
efficient algorithms for working with matrices have many practical
applications. This chapter focuses on how to multiply matrices and solve
sets of simultaneous linear equations. Appendix D reviews the basics of matrices.
Section 28.1 shows how to solve a set of linear equations using LUP
decompositions. Then, Section 28.2 explores the close relationship between multiplying and inverting matrices. Finally, Section 28.3
discusses the important class of symmetric positive-definite matrices
and shows how to use them to find a least-squares solution to an
overdetermined set of linear equations.
One important issue that arises in practice is numerical stability.
Because actual computers have limits to how precisely they can
represent floating-point numbers, round-off errors in numerical
computations may become amplified over the course of a computation,
leading to incorrect results. Such computations are called numerically
unstable. Although we’ll briefly consider numerical stability on
occasion, we won’t focus on it in this chapter. We refer you to the
excellent book by Higham [216] for a thorough discussion of stability issues.
28.1 Solving systems of linear equations



Numerous applications need to solve sets of simultaneous linear
equations. A linear system can be cast as a matrix equation in which
each matrix or vector element belongs to a field, typically the real
numbers ℝ. This section discusses how to solve a system of linear
equations using a method called LUP decomposition.
The process starts with a set of linear equations in n unknowns x 1, x 2, … , xn:

a11x1 + a12x2 + ⋯ + a1nxn = b1,
a21x1 + a22x2 + ⋯ + a2nxn = b2,
⋮
an1x1 + an2x2 + ⋯ + annxn = bn.    (28.1)
A solution to the equations (28.1) is a set of values for x 1, x 2, … , xn that satisfy all of the equations simultaneously. In this section, we treat
only the case in which there are exactly n equations in n unknowns.
Next, rewrite equations (28.1) as the matrix-vector equation

$$\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}$$

or, equivalently, letting A = ( aij), x = ( xi), and b = ( bi), as

Ax = b.    (28.2)

If A is nonsingular, it possesses an inverse A−1, and

x = A−1 b    (28.3)

is the solution vector. We can prove that x is the unique solution to equation (28.2) as follows. If there are two solutions, x and x′, then Ax = Ax′ = b and, letting I denote an identity matrix,

x = Ix
  = ( A−1 A) x
  = A−1( Ax)
  = A−1( Ax′)
  = ( A−1 A) x′
  = Ix′
  = x′.
This section focuses on the case in which A is nonsingular or,
equivalently (by Theorem D.1 on page 1220), the rank of A equals the
number n of unknowns. There are other possibilities, however, which
merit a brief discussion. If the number of equations is less than the
number n of unknowns—or, more generally, if the rank of A is less than
n—then the system is underdetermined. An underdetermined system
typically has infinitely many solutions, although it may have no
solutions at all if the equations are inconsistent. If the number of
equations exceeds the number n of unknowns, the system is
overdetermined, and there may not exist any solutions. Section 28.3
addresses the important problem of finding good approximate solutions
to overdetermined systems of linear equations.
Let’s return to the problem of solving the system Ax = b of n equations in n unknowns. One option is to compute A−1 and then, using equation (28.3), multiply b by A−1, yielding x = A−1 b. This approach suffers in practice from numerical instability. Fortunately,
another approach—LUP decomposition—is numerically stable and has
the further advantage of being faster in practice.
Overview of LUP decomposition
The idea behind LUP decomposition is to find three n × n matrices L, U, and P such that

PA = LU,    (28.4)

where
L is a unit lower-triangular matrix,
U is an upper-triangular matrix, and
P is a permutation matrix.


We call matrices L, U, and P satisfying equation (28.4) an LUP
decomposition of the matrix A. We’ll show that every nonsingular matrix A possesses such a decomposition.
Computing an LUP decomposition for the matrix A has the
advantage that linear systems can be efficiently solved when they are
triangular, as is the case for both matrices L and U. If you have an LUP
decomposition for A, you can solve equation (28.2), Ax = b, by solving only triangular linear systems, as follows. Multiply both sides of Ax = b
by P, yielding the equivalent equation PAx = Pb. By Exercise D.1-4 on page 1219, multiplying both sides by a permutation matrix amounts to
permuting the equations (28.1). By the decomposition (28.4),
substituting LU for PA gives
LUx = Pb.
You can now solve this equation by solving two triangular linear
systems. Define y = Ux, where x is the desired solution vector. First, solve the lower-triangular system

Ly = Pb    (28.5)

for the unknown vector y by a method called “forward substitution.”
Having solved for y, solve the upper-triangular system

Ux = y    (28.6)

for the unknown x by a method called “back substitution.” Why does
this process solve Ax = b? Because the permutation matrix P is invertible (see Exercise D.2-3 on page 1223), multiplying both sides of
equation (28.4) by P−1 gives P−1 PA = P−1 LU, so that

A = P−1 LU.    (28.7)

Hence, the vector x that satisfies Ux = y is the solution to Ax = b:

Ax = P−1 LUx    (by equation (28.7))
   = P−1 Ly     (by equation (28.6))
   = P−1 Pb     (by equation (28.5))
   = b.
The next step is to show how forward and back substitution work
and then attack the problem of computing the LUP decomposition
itself.
Forward and back substitution
Forward substitution can solve the lower-triangular system (28.5) in
Θ( n 2) time, given L, P, and b. An array π[1 : n] provides a more compact format to represent the permutation P than an n × n matrix that is mostly 0s. For i = 1, 2, … , n, the entry π[ i] indicates that Pi,π[ i] = 1 and Pij = 0 for j ≠ π[ i]. Thus, PA has aπ[ i], j in row i and column j, and Pb has bπ[ i] as its i th element. Since L is unit lower-triangular, the matrix equation Ly = Pb is equivalent to the n equations
y 1 = bπ[1],
l 21 y 1 + y 2 = bπ[2],
l 31 y 1 + l 32 y 2 + y 3 = bπ[3],
⋮
ln 1 y 1 + ln 2 y 2 + ln 3 y 3 + ⋯ + yn = bπ[ n].
The first equation gives y 1 = bπ[1] directly. Knowing the value of y 1, you can substitute it into the second equation, yielding
y 2 = bπ[2] − l 21 y 1.
Next, you can substitute both y 1 and y 2 into the third equation, obtaining
y 3 = bπ[3] − ( l 31 y 1 + l 32 y 2).
In general, you substitute y 1, y 2, … , yi−1 “forward” into the i th equation to solve for yi:

yi = bπ[ i] − ∑_{j=1}^{i−1} lij yj.
Once you’ve solved for y, you can solve for x in equation (28.6) using
back substitution, which is similar to forward substitution. This time, you solve the n th equation first and work backward to the first
equation. Like forward substitution, this process runs in Θ( n 2) time.
Since U is upper-triangular, the matrix equation Ux = y is equivalent to the n equations
u 11 x 1 + u 12 x 2 + ⋯ + u 1, n−2 xn−2 + u 1, n−1 xn−1 + u 1 nxn = y 1,
u 22 x 2 + ⋯ + u 2, n−2 xn−2 + u 2, n−1 xn−1 + u 2 nxn = y 2,
⋮
un−2, n−2 xn−2 + un−2, n−1 xn−1 + un−2, nxn = yn−2,
un−1, n−1 xn−1 + un−1, nxn = yn−1,
un,nxn = yn.
Thus, you can solve for xn, xn−1, … , x 1 successively as follows:

xn = yn/ un,n,
xn−1 = ( yn−1 − un−1, nxn)/ un−1, n−1,
xn−2 = ( yn−2 − ( un−2, n−1 xn−1 + un−2, nxn))/ un−2, n−2,
⋮

or, in general,

xi = ( yi − ∑_{j=i+1}^{n} uij xj)/ uii.
Given P, L, U, and b, the procedure LUP-SOLVE on the next page solves for x by combining forward and back substitution. The
permutation matrix P is represented by the array π. The procedure first solves for y using forward substitution in lines 2–3, and then it solves for
x using backward substitution in lines 4–5. Since the summation within
each of the for loops includes an implicit loop, the running time is
Θ( n 2).
As an example of these methods, consider the system of linear
equations defined by Ax = b, where
LUP-SOLVE( L, U, π, b, n)
1 let x and y be new vectors of length n
2 for i = 1 to n
3     yi = bπ[ i] − ∑_{j=1}^{i−1} lij yj
4 for i = n downto 1
5     xi = ( yi − ∑_{j=i+1}^{n} uij xj)/ uii
6 return x
and we want to solve for the unknown x. The LUP decomposition is
(You might want to verify that PA = LU.) Using forward substitution,
solve Ly = Pb for y:
obtaining
by computing first y 1, then y 2, and finally y 3. Then, using back substitution, solve Ux = y for x:

thereby obtaining the desired answer
by computing first x 3, then x 2, and finally x 1.
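The two substitution passes can be sketched in Python. This is a minimal 0-based translation of LUP-SOLVE (the book's pseudocode uses 1-based indices, and the function name here is ours):

```python
def lup_solve(L, U, pi, b):
    """Solve Ax = b given an LUP decomposition PA = LU.

    L is unit lower-triangular and U is upper-triangular (lists of lists).
    pi is the permutation array: pi[i] = j means row i of P has a 1 in
    column j, so (Pb)[i] = b[pi[i]].
    """
    n = len(L)
    y = [0.0] * n
    for i in range(n):                       # forward substitution: Ly = Pb
        y[i] = b[pi[i]] - sum(L[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):           # back substitution: Ux = y
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x
```

Each pass evaluates a sum of length at most n inside a loop of n iterations, matching the Θ( n 2) bound stated above.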
Computing an LU decomposition
Given an LUP decomposition for a nonsingular matrix A, you can use
forward and back substitution to solve the system Ax = b of linear equations. Now let’s see how to efficiently compute an LUP
decomposition for A. We start with the simpler case in which A is an n ×
n nonsingular matrix and P is absent (or, equivalently, P = In, the n × n identity matrix), so that A = LU. We call the two matrices L and U an LU decomposition of A.
To create an LU decomposition, we’ll use a process known as
Gaussian elimination. Start by subtracting multiples of the first equation
from the other equations in order to remove the first variable from those
equations. Then subtract multiples of the second equation from the
third and subsequent equations so that now the first and second
variables are removed from them. Continue this process until the system
that remains has an upper-triangular form—this is the matrix U. The
matrix L comprises the row multipliers that cause variables to be
eliminated.
To implement this strategy, let’s start with a recursive formulation.
The input is an n × n nonsingular matrix A. If n = 1, then nothing needs to be done: just choose L = I 1 and U = A. For n > 1, break A into four parts:

$$A = \begin{pmatrix} a_{11} & w^{\mathrm T} \\ v & A' \end{pmatrix},\tag{28.8}$$

where v = ( a 21, a 31, … , an 1) is a column ( n−1)-vector, w T = ( a 12, a 13, … , a 1 n)T is a row ( n − 1)-vector, and A′ is an ( n − 1) × ( n − 1) matrix.
Then, using matrix algebra (verify the equations by simply multiplying through), factor A as

$$A = \begin{pmatrix} 1 & 0 \\ v/a_{11} & I_{n-1} \end{pmatrix} \begin{pmatrix} a_{11} & w^{\mathrm T} \\ 0 & A' - v w^{\mathrm T}/a_{11} \end{pmatrix}.\tag{28.9}$$
The 0s in the first and second matrices of equation (28.9) are row and
column ( n − 1)-vectors, respectively. The term vw T/ a 11 is an ( n − 1) × ( n
− 1) matrix formed by taking the outer product of v and w and dividing
each element of the result by a 11. Thus it conforms in size to the matrix
A′ from which it is subtracted. The resulting ( n − 1) × ( n − 1) matrix is called the Schur complement of A with respect to a 11.
We claim that if A is nonsingular, then the Schur complement is
nonsingular, too. Why? Suppose that the Schur complement, which is ( n
− 1) × ( n − 1), is singular. Then by Theorem D.1, it has row rank strictly
less than n − 1. Because the bottom n − 1 entries in the first column of the matrix

$$\begin{pmatrix} a_{11} & w^{\mathrm T} \\ 0 & A' - v w^{\mathrm T}/a_{11} \end{pmatrix}$$

are all 0, the bottom n − 1 rows of this matrix must have row rank strictly less than n − 1. The row rank of the entire matrix, therefore, is
strictly less than n. Applying Exercise D.2-8 on page 1223 to equation
(28.9), A has rank strictly less than n, and from Theorem D.1, we derive the contradiction that A is singular.
Because the Schur complement is nonsingular, it, too, has an LU
decomposition, which we can find recursively. Let’s say that
A′ − vw T/ a 11 = L′ U′,

where L′ is unit lower-triangular and U′ is upper-triangular. The LU
decomposition of A is then A = LU, with

$$L = \begin{pmatrix} 1 & 0 \\ v/a_{11} & L' \end{pmatrix} \qquad\text{and}\qquad U = \begin{pmatrix} a_{11} & w^{\mathrm T} \\ 0 & U' \end{pmatrix},$$

as shown by

$$LU = \begin{pmatrix} 1 & 0 \\ v/a_{11} & L' \end{pmatrix}\begin{pmatrix} a_{11} & w^{\mathrm T} \\ 0 & U' \end{pmatrix} = \begin{pmatrix} a_{11} & w^{\mathrm T} \\ v & v w^{\mathrm T}/a_{11} + L'U' \end{pmatrix} = \begin{pmatrix} a_{11} & w^{\mathrm T} \\ v & A' \end{pmatrix} = A.$$
Because L′ is unit lower-triangular, so is L, and because U′ is upper-triangular, so is U.
Of course, if a 11 = 0, this method doesn’t work, because it divides by
0. It also doesn’t work if the upper leftmost entry of the Schur
complement A′ − vw T/ a 11 is 0, since the next step of the recursion will divide by it. The denominators in each step of LU decomposition are
called pivots, and they occupy the diagonal elements of the matrix U.
The permutation matrix P included in LUP decomposition provides a
way to avoid dividing by 0, as we’ll see below. Using permutations to
avoid division by 0 (or by small numbers, which can contribute to
numerical instability) is called pivoting.
An important class of matrices for which LU decomposition always
works correctly is the class of symmetric positive-definite matrices. Such
matrices require no pivoting to avoid dividing by 0 in the recursive
strategy outlined above. We will prove this result, as well as several
others, in Section 28.3.
The pseudocode in the procedure LU-DECOMPOSITION follows
the recursive strategy, except that an iteration loop replaces the
recursion. (This transformation is a standard optimization for a “tail-
recursive” procedure—one whose last operation is a recursive call to
itself. See Problem 7-5 on page 202.) The procedure initializes the
matrix U with 0s below the diagonal and matrix L with 1s on its diagonal and 0s above the diagonal. Each iteration works on a square
submatrix, using its upper leftmost element as the pivot to compute the
v and w vectors and the Schur complement, which becomes the square
submatrix worked on by the next iteration.
LU-DECOMPOSITION( A, n)
1  let L and U be new n × n matrices
2  initialize U with 0s below the diagonal
3  initialize L with 1s on the diagonal and 0s above the diagonal
4  for k = 1 to n
5      ukk = akk
6      for i = k + 1 to n
7          lik = aik/ akk          // aik holds vi
8          uki = aki               // aki holds wi
9      for i = k + 1 to n          // compute the Schur complement …
10         for j = k + 1 to n
11             aij = aij − likukj   // … and store it back into A
12 return L and U
Each recursive step in the description above takes place in one
iteration of the outer for loop of lines 4–11. Within this loop, line 5
determines the pivot to be ukk = akk. The for loop in lines 6–8 (which
does not execute when k = n) uses the v and w vectors to update L and U. Line 7 determines the below-diagonal elements of L, storing vi/ akk in lik, and line 8 computes the above-diagonal elements of U, storing wi in uki. Finally, lines 9–11 compute the elements of the Schur
complement and store them back into the matrix A. (There is no need
to divide by akk in line 11 because that already happened when line 7
computed lik.) Because line 11 is triply nested, LU-DECOMPOSITION
runs in Θ( n 3) time.
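As a concrete sketch, LU-DECOMPOSITION transcribes directly into Python (0-based indices, operating on a copy of A rather than in place; the function name is ours):

```python
def lu_decomposition(A):
    """LU-decompose a square matrix whose pivots are all nonzero.

    Returns L (unit lower-triangular) and U (upper-triangular) with A = LU.
    Raises ZeroDivisionError if a zero pivot is encountered.
    """
    n = len(A)
    A = [row[:] for row in A]                  # work on a copy of A
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):
        U[k][k] = A[k][k]                      # the pivot
        for i in range(k + 1, n):
            L[i][k] = A[i][k] / A[k][k]        # v_i / pivot goes into L
            U[k][i] = A[k][i]                  # w_i goes into U
        for i in range(k + 1, n):              # Schur complement, stored
            for j in range(k + 1, n):          # back into the copy of A
                A[i][j] -= L[i][k] * U[k][j]
    return L, U
```

For example, lu_decomposition([[4.0, 3.0], [6.0, 3.0]]) yields L = [[1, 0], [1.5, 1]] and U = [[4, 3], [0, −1.5]].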
Figure 28.1 illustrates the operation of LU-DECOMPOSITION. It
shows a standard optimization of the procedure that stores the
significant elements of L and U in place in the matrix A. Each element aij corresponds to either lij (if i > j) or uij (if i ≤ j), so that the matrix A holds both L and U when the procedure terminates. To obtain the pseudocode for this optimization from the pseudocode for the LU-DECOMPOSITION procedure, just replace each reference to l or u by
a. You can verify that this transformation preserves correctness.
Figure 28.1 The operation of LU-DECOMPOSITION. (a) The matrix A. (b) The result of the first iteration of the outer for loop of lines 4–11. The element a 11 = 2 highlighted in blue is the pivot, the tan column is v/ a 11, and the tan row is w T. The elements of U computed thus far are above the horizontal line, and the elements of L are to the left of the vertical line. The Schur complement matrix A′ − vw T/ a 11 occupies the lower right. (c) The result of the next iteration of the outer for loop, on the Schur complement matrix from part (b). The element a 22 = 4
highlighted in blue is the pivot, and the tan column and row are v/ a 22 and w T (in the partitioning of the Schur complement), respectively. Lines divide the matrix into the elements of U computed so far (above), the elements of L computed so far (left), and the new Schur complement (lower right). (d) After the next iteration, the matrix A is factored. The element 3 in the new Schur complement becomes part of U when the recursion terminates. (e) The factorization A = LU.
Computing an LUP decomposition
If the diagonal of the matrix given to LU-DECOMPOSITION contains
any 0s, then the procedure will attempt to divide by 0, which would
cause disaster. Even if the diagonal contains no 0s, but does have
numbers with small absolute values, dividing by such numbers can cause

numerical instabilities. Therefore, LUP decomposition pivots on entries
with the largest absolute values that it can find.
In LUP decomposition, the input is an n × n nonsingular matrix A, with a goal of finding a permutation matrix P, a unit lower-triangular
matrix L, and an upper-triangular matrix U such that PA = LU. Before partitioning the matrix A, as LU decomposition does, LUP
decomposition moves a nonzero element, say ak 1, from somewhere in
the first column to the (1, 1) position of the matrix. For the greatest
numerical stability, LUP decomposition chooses the element in the first
column with the greatest absolute value as ak 1. (The first column
cannot contain only 0s, for then A would be singular, because its
determinant would be 0, by Theorems D.4 and D.5 on page 1221.) In
order to preserve the set of equations, LUP decomposition exchanges
row 1 with row k, which is equivalent to multiplying A by a permutation matrix Q on the left (Exercise D.1-4 on page 1219). Thus, the analog to
equation (28.8) expresses QA as

$$QA = \begin{pmatrix} a_{k1} & w^{\mathrm T} \\ v & A' \end{pmatrix},$$

where v = ( a 21, a 31, … , an 1), except that a 11 replaces ak 1; w T = ( ak 2, ak 3, … , akn)T; and A′ is an ( n − 1) × ( n − 1) matrix. Since ak 1 ≠ 0, the analog to equation (28.9) guarantees no division by 0:

$$QA = \begin{pmatrix} 1 & 0 \\ v/a_{k1} & I_{n-1} \end{pmatrix}\begin{pmatrix} a_{k1} & w^{\mathrm T} \\ 0 & A' - v w^{\mathrm T}/a_{k1} \end{pmatrix}.$$
Just as in LU decomposition, if A is nonsingular, then the Schur
complement A′ − vw T/ ak 1 is nonsingular, too. Therefore, you can recursively find an LUP decomposition for it, with unit lower-triangular
matrix L′, upper-triangular matrix U′, and permutation matrix P′, such that
P′( A′ − vw T/ ak 1) = L′ U′.

Define

$$P = \begin{pmatrix} 1 & 0 \\ 0 & P' \end{pmatrix} Q,$$

which is a permutation matrix, since it is the product of two
permutation matrices (Exercise D.1-4 on page 1219). This definition of
P gives

$$PA = \begin{pmatrix} 1 & 0 \\ P'v/a_{k1} & L' \end{pmatrix}\begin{pmatrix} a_{k1} & w^{\mathrm T} \\ 0 & U' \end{pmatrix} = LU,$$
which yields the LUP decomposition. Because L′ is unit lower-
triangular, so is L, and because U′ is upper-triangular, so is U.
Notice that in this derivation, unlike the one for LU decomposition,
both the column vector v/ ak 1 and the Schur complement A′ − vw T/ ak 1
are multiplied by the permutation matrix P′. The procedure LUP-
DECOMPOSITION gives the pseudocode for LUP decomposition.
LUP-DECOMPOSITION( A, n)
1  let π[1 : n] be a new array
2  for i = 1 to n
3      π[ i] = i                // initialize π to the identity permutation
4  for k = 1 to n
5      p = 0
6      for i = k to n           // find largest absolute value in column k
7          if | aik| > p
8              p = | aik|
9              k′ = i           // row number of the largest found so far
10     if p == 0
11         error “singular matrix”
12     exchange π[ k] with π[ k′]
13     for i = 1 to n           // exchange rows k and k′
14         exchange aki with ak′i
15     for i = k + 1 to n
16         aik = aik/ akk
17         for j = k + 1 to n
18             aij = aij − aikakj   // compute L and U in place in A

Like LU-DECOMPOSITION, the LUP-DECOMPOSITION procedure replaces the recursion with an iteration loop. As an
improvement over a direct implementation of the recursion, the
procedure dynamically maintains the permutation matrix P as an array
π, where π[ i] = j means that the i th row of P contains a 1 in column j.
The LUP-DECOMPOSITION procedure also implements the
improvement mentioned earlier, computing L and U in place in the matrix A. Thus, when the procedure terminates,

aij = lij if i > j,  and  aij = uij if i ≤ j.
Figure 28.2 illustrates how LUP-DECOMPOSITION factors a
matrix. Lines 2–3 initialize the array π to represent the identity
permutation. The outer for loop of lines 4–18 implements the recursion,
finding an LUP decomposition of the ( n − k + 1) × ( n − k + 1) submatrix whose upper left is in row k and column k. Each time through the outer loop, lines 5–9 determine the element ak′k with the
largest absolute value of those in the current first column (column k) of
the ( n − k + 1) × ( n − k + 1) submatrix that the procedure is currently working on. If all elements in the current first column are 0, lines 10–11
report that the matrix is singular. To pivot, line 12 exchanges π[ k′] with π[ k], and lines 13–14 exchange the k th and k′th rows of A, thereby making the pivot element akk. (The entire rows are swapped because in
the derivation of the method above, not only is A′ − vw T/ ak 1 multiplied by P′, but so is v/ ak 1.) Finally, the Schur complement is computed by lines 15–18 in much the same way as it is computed by lines 6–11 of LU-DECOMPOSITION, except that here the operation is written to work
in place.
Figure 28.2 The operation of LUP-DECOMPOSITION. (a) The input matrix A with the identity permutation of the rows in yellow on the left. The first step of the algorithm determines that the element 5 highlighted in blue in the third row is the pivot for the first column. (b) Rows 1 and 3 are swapped and the permutation is updated. The tan column and row represent v and w T. (c) The vector v is replaced by v/5, and the lower right of the matrix is updated with the Schur complement. Lines divide the matrix into three regions: elements of U (above), elements of L (left), and elements of the Schur complement (lower right). (d)–(f) The second step. (g)–(i) The third step. No further changes occur on the fourth (final) step. (j) The LUP decomposition PA = LU.


Because of its triply nested loop structure, LUP-
DECOMPOSITION has a running time of Θ( n 3), which is the same as
that of LU-DECOMPOSITION. Thus, pivoting costs at most a
constant factor in time.
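A direct Python transcription of LUP-DECOMPOSITION (0-based indices, our own function name) makes the in-place bookkeeping concrete:

```python
def lup_decomposition(A):
    """In-place LUP decomposition with partial pivoting.

    Modifies A so that it stores L strictly below the diagonal (L's unit
    diagonal is implicit) and U on and above the diagonal. Returns the
    permutation array pi, where pi[i] is the column of the 1 in row i of
    the permutation matrix P.
    """
    n = len(A)
    pi = list(range(n))
    for k in range(n):
        # Choose as pivot the largest absolute value in column k, rows k..n-1.
        k2 = max(range(k, n), key=lambda i: abs(A[i][k]))
        if A[k2][k] == 0:
            raise ValueError("singular matrix")
        pi[k], pi[k2] = pi[k2], pi[k]
        A[k], A[k2] = A[k2], A[k]              # exchange entire rows k and k2
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                 # element of L
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]   # Schur complement update
    return pi
```

Reading L and U back out of A and multiplying them reproduces the rows of the original matrix in the order given by pi, i.e., PA = LU.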
Exercises
28.1-1
Solve the equation
by using forward substitution.
28.1-2
Find an LU decomposition of the matrix
28.1-3
Solve the equation
by using an LUP decomposition.
28.1-4
Describe the LUP decomposition of a diagonal matrix.
28.1-5
Describe the LUP decomposition of a permutation matrix, and prove
that it is unique.
28.1-6
Show that for all n ≥ 1, there exists a singular n × n matrix that has an LU decomposition.
28.1-7
In LU-DECOMPOSITION, is it necessary to perform the outermost
for loop iteration when k = n? How about in LUP-
DECOMPOSITION?
Although you can use equation (28.3) to solve a system of linear
equations by computing a matrix inverse, in practice you are better off
using more numerically stable techniques, such as LUP decomposition.
Sometimes, however, you really do need to compute a matrix inverse.
This section shows how to use LUP decomposition to compute a matrix
inverse. It also proves that matrix multiplication and computing the
inverse of a matrix are equivalently hard problems, in that (subject to
technical conditions) an algorithm for one can solve the other in the
same asymptotic running time. Thus, you can use Strassen’s algorithm
(see Section 4.2) for matrix multiplication to invert a matrix. Indeed, Strassen’s original paper was motivated by the idea that a set of linear
equations could be solved more quickly than by the usual method.
Computing a matrix inverse from an LUP decomposition
Suppose that you have an LUP decomposition of a matrix A in the form
of three matrices L, U, and P such that PA = LU. Using LUP-SOLVE, you can solve an equation of the form Ax = b in Θ( n 2) time. Since the LUP decomposition depends on A but not b, you can run LUP-SOLVE
on a second set of equations of the form Ax = b′ in Θ( n 2) additional time. In general, once you have the LUP decomposition of A, you can
solve, in Θ( kn 2) time, k versions of the equation Ax = b that differ only in the vector b.
Let’s think of the equation

AX = In,    (28.11)

which defines the matrix X, the inverse of A, as a set of n distinct equations of the form Ax = b. To be precise, let Xi denote the i th
column of X, and recall that the unit vector ei is the i th column of In.
You can then solve equation (28.11) for X by using the LUP
decomposition for A to solve each equation
AXi = ei
separately for Xi. Once you have the LUP decomposition, you can
compute each of the n columns Xi in Θ( n 2) time, and so you can compute X from the LUP decomposition of A in Θ( n 3) time. Since you find the LUP decomposition of A in Θ( n 3) time, you can compute the
inverse A−1 of a matrix A in Θ( n 3) time.
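The column-by-column scheme can be sketched in self-contained Python: decompose once with partial pivoting, then solve AXi = ei for each unit vector ei (the helper name and structure are ours, not the book's pseudocode):

```python
def invert(A):
    """Invert a nonsingular matrix by solving A x = e_i for each column
    of the identity, reusing a single LUP decomposition."""
    n = len(A)
    A = [row[:] for row in A]
    pi = list(range(n))
    for k in range(n):                         # Theta(n^3): LUP-decompose A
        k2 = max(range(k, n), key=lambda i: abs(A[i][k]))
        if A[k2][k] == 0:
            raise ValueError("singular matrix")
        pi[k], pi[k2] = pi[k2], pi[k]
        A[k], A[k2] = A[k2], A[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    cols = []
    for c in range(n):                         # n solves, Theta(n^2) each
        e = [1.0 if r == c else 0.0 for r in range(n)]
        y = [0.0] * n
        for i in range(n):                     # forward substitution
            y[i] = e[pi[i]] - sum(A[i][j] * y[j] for j in range(i))
        x = [0.0] * n
        for i in range(n - 1, -1, -1):         # back substitution
            x[i] = (y[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
        cols.append(x)
    # cols holds the columns of the inverse; transpose into a matrix.
    return [[cols[j][i] for j in range(n)] for i in range(n)]
```

The decomposition is done once, so the n solves dominate only by a constant factor: Θ( n 3) total, as in the analysis above.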
Matrix multiplication and matrix inversion
Now let’s see how the theoretical speedups obtained for matrix
multiplication translate to speedups for matrix inversion. In fact, we’ll
prove something stronger: matrix inversion is equivalent to matrix
multiplication, in the following sense. If M( n) denotes the time to multiply two n × n matrices, then a nonsingular n × n matrix can be inverted in O( M( n)) time. Moreover, if I( n) denotes the time to invert a nonsingular n × n matrix, then two n × n matrices can be multiplied in O( I( n)) time. We prove these results as two separate theorems.
Theorem 28.1 (Multiplication is no harder than inversion)
If an n × n matrix can be inverted in I( n) time, where I( n) = Ω( n 2) and I( n) satisfies the regularity condition I(3 n) = O( I( n)), then two n × n matrices can be multiplied in O( I( n)) time.
Proof Let A and B be n × n matrices. To compute their product C =
AB, define the 3 n × 3 n matrix D by

$$D = \begin{pmatrix} I_n & A & 0 \\ 0 & I_n & B \\ 0 & 0 & I_n \end{pmatrix}.$$

The inverse of D is

$$D^{-1} = \begin{pmatrix} I_n & -A & AB \\ 0 & I_n & -B \\ 0 & 0 & I_n \end{pmatrix},$$

and thus to compute the product AB, just take the upper right n × n submatrix of D−1.
Constructing matrix D takes Θ( n 2) time, which is O( I( n)) from the assumption that I( n) = Ω( n 2), and inverting D takes O( I(3 n)) = O( I( n)) time, by the regularity condition on I( n). We thus have M( n) = O( I( n)).
▪
Note that I( n) satisfies the regularity condition whenever I( n) = Θ( n^c lg^d n) for any constants c > 0 and d ≥ 0.
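The construction in the proof is easy to check numerically. The sketch below (pure Python, our own helper names) builds D and the claimed inverse for small blocks A and B, so that multiplying them confirms D·D−1 = I and that the upper right block of D−1 is AB:

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def block_D(A, B):
    """The 3n x 3n matrix D = [[I, A, 0], [0, I, B], [0, 0, I]]."""
    n = len(A)
    D = [[1.0 if i == j else 0.0 for j in range(3 * n)] for i in range(3 * n)]
    for i in range(n):
        for j in range(n):
            D[i][n + j] = A[i][j]              # block (1, 2) is A
            D[n + i][2 * n + j] = B[i][j]      # block (2, 3) is B
    return D

def block_D_inverse(A, B):
    """The claimed inverse [[I, -A, AB], [0, I, -B], [0, 0, I]]."""
    n = len(A)
    AB = matmul(A, B)
    Dinv = [[1.0 if i == j else 0.0 for j in range(3 * n)] for i in range(3 * n)]
    for i in range(n):
        for j in range(n):
            Dinv[i][n + j] = -A[i][j]
            Dinv[i][2 * n + j] = AB[i][j]      # the product sits up here
            Dinv[n + i][2 * n + j] = -B[i][j]
    return Dinv
```

In other words, one inversion of a 3 n × 3 n matrix hands you the product of two n × n matrices for free.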
The proof that matrix inversion is no harder than matrix
multiplication relies on some properties of symmetric positive-definite
matrices proved in Section 28.3.
Theorem 28.2 (Inversion is no harder than multiplication)
Suppose that two n × n real matrices can be multiplied in M( n) time, where M( n) = Ω( n 2) and M( n) satisfies the following two regularity conditions:
1. M( n + k) = O( M( n)) for any k in the range 0 ≤ k < n, and
2. M( n/2) ≤ cM( n) for some constant c < 1/2.
Then the inverse of any real nonsingular n× n matrix can be computed in
O( M( n)) time.
Proof Let A be an n × n matrix with real-valued entries that is nonsingular. Assume that n is an exact power of 2 (i.e., n = 2^l for some integer l); we’ll see at the end of the proof what to do if n is not an exact power of 2.
For the moment, assume that the n × n matrix A is symmetric and positive-definite. Partition each of A and its inverse A−1 into four n/2 × n/2 submatrices:

$$A = \begin{pmatrix} B & C^{\mathrm T} \\ C & D \end{pmatrix} \qquad\text{and}\qquad A^{-1} = \begin{pmatrix} R & T \\ U & V \end{pmatrix}.$$

Then, if we let

$$S = D - C B^{-1} C^{\mathrm T}$$

be the Schur complement of A with respect to B (we’ll see more about
this form of Schur complement in Section 28.3), we have

$$A^{-1} = \begin{pmatrix} R & T \\ U & V \end{pmatrix} = \begin{pmatrix} B^{-1} + B^{-1} C^{\mathrm T} S^{-1} C B^{-1} & -B^{-1} C^{\mathrm T} S^{-1} \\ -S^{-1} C B^{-1} & S^{-1} \end{pmatrix},$$

since AA−1 = In, as you can verify by performing the matrix
multiplication. Because A is symmetric and positive-definite, Lemmas
28.4 and 28.5 in Section 28.3 imply that B and S are both symmetric and positive-definite. By Lemma 28.3 in Section 28.3, therefore, the inverses B−1 and S−1 exist, and by Exercise D.2-6 on page 1223, B−1
and S−1 are symmetric, so that ( B−1)T = B−1 and ( S−1)T = S−1.
Therefore, to compute the submatrices
R = B−1 + B−1 C T S−1 CB−1,
T = − B−1 C T S−1,
U = − S−1 CB−1, and
V = S−1
of A−1, do the following, where all matrices mentioned are n/2 × n/2:
1. Form the submatrices B, C, C T, and D of A.
2. Recursively compute the inverse B−1 of B.
3. Compute the matrix product W = CB−1, and then compute its
transpose W T, which equals B−1 C T (by Exercise D.1-2 on page
1219 and ( B−1)T = B−1).
4. Compute the matrix product X = WC T, which equals CB−1 C T, and then compute the matrix S = D − X = D − CB−1 C T.
5. Recursively compute the inverse S−1 of S.
6. Compute the matrix product Y = S−1 W, which equals
S−1 CB−1, and then compute its transpose Y T, which equals B−1 C T S−1 (by Exercise D.1-2, ( B−1)T = B−1, and ( S−1)T =
S−1).
7. Compute the matrix product Z = W T Y, which equals
B−1 C T S−1 CB−1.
8. Set R = B−1 + Z.
9. Set T = − Y T.
10. Set U = − Y.
11. Set V = S−1.
Thus, to invert an n× n symmetric positive-definite matrix, invert two
n/2× n/2 matrices in steps 2 and 5; perform four multiplications of n/2 ×
n/2 matrices in steps 3, 4, 6, and 7; plus incur an additional cost of O( n 2) for extracting submatrices from A, inserting submatrices into A−1, and performing a constant number of additions, subtractions, and
transposes on n/2 × n/2 matrices. The running time is given by the recurrence

I( n) = 2 I( n/2) + 4 M( n/2) + O( n 2)
      = 2 I( n/2) + Θ( M( n)).    (28.15)
The second line follows from the assumption that M( n) = Ω( n 2) and from the second regularity condition in the statement of the theorem,
which implies that 4 M( n/2) < 2 M( n). Because M( n) = Ω( n 2), case 3 of

the master theorem (Theorem 4.1) applies to the recurrence (28.15),
giving the O( M( n)) result.
It remains to prove how to obtain the same asymptotic running time
for matrix multiplication as for matrix inversion when A is invertible but
not symmetric and positive-definite. The basic idea is that for any
nonsingular matrix A, the matrix A T A is symmetric (by Exercise D.1-2) and positive-definite (by Theorem D.6 on page 1222). The trick, then, is
to reduce the problem of inverting A to the problem of inverting A T A.
The reduction is based on the observation that when A is an n × n
nonsingular matrix, we have
A−1 = ( A T A)−1 A T,
since (( A T A)−1 A T) A = ( A T A)−1( A T A) = In and a matrix inverse is unique. Therefore, to compute A−1, first multiply A T by A to obtain A T A, then invert the symmetric positive-definite matrix A T A using the above divide-and-conquer algorithm, and finally multiply the result by
A T. Each of these three steps takes O( M( n)) time, and thus any nonsingular matrix with real entries can be inverted in O( M( n)) time.
The above proof assumed that A is an n × n matrix, where n is an exact power of 2. If n is not an exact power of 2, then let k < n be such that n + k is an exact power of 2, and define the ( n + k) × ( n + k) matrix A′ as

$$A' = \begin{pmatrix} A & 0 \\ 0 & I_k \end{pmatrix}.$$

Then the inverse of A′ is

$$(A')^{-1} = \begin{pmatrix} A^{-1} & 0 \\ 0 & I_k \end{pmatrix}.$$
Apply the method of the proof to A′ to compute the inverse of A′, and
take the first n rows and n columns of the result as the desired answer A−1. The first regularity condition on M( n) ensures that enlarging the matrix in this way increases the running time by at most a constant
factor.
The proof of Theorem 28.2 suggests how to solve the equation Ax =
b by using LU decomposition without pivoting, so long as A is nonsingular. Let y = A T b. Multiply both sides of the equation Ax = b by A T, yielding ( A T A) x = A T b = y. This transformation doesn’t affect the solution x, since A T is invertible. Because A T A is symmetric positive-definite, it can be factored by computing an LU decomposition.
Then, use forward and back substitution to solve for x in the equation
( A T A) x = y. Although this method is theoretically correct, in practice the procedure LUP-DECOMPOSITION works much better. LUP
decomposition requires fewer arithmetic operations by a constant
factor, and it has somewhat better numerical properties.
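The reduction in the preceding paragraph can be sketched as follows. This is a toy implementation under the stated assumptions, and, as the text notes, not the method to prefer in practice:

```python
def solve_via_normal_equations(A, b):
    """Solve Ax = b for a nonsingular square A by LU-decomposing the
    symmetric positive-definite matrix A^T A, which needs no pivoting,
    and then doing forward and back substitution."""
    n = len(A)
    # Form M = A^T A and y = A^T b.
    M = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    y = [sum(A[k][i] * b[k] for k in range(n)) for i in range(n)]
    for k in range(n):                         # LU-decompose M in place
        for i in range(k + 1, n):
            M[i][k] /= M[k][k]                 # pivots are nonzero: M is SPD
            for j in range(k + 1, n):
                M[i][j] -= M[i][k] * M[k][j]
    z = [0.0] * n
    for i in range(n):                         # forward substitution: Lz = y
        z[i] = y[i] - sum(M[i][j] * z[j] for j in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):             # back substitution: Ux = z
        x[i] = (z[i] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x
```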
Exercises
28.2-1
Let M( n) be the time to multiply two n × n matrices, and let S( n) denote the time required to square an n × n matrix. Show that multiplying and
squaring matrices have essentially the same difficulty: an M( n)-time matrix-multiplication algorithm implies an O( M( n))-time squaring algorithm, and an S( n)-time squaring algorithm implies an O( S( n))-time matrix-multiplication algorithm.
28.2-2
Let M( n) be the time to multiply two n × n matrices. Show that an M( n)-
time matrix-multiplication algorithm implies an O( M( n))-time LUP-decomposition algorithm. (The LUP decomposition your method
produces need not be the same as the result produced by the LUP-
DECOMPOSITION procedure.)
28.2-3
Let M( n) be the time to multiply two n × n boolean matrices, and let T( n) be the time to find the transitive closure of an n × n boolean matrix. (See Section 23.2. ) Show that an M( n)-time boolean matrix-multiplication algorithm implies an O( M( n) lg n)-time transitive-closure
algorithm, and a T( n)-time transitive-closure algorithm implies an O( T
( n))-time boolean matrix-multiplication algorithm.
28.2-4
Does the matrix-inversion algorithm based on Theorem 28.2 work when
matrix elements are drawn from the field of integers modulo 2? Explain.
★ 28.2-5
Generalize the matrix-inversion algorithm of Theorem 28.2 to handle
matrices of complex numbers, and prove that your generalization works
correctly. ( Hint: Instead of the transpose of A, use the conjugate transpose A*, which you obtain from the transpose of A by replacing every entry with its complex conjugate. Instead of symmetric matrices,
consider Hermitian matrices, which are matrices A such that A = A*.)
28.3 Symmetric positive-definite matrices and least-squares
Symmetric positive-definite matrices have many interesting and
desirable properties. An n × n matrix A is symmetric positive-definite if A
= A T( A is symmetric) and x T Ax > 0 for all n-vectors x ≠ 0 ( A is positive-definite). Symmetric positive-definite matrices are nonsingular,
and an LU decomposition on them will not divide by 0. This section
proves these and several other important properties of symmetric
positive-definite matrices. We’ll also see an interesting application to
curve fitting by a least-squares approximation.
The first property we prove is perhaps the most basic.
Lemma 28.3
Any positive-definite matrix is nonsingular.
Proof Suppose that a matrix A is singular. Then by Corollary D.3 on
page 1221, there exists a nonzero vector x such that Ax = 0. Hence, x T Ax = 0, and A cannot be positive-definite.
▪
The proof that an LU decomposition on a symmetric positive-
definite matrix A won’t divide by 0 is more involved. We begin by
proving properties about certain submatrices of A. Define the k th leading submatrix of A to be the matrix Ak consisting of the intersection of the first k rows and first k columns of A.
Lemma 28.4
If A is a symmetric positive-definite matrix, then every leading
submatrix of A is symmetric and positive-definite.
Proof Since A is symmetric, each leading submatrix Ak is also symmetric. We’ll prove that Ak is positive-definite by contradiction. If Ak is not positive-definite, then there exists a k-vector xk ≠ 0 such that xk T Akxk ≤ 0. Let A be n × n, and partition A as

A = ( Ak    B T
      B     C )          (28.16)

for submatrices B (which is ( n − k) × k) and C (which is ( n − k) × ( n − k)). Define the n-vector x = ( xk T 0 )T, where n − k 0s follow xk. Then we have

x T Ax = xk T Akxk ≤ 0,

which contradicts A being positive-definite.
▪
We now turn to some essential properties of the Schur complement.
Let A be a symmetric positive-definite matrix, and let Ak be a leading k
× k submatrix of A. Partition A once again according to equation (28.16). Equation (28.10) generalizes to define the Schur complement S of A with respect to Ak as

S = C − B Ak−1 B T.          (28.17)

(By Lemma 28.4, Ak is symmetric and positive-definite, and therefore, Ak−1 exists by Lemma 28.3, and S is well defined.) The earlier definition (28.10) of the Schur complement is consistent with equation (28.17) by letting k = 1.
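As a numerical illustration (ours, not the text’s), the Schur complement S = C − B Ak−1 B T can be formed directly from the partition of equation (28.16) and checked for symmetry and positive-definiteness; the example matrix is made up:

```python
import numpy as np

# Sketch: compute the Schur complement of an SPD matrix A with respect
# to its leading k x k submatrix Ak, per equation (28.17).
def schur_complement(A, k):
    Ak, Bt = A[:k, :k], A[:k, k:]    # top row of the partition (28.16)
    B, C = A[k:, :k], A[k:, k:]      # bottom row of the partition
    return C - B @ np.linalg.inv(Ak) @ Bt

M = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])      # symmetric positive-definite
S = schur_complement(M, 1)
assert np.allclose(S, S.T)                 # S is symmetric
assert np.all(np.linalg.eigvalsh(S) > 0)   # S is positive-definite
```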
The next lemma shows that the Schur-complement matrices of
symmetric positive-definite matrices are themselves symmetric and
positive-definite. We used this result in Theorem 28.2, and its corollary
will help prove that LU decomposition works for symmetric positive-
definite matrices.
Lemma 28.5 (Schur complement lemma)
If A is a symmetric positive-definite matrix and Ak is a leading k × k submatrix of A, then the Schur complement S of A with respect to Ak is symmetric and positive-definite.
Proof Because A is symmetric, so is the submatrix C. By Exercise D.2-6 on page 1223, the product B Ak−1 B T is symmetric. Since C and B Ak−1 B T are symmetric, then by Exercise D.1-1 on page 1219, so is S.
It remains to show that S is positive-definite. Consider the partition
of A given in equation (28.16). For any nonzero vector x, we have x T Ax
> 0 by the assumption that A is positive-definite. Let the subvectors y
and z consist of the first k and last n − k elements in x, respectively, and thus they are compatible with Ak and C, respectively. Because Ak−1 exists, we have

x T Ax = y T Aky + 2 z T By + z T Cz
       = ( y + Ak−1 B T z)T Ak( y + Ak−1 B T z) + z T( C − B Ak−1 B T) z.          (28.18)
This last equation, which you can verify by multiplying through,
amounts to “completing the square” of the quadratic form. (See
Exercise 28.3-2.)
Since x T Ax > 0 holds for any nonzero x, pick any nonzero z and then choose y = −Ak−1 B T z, which causes the first term in equation (28.18) to vanish, leaving

z T( C − B Ak−1 B T) z = z T Sz

as the value of the expression. For any z ≠ 0, we therefore have z T Sz = x T Ax > 0, and thus S is positive-definite.
▪
Corollary 28.6
LU decomposition of a symmetric positive-definite matrix never causes
a division by 0.
Proof Let A be an n × n symmetric positive-definite matrix. In fact, we’ll prove a stronger result than the statement of the corollary: every
pivot is strictly positive. The first pivot is a 11. Let e 1 be the length- n unit vector ( 1 0 0 ⋯ 0 )T, so that a 11 = e 1T Ae 1, which is positive because e 1 is
nonzero and A is positive definite. Since the first step of LU
decomposition produces the Schur complement of A with respect to A 1
= ( a 11), Lemma 28.5 implies by induction that all pivots are positive.
▪
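The inductive argument can be observed numerically (an illustrative sketch of ours, not the book’s pseudocode): run LU decomposition without pivoting on a symmetric positive-definite matrix, where each step’s pivot is the upper-left entry of the current Schur complement, and check that every pivot is strictly positive:

```python
import numpy as np

# Sketch: LU decomposition without pivoting on an SPD matrix. At each
# step the pivot is the (0, 0) entry of the current Schur complement,
# which Corollary 28.6 guarantees is strictly positive.
def lu_pivots(A):
    S = A.astype(float).copy()
    pivots = []
    for _ in range(S.shape[0]):
        p = S[0, 0]
        pivots.append(p)
        # Schur complement of S with respect to its 1 x 1 leading submatrix
        S = S[1:, 1:] - np.outer(S[1:, 0], S[0, 1:]) / p
    return pivots

M = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])      # symmetric positive-definite
assert all(p > 0 for p in lu_pivots(M))
```

Note that the pivots produced here (2, 5/2, 18/5) are exactly the ratios det( Ak)/det( Ak−1) of Exercise 28.3-5.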
Least-squares approximation
One important application of symmetric positive-definite matrices arises
in fitting curves to given sets of data points. You are given a set of m
data points
( x 1, y 1), ( x 2, y 2), … , ( xm, ym),
where you know that the yi are subject to measurement errors. You wish
to determine a function F( x) such that the approximation errors

ηi = F( xi) − yi          (28.19)

are small for i = 1, 2, … , m. The form of the function F depends on the problem at hand. Let’s assume that it has the form of a linearly weighted sum

F( x) = c 1 f 1( x) + c 2 f 2( x) + ⋯ + cnfn( x),
where the number n of summands and the specific basis functions fj are chosen based on knowledge of the problem at hand. A common choice
is fj( x) = xj−1, which means that
F( x) = c 1 + c 2 x + c 3 x 2 + ⋯ + cnxn−1
is a polynomial of degree n − 1 in x. Thus, if you are given m data points ( x 1, y 1), ( x 2, y 2), … , ( xm, ym), you need to calculate n coefficients c 1, c 2, … , cn that minimize the approximation errors η 1, η 2, … , ηm.
By choosing n = m, you can calculate each yi exactly in equation (28.19). Such a high-degree polynomial F “fits the noise” as well as the
data, however, and generally gives poor results when used to predict y
for previously unseen values of x. It is usually better to choose n significantly smaller than m and hope that by choosing the coefficients
cj well, you can obtain a function F that finds the significant patterns in the data points without paying undue attention to the noise. Some
theoretical principles exist for choosing n, but they are beyond the scope
of this text. In any case, once you choose a value of n that is less than m, you end up with an overdetermined set of equations whose solution you
wish to approximate. Let’s see how to do so.
Let

A = ( f 1( x 1)   f 2( x 1)   ⋯   fn( x 1)
      f 1( x 2)   f 2( x 2)   ⋯   fn( x 2)
      ⋮           ⋮                ⋮
      f 1( xm)   f 2( xm)   ⋯   fn( xm) )

denote the matrix of values of the basis functions at the given points, that is, aij = fj( xi). Let c = ( ck) denote the desired n-vector of coefficients. Then

Ac = ( F( x 1)   F( x 2)   ⋯   F( xm) )T

is the m-vector of “predicted values” for y. Thus,
η = Ac − y
is the m-vector of approximation errors.
To minimize approximation errors, let’s minimize the norm of the error vector η, which gives a least-squares solution, since

∥ η∥ = ( η12 + η22 + ⋯ + ηm2 )1/2.

Because

∥ η∥2 = ∥ Ac − y∥2 = Σi=1..m ( Σj=1..n aijcj − yi )2,

to minimize ∥ η∥, differentiate ∥ η∥2 with respect to each ck and then set the result to 0:

d∥ η∥2/ dck = Σi=1..m 2 ( Σj=1..n aijcj − yi ) aik = 0.          (28.20)
The n equations (28.20) for k = 1, 2, … , n are equivalent to the single matrix equation
( Ac − y)T A = 0
or, equivalently (using Exercise D.1-2 on page 1219), to
A T( Ac − y) = 0,
which implies

A T Ac = A T y.          (28.21)
In statistics, equation (28.21) is called the normal equation. The matrix
A T A is symmetric by Exercise D.1-2, and if A has full column rank, then by Theorem D.6 on page 1222, A T A is positive-definite as well.
Hence, ( A T A)−1 exists, and the solution to equation (28.21) is

c = (( A T A)−1 A T) y
  = A+ y,          (28.22)

where the matrix A+ = ( A T A)−1 A T is the pseudoinverse of the matrix A. The pseudoinverse naturally generalizes the notion of a matrix
inverse to the case in which A is not square. (Compare equation (28.22)
as the approximate solution to Ac = y with the solution A−1 b as the exact solution to Ax = b.)
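As an aside (an illustrative sketch of ours, with a made-up small matrix), the formula A+ = ( A T A)−1 A T agrees with NumPy’s SVD-based pseudoinverse whenever A has full column rank, and A+ y satisfies the normal equation (28.21):

```python
import numpy as np

# Sketch: the pseudoinverse formula (A^T A)^{-1} A^T versus numpy's
# SVD-based np.linalg.pinv, for a small full-column-rank matrix.
A = np.array([[1.0, -1.0],
              [1.0,  1.0],
              [1.0,  2.0]])
A_plus = np.linalg.inv(A.T @ A) @ A.T
assert np.allclose(A_plus, np.linalg.pinv(A))

# c = A+ y satisfies the normal equation A^T A c = A^T y.
y = np.array([2.0, 1.0, 1.0])
c = A_plus @ y
assert np.allclose(A.T @ A @ c, A.T @ y)
```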
As an example of producing a least-squares fit, suppose that you
have five data points
( x 1, y 1) = (−1, 2),
( x 2, y 2) = (1, 1),
( x 3, y 3) = (2, 1),
( x 4, y 4) = (3, 0),
( x 5, y 5) = (5, 3),
shown as orange dots in Figure 28.3, and you want to fit these points with a quadratic polynomial
F( x) = c 1 + c 2 x + c 3 x 2.
Start with the matrix of basis-function values

A = ( 1   −1    1
      1    1    1
      1    2    4
      1    3    9
      1    5   25 ) ,

whose pseudoinverse is

A+ = (  0.500    0.300    0.200    0.100   −0.100
       −0.388    0.093    0.190    0.193   −0.088
        0.060   −0.036   −0.048   −0.036    0.060 ) .
Figure 28.3 The least-squares fit of a quadratic polynomial to the set of five data points {(−1, 2), (1, 1), (2, 1), (3, 0), (5, 3)}. The orange dots are the data points, and the blue dots are their estimated values predicted by the polynomial F( x) = 1.2 − 0.757 x + 0.214 x 2, the quadratic polynomial that minimizes the sum of the squared errors, plotted in blue. Each orange line shows the error for one data point.
Multiplying y by A+ gives the coefficient vector

c = (  1.200
      −0.757
       0.214 ) ,

which corresponds to the quadratic polynomial
F( x) = 1.200 − 0.757 x + 0.214 x 2
as the closest-fitting quadratic to the given data, in a least-squares sense.
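The worked example above can be reproduced in a few lines (an illustrative sketch, not the book’s code), solving the normal equation (28.21) directly:

```python
import numpy as np

# Reproducing the worked example: fit F(x) = c1 + c2*x + c3*x^2 to the
# five data points by solving the normal equation A^T A c = A^T y.
x = np.array([-1.0, 1.0, 2.0, 3.0, 5.0])
y = np.array([2.0, 1.0, 1.0, 0.0, 3.0])
A = np.column_stack([np.ones_like(x), x, x**2])   # a_ij = f_j(x_i) = x_i^{j-1}
c = np.linalg.solve(A.T @ A, A.T @ y)
assert np.allclose(c, [1.200, -0.757, 0.214], atol=1e-3)
```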
As a practical matter, you would typically solve the normal equation
(28.21) by multiplying y by A T and then finding an LU decomposition
of A T A. If A has full column rank, the matrix A T A is guaranteed to be nonsingular, because it is symmetric and positive-definite. (See Exercise
D.1-2 and Theorem D.6.)
Figure 28.4 A least-squares fit of a curve of the form
c 1 + c 2 x + c 3 x 2 + c 4 sin(2 πx) + c 5 cos(2 πx) for the carbon-dioxide concentrations measured at Mauna Loa, Hawaii, from 1990 to 2019, where x is the number of years elapsed since 1990. This curve is the famous “Keeling curve,”
illustrating curve-fitting to nonpolynomial formulas. The sine and cosine terms allow modeling of seasonal variations in CO2 concentrations. The red curve shows the measured CO2
concentrations. The best fit, shown in black, has the form
352.83 + 1.39 x + 0.02 x 2 + 2.83 sin(2 πx) − 0.94 cos(2 πx).
We close this section with an example in Figure 28.4, illustrating that least-squares fitting applies to nonpolynomial basis functions as well. The curve confirms one
aspect of climate change: that carbon dioxide (CO2) concentrations
have steadily increased over a period of 29 years. Linear and quadratic
terms model the annual increase, and sine and cosine terms model
seasonal variations.
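A fit of this style can be sketched with synthetic data (generated here from the coefficients quoted above; it is not the actual Mauna Loa record):

```python
import numpy as np

# Sketch: least-squares fit with the nonpolynomial basis
# 1, x, x^2, sin(2*pi*x), cos(2*pi*x), in the style of the Keeling-curve
# model. The data is synthetic, built from known coefficients.
x = np.linspace(0.0, 29.0, 349)                  # years since 1990, ~monthly
basis = np.column_stack([np.ones_like(x), x, x**2,
                         np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)])
true_c = np.array([352.83, 1.39, 0.02, 2.83, -0.94])
y = basis @ true_c                               # noiseless synthetic data
c, *_ = np.linalg.lstsq(basis, y, rcond=None)
assert np.allclose(c, true_c, atol=1e-4)         # coefficients recovered
```

With real, noisy measurements the recovered coefficients approximate rather than reproduce the generating ones, but the least-squares machinery is identical.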
Exercises
28.3-1
Prove that every diagonal element of a symmetric positive-definite
matrix is positive.
28.3-2
Let

A = ( a   b
      b   c )

be a 2 × 2 symmetric positive-definite matrix. Prove that
its determinant ac − b 2 is positive by “completing the square” in a manner similar to that used in the proof of Lemma 28.5.
28.3-3
Prove that the maximum element in a symmetric positive-definite matrix
lies on the diagonal.
28.3-4
Prove that the determinant of each leading submatrix of a symmetric
positive-definite matrix is positive.
28.3-5
Let Ak denote the k th leading submatrix of a symmetric positive-definite matrix A. Prove that det( Ak)/det( Ak−1) is the k th pivot during LU decomposition, where, by convention, det( A 0) = 1.
28.3-6
Find the function of the form
F( x) = c 1 + c 2 x lg x + c 3 ex that is the best least-squares fit to the data points
(1, 1), (2, 1), (3, 3), (4, 8).
28.3-7
Show that the pseudoinverse A+ satisfies the following four equations:
AA+ A = A,
A+ AA+ = A+,
( AA+)T = AA+,
( A+ A)T = A+ A.
Problems
28-1 Tridiagonal systems of linear equations
Consider the tridiagonal matrix
a. Find an LU decomposition of A.
b. Solve the equation Ax = ( 1 1 1 1 1 )T by using forward and back
substitution.
c. Find the inverse of A.
d. Show how to solve the equation Ax = b for any n × n symmetric positive-definite, tridiagonal matrix A and any n-vector b in O( n) time by performing an LU decomposition. Argue that any method based
on forming A−1 is asymptotically more expensive in the worst case.
e. Show how to solve the equation Ax = b for any n × n nonsingular, tridiagonal matrix A and any n-vector b in O( n) time by performing an LUP decomposition.
28-2 Splines
A practical method for interpolating a set of points with a curve is to
use cubic splines. You are given a set {( xi, yi) : i = 0, 1, … , n} of n + 1
point-value pairs, where x 0 < x 1 < ⋯ < xn. Your goal is to fit a piecewise-cubic curve (spline) f( x) to the points. That is, the curve f( x) is made up of n cubic polynomials fi( x) = ai + bix + cix 2 + dix 3 for i = 0, 1, … , n − 1, where if x falls in the range xi ≤ x ≤ xi+1, then the value of the curve is given by f( x) = fi( x − xi). The points xi at which the cubic
polynomials are “pasted” together are called knots. For simplicity,
assume that xi = i for i = 0, 1, … , n.
To ensure continuity of f( x), require that
f( xi) = fi(0) = yi,
f( xi+1) = fi(1) = yi+1
for i = 0, 1, … , n − 1. To ensure that f( x) is sufficiently smooth, also require the first derivative to be continuous at each knot:
f ′( xi+1) = fi′(1) = fi+1′(0)

for i = 0, 1, … , n − 2.
a. Suppose that for i = 0, 1, … , n, in addition to the point-value pairs
{( xi, yi)}, you are also given the first derivative Di = f′( xi) at each knot. Express each coefficient ai, bi, ci, and di in terms of the values yi, yi+1, Di, and Di+1. (Remember that xi = i.) How quickly can you compute the 4 n coefficients from the point-value pairs and first
derivatives?