…, ki−1: no keys at all. Bear in mind, however, that subtrees also
contain dummy keys. We adopt the convention that a subtree
containing keys ki, …, ki−1 has no actual keys but does contain the single dummy key di−1. Symmetrically, if you select kj as the root, then kj’s right subtree contains the keys kj+1, …, kj. This right subtree contains no actual keys, but it does contain the dummy key dj.
Step 2: A recursive solution

To define the value of an optimal solution recursively, the subproblem
domain is finding an optimal binary search tree containing the keys ki,
…, kj, where i ≥ 1, j ≤ n, and j ≥ i − 1. (When j = i − 1, there is just the dummy key di−1, but no actual keys.) Let e[ i, j] denote the expected cost of searching an optimal binary search tree containing the keys ki, …, kj.
Your goal is to compute e[1, n], the expected cost of searching an optimal binary search tree for all the actual and dummy keys.
The easy case occurs when j = i − 1. Then the subproblem consists of
just the dummy key di−1. The expected search cost is e[ i, i − 1] = qi−1.
When j ≥ i, you need to select a root kr from among ki, …, kj and then make an optimal binary search tree with keys ki, …, kr−1 as its left subtree and an optimal binary search tree with keys kr+1, …, kj as its
right subtree. What happens to the expected search cost of a subtree
when it becomes a subtree of a node? The depth of each node in the
subtree increases by 1. By equation (14.11), the expected search cost of
this subtree increases by the sum of all the probabilities in the subtree.
For a subtree with keys ki, …, kj, denote this sum of probabilities as

w(i, j) = Σ_{l=i}^{j} pl + Σ_{l=i−1}^{j} ql.    (14.12)
Thus, if kr is the root of an optimal subtree containing keys ki, …, kj, we have
e[ i, j] = pr + ( e[ i, r − 1] + w( i, r − 1)) + ( e[ r + 1, j] + w( r + 1, j)).
Noting that

w(i, j) = w(i, r − 1) + pr + w(r + 1, j),

we rewrite e[i, j] as

e[i, j] = e[i, r − 1] + e[r + 1, j] + w(i, j).    (14.13)
The recursive equation (14.13) assumes that you know which node kr to use as the root. Of course, you choose the root that gives the lowest expected search cost, giving the final recursive formulation:

e[i, j] = qi−1                                                  if j = i − 1,
e[i, j] = min {e[i, r − 1] + e[r + 1, j] + w(i, j) : i ≤ r ≤ j}   if i ≤ j.    (14.14)
The e[ i, j] values give the expected search costs in optimal binary search trees. To help keep track of the structure of optimal binary
search trees, define root[ i, j], for 1 ≤ i ≤ j ≤ n, to be the index r for which kr is the root of an optimal binary search tree containing keys ki, …, kj.
Although we’ll see how to compute the values of root[ i, j], the construction of an optimal binary search tree from these values is left as
Exercise 14.5-1.
Step 3: Computing the expected search cost of an optimal binary search
tree
At this point, you may have noticed some similarities between our
characterizations of optimal binary search trees and matrix-chain
multiplication. For both problem domains, the subproblems consist of
contiguous index subranges. A direct, recursive implementation of
equation (14.14) would be just as inefficient as a direct, recursive matrix-
chain multiplication algorithm. Instead, you can store the e[ i, j] values in a table e[1 : n + 1, 0 : n]. The first index needs to run to n + 1 rather than n because in order to have a subtree containing only the dummy
key dn, you need to compute and store e[ n + 1, n]. The second index needs to start from 0 because in order to have a subtree containing only
the dummy key d 0, you need to compute and store e[1, 0]. Only the entries e[ i, j] for which j ≥ i − 1 are filled in. The table root[ i, j] records the root of the subtree containing keys ki, …, kj and uses only the entries for which 1 ≤ i ≤ j ≤ n.
One other table makes the dynamic-programming algorithm a little
faster. Instead of computing the value of w( i, j) from scratch every time you compute e[ i, j], which would take Θ( j − i) additions, store these values in a table w[1 : n + 1, 0 : n]. For the base case, compute w[ i, i − 1]
= qi−1 for 1 ≤ i ≤ n + 1. For j ≥ i, compute

w[i, j] = w[i, j − 1] + pj + qj.    (14.15)

Thus, you can compute the Θ(n2) values of w[i, j] in Θ(1) time each.
The OPTIMAL-BST procedure on the next page takes as inputs the
probabilities p 1, …, pn and q 0, …, qn and the size n, and it returns the tables e and root. From the description above and the similarity to the MATRIX-CHAIN-ORDER procedure in Section 14.2, you should find
the operation of this procedure to be fairly straightforward. The for
loop of lines 2–4 initializes the values of e[i, i − 1] and w[i, i − 1]. Then the for loop of lines 5–14 uses the recurrences (14.14) and (14.15) to
compute e[ i, j] and w[ i, j] for all 1 ≤ i ≤ j ≤ n. In the first iteration, when l
= 1, the loop computes e[ i, i] and w[ i, i] for i = 1, 2, …, n. The second iteration, with l = 2, computes e[ i, i + 1] and w[ i, i + 1] for i = 1, 2, …, n
− 1, and so on. The innermost for loop, in lines 10–14, tries each
candidate index r to determine which key kr to use as the root of an optimal binary search tree containing keys ki, …, kj. This for loop saves the current value of the index r in root[ i, j] whenever it finds a better key to use as the root.
OPTIMAL-BST(p, q, n)
 1  let e[1 : n + 1, 0 : n], w[1 : n + 1, 0 : n], and root[1 : n, 1 : n] be new tables
 2  for i = 1 to n + 1                           // base cases
 3      e[i, i − 1] = qi−1                       // equation (14.14)
 4      w[i, i − 1] = qi−1
 5  for l = 1 to n
 6      for i = 1 to n − l + 1
 7          j = i + l − 1
 8          e[i, j] = ∞
 9          w[i, j] = w[i, j − 1] + pj + qj      // equation (14.15)
10          for r = i to j                       // try all possible roots r
11              t = e[i, r − 1] + e[r + 1, j] + w[i, j]   // equation (14.14)
12              if t < e[i, j]                   // new minimum?
13                  e[i, j] = t
14                  root[i, j] = r
15  return e and root
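For readers who want to run the procedure, here is a direct Python transcription (a sketch: Python lists are 0-indexed, so each table is padded with unused slots to keep the pseudocode's 1-indexed subscripts):

```python
import math

def optimal_bst(p, q, n):
    """Bottom-up version of OPTIMAL-BST. p[1..n] and q[0..n] hold the
    key and dummy-key probabilities (p[0] is an unused placeholder)."""
    e = [[0.0] * (n + 1) for _ in range(n + 2)]     # e[1..n+1][0..n]
    w = [[0.0] * (n + 1) for _ in range(n + 2)]
    root = [[0] * (n + 1) for _ in range(n + 1)]    # root[1..n][1..n]
    for i in range(1, n + 2):            # base cases: dummy-key subtrees
        e[i][i - 1] = q[i - 1]
        w[i][i - 1] = q[i - 1]
    for l in range(1, n + 1):            # l = number of keys in subtree
        for i in range(1, n - l + 2):
            j = i + l - 1
            e[i][j] = math.inf
            w[i][j] = w[i][j - 1] + p[j] + q[j]     # equation (14.15)
            for r in range(i, j + 1):    # try each candidate root k_r
                t = e[i][r - 1] + e[r + 1][j] + w[i][j]
                if t < e[i][j]:
                    e[i][j] = t
                    root[i][j] = r
    return e, root

# The key distribution of Figure 14.9: p_1..p_5 and q_0..q_5.
p = [0.0, 0.15, 0.10, 0.05, 0.10, 0.20]
q = [0.05, 0.10, 0.05, 0.05, 0.05, 0.10]
e, root = optimal_bst(p, q, 5)
print(round(e[1][5], 2))   # → 2.75, the expected cost shown in Figure 14.10
```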
Figure 14.10 shows the tables e[ i, j], w[ i, j], and root[ i, j] computed by
the procedure OPTIMAL-BST on the key distribution shown in Figure
14.9. As in the matrix-chain multiplication example of Figure 14.5, the
tables are rotated to make the diagonals run horizontally. OPTIMAL-
BST computes the rows from bottom to top and from left to right
within each row.
The OPTIMAL-BST procedure takes Θ( n 3) time, just like
MATRIX-CHAIN-ORDER. Its running time is O( n 3), since its for
loops are nested three deep and each loop index takes on at most n
values. The loop indices in OPTIMAL-BST do not have exactly the
same bounds as those in MATRIX-CHAIN-ORDER, but they are
within at most 1 in all directions. Thus, like MATRIX-CHAIN-
ORDER, the OPTIMAL-BST procedure takes Ω( n 3) time.
Figure 14.10 The tables e[ i, j], w[ i, j], and root[ i, j] computed by OPTIMAL-BST on the key distribution shown in Figure 14.9. The tables are rotated so that the diagonals run horizontally.
Exercises
14.5-1
Write pseudocode for the procedure CONSTRUCT-OPTIMAL-
BST( root, n) which, given the table root[1 : n, 1 : n], outputs the
structure of an optimal binary search tree. For the example in Figure
14.10, your procedure should print out the structure
k 2 is the root
k 1 is the left child of k 2
d 0 is the left child of k 1
d 1 is the right child of k 1
k 5 is the right child of k 2
k 4 is the left child of k 5
k 3 is the left child of k 4
d 2 is the left child of k 3
d 3 is the right child of k 3
d 4 is the right child of k 4
d 5 is the right child of k 5
corresponding to the optimal binary search tree shown in Figure 14.9(b).
14.5-2
Determine the cost and structure of an optimal binary search tree for a
set of n = 7 keys with the following probabilities:
 i  |   0     1     2     3     4     5     6     7
 pi |       0.04  0.06  0.08  0.02  0.10  0.12  0.14
 qi | 0.06  0.06  0.06  0.06  0.05  0.05  0.05  0.05
14.5-3
Suppose that instead of maintaining the table w[ i, j], you computed the value of w( i, j) directly from equation (14.12) in line 9 of OPTIMAL-BST and used this computed value in line 11. How would this change
affect the asymptotic running time of OPTIMAL-BST?
★ 14.5-4
Knuth [264] has shown that there are always roots of optimal subtrees such that root[ i, j − 1] ≤ root[ i, j] ≤ root[ i + 1, j] for all 1 ≤ i < j ≤ n. Use this fact to modify the OPTIMAL-BST procedure to run in Θ( n 2) time.
Problems
14-1 Longest simple path in a directed acyclic graph
You are given a directed acyclic graph G = ( V, E) with real-valued edge weights and two distinguished vertices s and t. The weight of a path is the sum of the weights of the edges in the path. Describe a dynamic-
programming approach for finding a longest weighted simple path from
s to t. What is the running time of your algorithm?
14-2 Longest palindrome subsequence
A palindrome is a nonempty string over some alphabet that reads the
same forward and backward. Examples of palindromes are all strings of
length 1, civic, racecar, and aibohphobia (fear of palindromes).
Give an efficient algorithm to find the longest palindrome that is a
subsequence of a given input string. For example, given the input
character, your algorithm should return carac. What is the running
time of your algorithm?
14-3 Bitonic euclidean traveling-salesperson problem
In the euclidean traveling-salesperson problem, you are given a set of n
points in the plane, and your goal is to find the shortest closed tour that
connects all n points.
Figure 14.11 Seven points in the plane, shown on a unit grid. (a) The shortest closed tour, with length approximately 24.89. This tour is not bitonic. (b) The shortest bitonic tour for the same set of points. Its length is approximately 25.58.
Figure 14.11(a) shows the solution to a 7-point problem. The general problem is NP-hard, and its solution is therefore believed to require
more than polynomial time (see Chapter 34).
J. L. Bentley has suggested simplifying the problem by considering
only bitonic tours, that is, tours that start at the leftmost point, go strictly rightward to the rightmost point, and then go strictly leftward
back to the starting point. Figure 14.11(b) shows the shortest bitonic
tour of the same 7 points. In this case, a polynomial-time algorithm is
possible.
Describe an O( n 2)-time algorithm for determining an optimal
bitonic tour. You may assume that no two points have the same x-
coordinate and that all operations on real numbers take unit time.
( Hint: Scan left to right, maintaining optimal possibilities for the two
parts of the tour.)
14-4 Printing neatly
Consider the problem of neatly printing a paragraph with a
monospaced font (all characters having the same width). The input text
is a sequence of n words of lengths l 1, l 2, …, ln, measured in characters, which are to be printed neatly on a number of lines that hold a
maximum of M characters each. No word exceeds the line length, so
that li ≤ M for i = 1, 2, …, n. The criterion of “neatness” is as follows. If a given line contains words i through j, where i ≤ j, and exactly one space appears between words, then the number of extra space characters
at the end of the line is M − j + i − Σ_{k=i}^{j} lk, which must be nonnegative
so that the words fit on the line. The goal is to minimize the sum, over
all lines except the last, of the cubes of the numbers of extra space
characters at the ends of lines. Give a dynamic-programming algorithm
to print a paragraph of n words neatly. Analyze the running time and
space requirements of your algorithm.
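The extra-space expression can be checked with a couple of lines (a sketch using made-up word lengths; `extra_spaces` is a hypothetical helper, not part of the problem statement):

```python
def extra_spaces(lengths, i, j, M):
    # Extra spaces left when words i..j (1-indexed) share one line of
    # width M: M - j + i - (l_i + ... + l_j). Negative => doesn't fit.
    return M - j + i - sum(lengths[i - 1:j])

# Hypothetical input: M = 12, word lengths 4, 3, 2, 5.
lengths = [4, 3, 2, 5]
print(extra_spaces(lengths, 1, 2, 12))   # "xxxx xxx" leaves 4 extras
print(extra_spaces(lengths, 1, 4, 12))   # all four words do not fit: -5
```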
14-5 Edit distance
In order to transform a source string of text x[1 : m] to a target string y[1 : n], you can perform various transformation operations. The goal is, given x and y, to produce a series of transformations that changes x to y. An array z—assumed to be large enough to hold all the characters it
needs—holds the intermediate results. Initially, z is empty, and at
termination, you should have z[ j] = y[ j] for j = 1, 2, …, n. The procedure for solving this problem maintains current indices i into x and j into z, and the operations are allowed to alter z and these indices. Initially, i = j
= 1. Every character in x must be examined during the transformation,
which means that at the end of the sequence of transformation operations, i = m + 1.
You may choose from among six transformation operations, each of
which has a constant cost that depends on the operation:
Copy a character from x to z by setting z[ j] = x[ i] and then incrementing both i and j. This operation examines x[ i] and has cost QC.
Replace a character from x by another character c, by setting z[ j] = c, and then incrementing both i and j. This operation examines x[ i] and has cost QR.
Delete a character from x by incrementing i but leaving j alone. This operation examines x[ i] and has cost QD.
Insert the character c into z by setting z[ j] = c and then incrementing j, but leaving i alone. This operation examines no characters of x and has cost QI.
Twiddle (i.e., exchange) the next two characters by copying them from x
to z but in the opposite order: setting z[ j] = x[ i + 1] and z[ j + 1] = x[ i], and then setting i = i + 2 and j = j + 2. This operation examines x[ i]
and x[ i + 1] and has cost QT.
Kill the remainder of x by setting i = m + 1. This operation examines all characters in x that have not yet been examined. This operation, if
performed, must be the final operation. It has cost QK.
Figure 14.12 gives one way to transform the source string
algorithm to the target string altruistic. Several other sequences
of transformation operations can transform algorithm to
altruistic.
Assume that QC < QD + QI and QR < QD + QI, since otherwise, the copy and replace operations would not be used. The cost of a given
sequence of transformation operations is the sum of the costs of the
individual operations in the sequence. For the sequence above, the cost
of transforming algorithm to altruistic is 3 QC + QR + QD +
4 QI + QT + QK.
a. Given two sequences x[1 : m] and y[1 : n] and the costs of the transformation operations, the edit distance from x to y is the cost of the least expensive operation sequence that transforms x to y.
Describe a dynamic-programming algorithm that finds the edit
distance from x[1 : m] to y[1 : n] and prints an optimal operation sequence. Analyze the running time and space requirements of your
algorithm.
Figure 14.12 A sequence of operations that transforms the source string algorithm to the target string altruistic. The underlined characters are x[ i] and z[ j] after the operation.
The edit-distance problem generalizes the problem of aligning two
DNA sequences (see, for example, Setubal and Meidanis [405, Section
3.2]). There are several methods for measuring the similarity of two
DNA sequences by aligning them. One such method to align two
sequences x and y consists of inserting spaces at arbitrary locations in the two sequences (including at either end) so that the resulting
sequences x′ and y′ have the same length but do not have a space in the same position (i.e., for no position j are both x′[ j] and y′[ j] a space).
Then we assign a “score” to each position. Position j receives a score as
follows:
+1 if x′[ j] = y′[ j] and neither is a space,
−1 if x′[ j] ≠ y′[ j] and neither is a space,
−2 if either x′[ j] or y′[ j] is a space.
The score for the alignment is the sum of the scores of the individual positions. For example, given the sequences x = GATCGGCAT and y =
CAATGTGAATC, one alignment is
G ATCG GCAT
CAAT GTGAATC
-*++*+*+-++*
A + under a position indicates a score of +1 for that position, a -
indicates a score of −1, and a * indicates a score of −2, so that this
alignment has a total score of 6 · 1 − 2 · 1 − 4 · 2 = −4.
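The three scoring rules translate directly into a small helper (a sketch; the example strings below are made up and are not the alignment above):

```python
def alignment_score(xp, yp):
    # xp and yp: equal-length aligned strings, with ' ' marking a space.
    # By assumption, no position holds a space in both strings.
    score = 0
    for a, b in zip(xp, yp):
        if a == ' ' or b == ' ':
            score -= 2     # a space in either sequence
        elif a == b:
            score += 1     # matching characters
        else:
            score -= 1     # mismatched characters
    return score

print(alignment_score("GAT C", "GTTAC"))   # +1 -1 +1 -2 +1 = 0
```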
b. Explain how to cast the problem of finding an optimal alignment as
an edit-distance problem using a subset of the transformation
operations copy, replace, delete, insert, twiddle, and kill.
14-6 Planning a company party
Professor Blutarsky is consulting for the president of a corporation that
is planning a company party. The company has a hierarchical structure,
that is, the supervisor relation forms a tree rooted at the president. The
human resources department has ranked each employee with a
conviviality rating, which is a real number. In order to make the party
fun for all attendees, the president does not want both an employee and
his or her immediate supervisor to attend.
Professor Blutarsky is given the tree that describes the structure of
the corporation, using the left-child, right-sibling representation
described in Section 10.3. Each node of the tree holds, in addition to the pointers, the name of an employee and that employee’s conviviality
ranking. Describe an algorithm to make up a guest list that maximizes
the sum of the conviviality ratings of the guests. Analyze the running
time of your algorithm.
14-7 Viterbi algorithm
Dynamic programming on a directed graph can play a part in speech
recognition. A directed graph G = ( V, E) with labeled edges forms a formal model of a person speaking a restricted language. Each edge ( u,
v) ∈ E is labeled with a sound σ( u, v) from a finite set Σ of sounds. Each
directed path in the graph starting from a distinguished vertex v 0 ∈ V
corresponds to a possible sequence of sounds produced by the model,
with the label of a path being the concatenation of the labels of the
edges on that path.
a. Describe an efficient algorithm that, given an edge-labeled directed
graph G with distinguished vertex v 0 and a sequence s = 〈 σ 1, σ 2, …, σk〉 of sounds from Σ, returns a path in G that begins at v 0 and has s as its label, if any such path exists. Otherwise, the algorithm should
return NO-SUCH-PATH. Analyze the running time of your
algorithm. ( Hint: You may find concepts from Chapter 20 useful.)

Now suppose that every edge ( u, v) ∈ E has an associated nonnegative probability p( u, v) of being traversed, so that the corresponding sound is produced. The sum of the probabilities of the edges leaving any vertex
equals 1. The probability of a path is defined to be the product of the
probabilities of its edges. Think of the probability of a path beginning at
vertex v 0 as the probability that a “random walk” beginning at v 0
follows the specified path, where the edge leaving a vertex u is taken randomly, according to the probabilities of the available edges leaving u.
b. Extend your answer to part (a) so that if a path is returned, it is a
most probable path starting at vertex v 0 and having label s. Analyze the running time of your algorithm.
14-8 Image compression by seam carving
Suppose that you are given a color picture consisting of an m× n array
A[1 : m, 1 : n] of pixels, where each pixel specifies a triple of red, green, and blue (RGB) intensities. You want to compress this picture slightly,
by removing one pixel from each of the m rows, so that the whole
picture becomes one pixel narrower. To avoid incongruous visual effects,
however, the pixels removed in two adjacent rows must lie in either the
same column or adjacent columns. In this way, the pixels removed form
a “seam” from the top row to the bottom row, where successive pixels in
the seam are adjacent vertically or diagonally.
a. Show that the number of such possible seams grows at least exponentially in m, assuming that n > 1.
b. Suppose now that along with each pixel A[ i, j], you are given a real-valued disruption measure d[ i, j], indicating how disruptive it would be to remove pixel A[ i, j]. Intuitively, the lower a pixel’s disruption measure, the more similar the pixel is to its neighbors. Define the
disruption measure of a seam as the sum of the disruption measures
of its pixels.
Give an algorithm to find a seam with the lowest disruption measure.
How efficient is your algorithm?
14-9 Breaking a string
A certain string-processing programming language allows you to break
a string into two pieces. Because this operation copies the string, it costs
n time units to break a string of n characters into two pieces. Suppose that you want to break a string into many pieces. The order in which the
breaks occur can affect the total amount of time used. For example,
suppose that you want to break a 20-character string after characters 2,
8, and 10 (numbering the characters in ascending order from the left-
hand end, starting from 1). If you program the breaks to occur in left-
to-right order, then the first break costs 20 time units, the second break
costs 18 time units (breaking the string from characters 3 to 20 at
character 8), and the third break costs 12 time units, totaling 50 time
units. If you program the breaks to occur in right-to-left order, however,
then the first break costs 20 time units, the second break costs 10 time
units, and the third break costs 8 time units, totaling 38 time units. In
yet another order, you could break first at 8 (costing 20), then break the
left piece at 2 (costing another 8), and finally the right piece at 10
(costing 12), for a total cost of 40.
Design an algorithm that, given the numbers of characters after
which to break, determines a least-cost way to sequence those breaks.
More formally, given an array L[1 : m] containing the break points for a string of n characters, compute the lowest cost for a sequence of breaks,
along with a sequence of breaks that achieves this cost.
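The arithmetic in the 20-character example can be verified by simulating the cost model directly (a sketch of the cost model only, not a solution to the problem):

```python
def break_cost(n, order):
    """Total cost of breaking an n-character string at the given break
    points, in the given order (break point b cuts after the b-th
    character of the original string)."""
    pieces = [(1, n)]                      # (start, end) of each piece
    total = 0
    for b in order:
        for k, (s, e) in enumerate(pieces):
            if s <= b < e:                 # piece that contains the cut
                total += e - s + 1         # cost = length of that piece
                pieces[k:k + 1] = [(s, b), (b + 1, e)]
                break
    return total

print(break_cost(20, [2, 8, 10]))   # left to right:  20 + 18 + 12 = 50
print(break_cost(20, [10, 8, 2]))   # right to left:  20 + 10 + 8  = 38
print(break_cost(20, [8, 2, 10]))   # middle first:   20 + 8 + 12  = 40
```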
14-10 Planning an investment strategy
Your knowledge of algorithms helps you obtain an exciting job with a
hot startup, along with a $10,000 signing bonus. You decide to invest
this money with the goal of maximizing your return at the end of 10
years. You decide to use your investment manager, G. I. Luvcache, to
manage your signing bonus. The company that Luvcache works with
requires you to observe the following rules. It offers n different
investments, numbered 1 through n. In each year j, investment i provides a return rate of rij. In other words, if you invest d dollars in investment i in year j, then at the end of year j, you have drij dollars. The return rates are guaranteed, that is, you are given all the return rates for the next 10
years for each investment. You make investment decisions only once per
year. At the end of each year, you can leave the money made in the
previous year in the same investments, or you can shift money to other
investments, by either shifting money between existing investments or
moving money to a new investment. If you do not move your money
between two consecutive years, you pay a fee of f 1 dollars, whereas if
you switch your money, you pay a fee of f 2 dollars, where f 2 > f 1. You pay the fee once per year at the end of the year, and it is the same
amount, f 2, whether you move money in and out of only one
investment, or in and out of many investments.
a. The problem, as stated, allows you to invest your money in multiple
investments in each year. Prove that there exists an optimal investment
strategy that, in each year, puts all the money into a single investment.
(Recall that an optimal investment strategy maximizes the amount of
money after 10 years and is not concerned with any other objectives,
such as minimizing risk.)
b. Prove that the problem of planning your optimal investment strategy
exhibits optimal substructure.
c. Design an algorithm that plans your optimal investment strategy.
What is the running time of your algorithm?
d. Suppose that Luvcache’s company imposes the additional restriction
that, at any point, you can have no more than $15,000 in any one
investment. Show that the problem of maximizing your income at the
end of 10 years no longer exhibits optimal substructure.
14-11 Inventory planning
The Rinky Dink Company makes machines that resurface ice rinks. The
demand for such products varies from month to month, and so the
company needs to develop a strategy to plan its manufacturing given
the fluctuating, but predictable, demand. The company wishes to design
a plan for the next n months. For each month i, the company knows the
demand di, that is, the number of machines that it will sell. Let D = d1 + d2 + ⋯ + dn be the total demand over the next n months. The company keeps a full-time staff who provide labor to manufacture up to m machines per
month. If the company needs to make more than m machines in a given
month, it can hire additional, part-time labor, at a cost that works out
to c dollars per machine. Furthermore, if the company is holding any
unsold machines at the end of a month, it must pay inventory costs. The
company can hold up to D machines, with the cost for holding j
machines given as a function h( j) for j = 1, 2, …, D that monotonically increases with j.
Give an algorithm that calculates a plan for the company that
minimizes its costs while fulfilling all the demand. The running time
should be polynomial in n and D.
14-12 Signing free-agent baseball players
Suppose that you are the general manager for a major-league baseball
team. During the off-season, you need to sign some free-agent players
for your team. The team owner has given you a budget of $ X to spend
on free agents. You are allowed to spend less than $ X, but the owner will fire you if you spend any more than $ X.
You are considering N different positions, and for each position, P
free-agent players who play that position are available.10 Because you do not want to overload your roster with too many players at any
position, for each position you may sign at most one free agent who
plays that position. (If you do not sign any players at a particular position, then you plan to stick with the players you already have at that
position.)
To determine how valuable a player is going to be, you decide to use
a sabermetric statistic11 known as “WAR,” or “wins above
replacement.” A player with a higher WAR is more valuable than a
player with a lower WAR. It is not necessarily more expensive to sign a
player with a higher WAR than a player with a lower WAR, because
factors other than a player’s value determine how much it costs to sign
them.
For each available free-agent player p, you have three pieces of
information:
the player’s position,
p.cost, the amount of money it costs to sign the player, and
p.war, the player’s WAR.
Devise an algorithm that maximizes the total WAR of the players
you sign while spending no more than $ X. You may assume that each
player signs for a multiple of $100,000. Your algorithm should output
the total WAR of the players you sign, the total amount of money you
spend, and a list of which players you sign. Analyze the running time
and space requirement of your algorithm.
Chapter notes
Bellman [44] began the systematic study of dynamic programming in 1955, publishing a book about it in 1957. The word “programming,”
both here and in linear programming, refers to using a tabular solution
method. Although optimization techniques incorporating elements of
dynamic programming were known earlier, Bellman provided the area
with a solid mathematical basis.
Galil and Park [172] classify dynamic-programming algorithms according to the size of the table and the number of other table entries
each entry depends on. They call a dynamic-programming algorithm
tD/ eD if its table size is O( nt) and each entry depends on O( ne) other entries. For example, the matrix-chain multiplication algorithm in
Section 14.2 is 2 D/1 D, and the longest-common-subsequence algorithm in Section 14.4 is 2 D/0 D.
The MATRIX-CHAIN-ORDER algorithm on page 378 is by
Muraoka and Kuck [339]. Hu and Shing [230, 231] give an O( n lg n)-
time algorithm for the matrix-chain multiplication problem.
The O( mn)-time algorithm for the longest-common-subsequence
problem appears to be a folk algorithm. Knuth [95] posed the question of whether subquadratic algorithms for the LCS problem exist. Masek
and Paterson [316] answered this question in the affirmative by giving an algorithm that runs in O( mn/lg n) time, where n ≤ m and the sequences are drawn from a set of bounded size. For the special case in
which no element appears more than once in an input sequence,
Szymanski [425] shows how to solve the problem in O(( n + m) lg( n +
m)) time. Many of these results extend to the problem of computing
string edit distances (Problem 14-5).
An early paper on variable-length binary encodings by Gilbert and
Moore [181], which had applications to constructing optimal binary search trees for the case in which all probabilities pi are 0, contains an
O( n 3)-time algorithm. Aho, Hopcroft, and Ullman [5] present the algorithm from Section 14.5. Splay trees [418], which modify the tree in response to the search queries, come within a constant factor of the
optimal bounds without being initialized with the frequencies. Exercise
14.5-4 is due to Knuth [264]. Hu and Tucker [232] devised an algorithm for the case in which all probabilities pi are 0 that uses O( n 2) time and O( n) space. Subsequently, Knuth [261] reduced the time to O( n lg n).
Problem 14-8 is due to Avidan and Shamir [30], who have posted on
the web a wonderful video illustrating this image-compression
technique.
1 If pieces are required to be cut in order of monotonically increasing size, there are fewer ways
to consider. For n = 4, only 5 such ways are possible: parts (a), (b), (c), (e), and (h) in Figure

14.2. The number of ways is called the partition function, which is approximately equal to e^{π√(2n/3)}/(4n√3). This quantity is less than 2^{n−1}, but still much greater than any polynomial in n.
We won’t pursue this line of inquiry further, however.
2 The technical term “memoization” is not a misspelling of “memorization.” The word
“memoization” comes from “memo,” since the technique consists of recording a value to be looked up later.
3 None of the three methods from Sections 4.1 and 4.2 can be used directly, because they apply only to square matrices.
4 The term (n choose 2) counts all pairs in which i < j. Because i and j may be equal, we need to add in the n term.
5 We use the term “unweighted” to distinguish this problem from that of finding shortest paths with weighted edges, which we shall see in Chapters 22 and 23. You can use the breadth-first search technique of Chapter 20 to solve the unweighted problem.
6 It may seem strange that dynamic programming relies on subproblems being both independent and overlapping. Although these requirements may sound contradictory, they describe two different notions, rather than two points on the same axis. Two subproblems of the same problem are independent if they do not share resources. Two subproblems are overlapping if they are really the same subproblem that occurs as a subproblem of different problems.
7 This approach presupposes that you know the set of all possible subproblem parameters and that you have established the relationship between table positions and subproblems. Another, more general, approach is to memoize by using hashing with the subproblem parameters as keys.
8 If the subject of the text is ancient Rome, you might want naumachia to appear near the root.
9 Yes, naumachia has a Latvian counterpart: nomačija.
10 Although there are nine positions on a baseball team, N is not necessarily equal to 9 because some general managers have particular ways of thinking about positions. For example, a general manager might consider right-handed pitchers and left-handed pitchers to be separate
“positions,” as well as starting pitchers, long relief pitchers (relief pitchers who can pitch several innings), and short relief pitchers (relief pitchers who normally pitch at most only one inning).
11 Sabermetrics is the application of statistical analysis to baseball records. It provides several ways to compare the relative values of individual players.
Algorithms for optimization problems typically go through a sequence
of steps, with a set of choices at each step. For many optimization
problems, using dynamic programming to determine the best choices is
overkill, and simpler, more efficient algorithms will do. A greedy
algorithm always makes the choice that looks best at the moment. That
is, it makes a locally optimal choice in the hope that this choice leads to
a globally optimal solution. This chapter explores optimization
problems for which greedy algorithms provide optimal solutions. Before
reading this chapter, you should read about dynamic programming in
Chapter 14, particularly Section 14.3.
Greedy algorithms do not always yield optimal solutions, but for
many problems they do. We first examine, in Section 15.1, a simple but nontrivial problem, the activity-selection problem, for which a greedy
algorithm efficiently computes an optimal solution. We’ll arrive at the
greedy algorithm by first considering a dynamic-programming approach
and then showing that an optimal solution can result from always
making greedy choices. Section 15.2 reviews the basic elements of the greedy approach, giving a direct approach for proving greedy
algorithms correct. Section 15.3 presents an important application of greedy techniques: designing data-compression (Huffman) codes.
Finally, Section 15.4 shows that in order to decide which blocks to replace when a miss occurs in a cache, the “furthest-in-future” strategy
is optimal if the sequence of block accesses is known in advance.
The greedy method is quite powerful and works well for a wide range
of problems. Later chapters will present many algorithms that you can
view as applications of the greedy method, including minimum-
spanning-tree algorithms (Chapter 21), Dijkstra’s algorithm for shortest paths from a single source (Section 22.3), and a greedy set-covering heuristic (Section 35.3). Minimum-spanning-tree algorithms furnish a classic example of the greedy method. Although you can read this
chapter and Chapter 21 independently of each other, you might find it useful to read them together.
15.1 An activity-selection problem
Our first example is the problem of scheduling several competing
activities that require exclusive use of a common resource, with a goal of
selecting a maximum-size set of mutually compatible activities. Imagine
that you are in charge of scheduling a conference room. You are
presented with a set S = { a 1, a 2, … , an} of n proposed activities that wish to reserve the conference room, and the room can serve only one
activity at a time. Each activity ai has a start time si and a finish time fi, where 0 ≤ si < fi < ∞. If selected, activity ai takes place during the half-open time interval [ si, fi). Activities ai and aj are compatible if the intervals [ si, fi) and [ sj, fj) do not overlap. That is, ai and aj are compatible if si ≥ fj or sj ≥ fi. (Assume that if your staff needs time to change over the room from one activity to the next, the changeover time
is built into the intervals.) In the activity-selection problem, your goal is
to select a maximum-size subset of mutually compatible activities.
Assume that the activities are sorted in monotonically increasing order
of finish time:

f1 ≤ f2 ≤ f3 ≤ ⋯ ≤ fn−1 ≤ fn.    (15.1)

(We'll see later the advantage that this assumption provides.) For
example, consider the set of activities in Figure 15.1. The subset { a 3, a 9, a 11} consists of mutually compatible activities. It is not a maximum subset, however, since the subset { a 1, a 4, a 8, a 11} is larger. In fact, { a 1,
a 4, a 8, a 11} is a largest subset of mutually compatible activities, and another largest subset is { a 2, a 4, a 9, a 11}.
We’ll see how to solve this problem, proceeding in several steps. First
we’ll explore a dynamic-programming solution, in which you consider
several choices when determining which subproblems to use in an
optimal solution. We’ll then observe that you need to consider only one
choice—the greedy choice—and that when you make the greedy choice,
only one subproblem remains. Based on these observations, we’ll
develop a recursive greedy algorithm to solve the activity-selection
problem. Finally, we’ll complete the process of developing a greedy
solution by converting the recursive algorithm to an iterative one.
Although the steps we go through in this section are slightly more
involved than is typical when developing a greedy algorithm, they
illustrate the relationship between greedy algorithms and dynamic
programming.
Figure 15.1 A set { a 1, a 2, … , a 11} of activities. Activity ai has start time si and finish time fi.
The optimal substructure of the activity-selection problem
Let’s verify that the activity-selection problem exhibits optimal
substructure. Denote by Sij the set of activities that start after activity ai finishes and that finish before activity aj starts. Suppose that you want
to find a maximum set of mutually compatible activities in Sij, and
suppose further that such a maximum set is Aij, which includes some
activity ak. By including ak in an optimal solution, you are left with two subproblems: finding mutually compatible activities in the set Sik
(activities that start after activity ai finishes and that finish before activity ak starts) and finding mutually compatible activities in the set
Skj (activities that start after activity ak finishes and that finish before
activity aj starts). Let Aik = Aij ∩ Sik and Akj = Aij ∩ Skj, so that Aik contains the activities in Aij that finish before ak starts and Akj contains the activities in Aij that start after ak finishes. Thus, we have Aij = Aik ∪
{ ak} ∪ Akj, and so the maximum-size set Aij of mutually compatible activities in Sij consists of | Aij | = | Aik| + | Akj | + 1 activities.
The usual cut-and-paste argument shows that an optimal solution
Aij must also include optimal solutions to the two subproblems for Sik
and Skj. If you could find a set A′kj of mutually compatible activities in
Skj where |A′kj| > |Akj|, then you could use A′kj, rather than Akj, in a
solution to the subproblem for Sij. You would have constructed a set of
|Aik| + |A′kj| + 1 > |Aij| mutually compatible activities,
which contradicts the assumption that Aij is an optimal solution. A
symmetric argument applies to the activities in Sik.
This way of characterizing optimal substructure suggests that you
can solve the activity-selection problem by dynamic programming. Let’s
denote the size of an optimal solution for the set Sij by c[ i, j]. Then, the dynamic-programming approach gives the recurrence
c[ i, j] = c[ i, k] + c[ k, j] + 1.
Of course, if you do not know that an optimal solution for the set Sij
includes activity ak, you must examine all activities in Sij to find which one to choose, so that

c[i, j] = 0 if Sij = ∅,
c[i, j] = max {c[i, k] + c[k, j] + 1 : ak ∈ Sij} if Sij ≠ ∅.    (15.2)
You can then develop a recursive algorithm and memoize it, or you can
work bottom-up and fill in table entries as you go along. But you would
be overlooking another important characteristic of the activity-selection
problem that you can use to great advantage.
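Recurrence (15.2) translates directly into a memoized top-down procedure. The Python sketch below is ours, not the book's (the book gives only pseudocode); the function name and the sentinel activities a0 (with f0 = 0) and an+1 (with sn+1 = ∞) that bound the whole set are conveniences of this sketch. It computes c[0, n+1], the size of an optimal solution for all n activities, assuming they are sorted by finish time:

```python
from functools import lru_cache

def max_compatible(s, f):
    """Size of a maximum set of mutually compatible activities,
    computed from the recurrence for c[i, j].
    s, f: start/finish times of activities 1..n, sorted by finish time.
    Sentinels a_0 (f_0 = 0) and a_{n+1} (s_{n+1} = inf) are added so
    that S_{0,n+1} is the entire set of activities."""
    n = len(s)
    start = [0.0] + list(s) + [float('inf')]
    finish = [0.0] + list(f) + [float('inf')]

    @lru_cache(maxsize=None)
    def c(i, j):
        # S_ij contains those a_k with i < k < j (finish times are
        # sorted) that start after a_i finishes and finish before
        # a_j starts.
        best = 0
        for k in range(i + 1, j):
            if start[k] >= finish[i] and finish[k] <= start[j]:
                best = max(best, c(i, k) + c(k, j) + 1)
        return best

    return c(0, n + 1)
```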
Making the greedy choice
What if you could choose an activity to add to an optimal solution without having to first solve all the subproblems? That could save you
from having to consider all the choices inherent in recurrence (15.2). In
fact, for the activity-selection problem, you need to consider only one
choice: the greedy choice.
What is the greedy choice for the activity-selection problem?
Intuition suggests that you should choose an activity that leaves the
resource available for as many other activities as possible. Of the
activities you end up choosing, one of them must be the first one to
finish. Intuition says, therefore, choose the activity in S with the earliest
finish time, since that leaves the resource available for as many of the
activities that follow it as possible. (If more than one activity in S has
the earliest finish time, then choose any such activity.) In other words,
since the activities are sorted in monotonically increasing order by finish
time, the greedy choice is activity a 1. Choosing the first activity to finish
is not the only way to think of making a greedy choice for this problem.
Exercise 15.1-3 asks you to explore other possibilities.
Once you make the greedy choice, you have only one remaining
subproblem to solve: finding activities that start after a 1 finishes. Why
don’t you have to consider activities that finish before a 1 starts? Because
s 1 < f 1, and because f 1 is the earliest finish time of any activity, no activity can have a finish time less than or equal to s 1. Thus, all activities that are compatible with activity a 1 must start after a 1 finishes.
Furthermore, we have already established that the activity-selection
problem exhibits optimal substructure. Let Sk = { ai ∈ S : si ≥ fk} be the set of activities that start after activity ak finishes. If you make the greedy choice of activity a 1, then S 1 remains as the only subproblem to solve. 1 Optimal substructure says that if a 1 belongs to an optimal solution, then an optimal solution to the original problem consists of
activity a 1 and all the activities in an optimal solution to the
subproblem S 1.
One big question remains: Is this intuition correct? Is the greedy
choice—in which you choose the first activity to finish—always part of
some optimal solution? The following theorem shows that it is.
Theorem 15.1
Consider any nonempty subproblem Sk, and let am be an activity in Sk with the earliest finish time. Then am is included in some maximum-size
subset of mutually compatible activities of Sk.
Proof Let Ak be a maximum-size subset of mutually compatible activities in Sk, and let aj be the activity in Ak with the earliest finish time. If aj = am, we are done, since we have shown that am belongs to some maximum-size subset of mutually compatible activities of Sk. If aj
≠ am, let the set A′k = (Ak − {aj}) ∪ {am} be Ak but with am substituted for aj.
The activities in A′k are compatible, which follows because the activities
in Ak are compatible, aj is the first activity in Ak to finish, and fm ≤ fj.
Since |A′k| = |Ak|, we conclude that A′k is a maximum-size subset of
mutually compatible activities of Sk, and it includes am.
▪
Although you might be able to solve the activity-selection problem
with dynamic programming, Theorem 15.1 says that you don’t need to.
Instead, you can repeatedly choose the activity that finishes first, keep
only the activities compatible with this activity, and repeat until no
activities remain. Moreover, because you always choose the activity with
the earliest finish time, the finish times of the activities that you choose
must strictly increase. You can consider each activity just once overall,
in monotonically increasing order of finish times.
An algorithm to solve the activity-selection problem does not need
to work bottom-up, like a table-based dynamic-programming
algorithm. Instead, it can work top-down, choosing an activity to put
into the optimal solution that it constructs and then solving the
subproblem of choosing activities from those that are compatible with
those already chosen. Greedy algorithms typically have this top-down
design: make a choice and then solve a subproblem, rather than the
bottom-up technique of solving subproblems before making a choice.
Now that you know you can bypass the dynamic-programming
approach and instead use a top-down, greedy algorithm, let’s see a
straightforward, recursive procedure to solve the activity-selection
problem. The procedure RECURSIVE-ACTIVITY-SELECTOR on
the following page takes the start and finish times of the activities,
represented as arrays s and f, 2 the index k that defines the subproblem Sk it is to solve, and the size n of the original problem. It returns a maximum-size set of mutually compatible activities in Sk. The
procedure assumes that the n input activities are already ordered by
monotonically increasing finish time, according to equation (15.1). If
not, you can first sort them into this order in O( n lg n) time, breaking ties arbitrarily. In order to start, add the fictitious activity a 0 with f 0 =
0, so that subproblem S 0 is the entire set of activities S. The initial call, which solves the entire problem, is RECURSIVE-ACTIVITY-SELECTOR ( s, f, 0, n).
RECURSIVE-ACTIVITY-SELECTOR (s, f, k, n)
1  m = k + 1
2  while m ≤ n and s[m] < f[k]  // find the first activity in Sk to finish
3      m = m + 1
4  if m ≤ n
5      return {am} ∪ RECURSIVE-ACTIVITY-SELECTOR(s, f, m, n)
6  else return ∅
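Because the book's pseudocode is not directly executable, here is a near line-for-line Python transcription (a sketch of ours; a dummy entry at index 0 with f[0] = 0 plays the role of the fictitious activity a0, and the returned set contains activity indices):

```python
def recursive_activity_selector(s, f, k, n):
    """Return a maximum-size set of indices of mutually compatible
    activities in S_k. Arrays s and f hold activities 1..n sorted by
    finish time, with a dummy entry at position 0 and f[0] = 0."""
    m = k + 1
    # find the first activity in S_k to finish
    while m <= n and s[m] < f[k]:
        m = m + 1
    if m <= n:
        return {m} | recursive_activity_selector(s, f, m, n)
    else:
        return set()
```

The initial call, as in the text, is `recursive_activity_selector(s, f, 0, n)`.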
Figure 15.2 shows how the algorithm operates on the activities in
Figure 15.1. In a given recursive call RECURSIVE-ACTIVITY-
SELECTOR ( s, f, k, n), the while loop of lines 2–3 looks for the first activity in Sk to finish. The loop examines ak+1, ak+2, … , an, until it finds the first activity am that is compatible with ak, which means that sm ≥ fk. If the loop terminates because it finds such an activity, line 5
returns the union of { am} and the maximum-size subset of Sm returned by the recursive call RECURSIVE-ACTIVITY-SELECTOR ( s, f, m, n). Alternatively, the loop may terminate because m > n, in which case the procedure has examined all activities in Sk without finding one that
is compatible with ak. In this case, Sk = ∅ , and so line 6 returns ∅ .
Assuming that the activities have already been sorted by finish times,
the running time of the call RECURSIVE-ACTIVITY-SELECTOR ( s,
f, 0, n) is Θ( n). To see why, observe that over all recursive calls, each activity is examined exactly once in the while loop test of line 2. In
particular, activity ai is examined in the last call made in which k < i.
An iterative greedy algorithm
The recursive procedure can be converted to an iterative one because the
procedure RECURSIVE-ACTIVITY-SELECTOR is almost “tail
recursive” (see Problem 7-5): it ends with a recursive call to itself
followed by a union operation. It is usually a straightforward task to
transform a tail-recursive procedure to an iterative form. In fact, some
compilers for certain programming languages perform this task
automatically.
Figure 15.2 The operation of RECURSIVE-ACTIVITY-SELECTOR on the 11 activities from
Figure 15.1. Activities considered in each recursive call appear between horizontal lines. The fictitious activity a 0 finishes at time 0, and the initial call RECURSIVE-ACTIVITY-SELECTOR ( s, f, 0, 11) selects activity a 1. In each recursive call, the activities that have already been selected are blue, and the activity shown in tan is being considered. If the starting time of an activity occurs before the finish time of the most recently added activity (the arrow between them points left), it is rejected. Otherwise (the arrow points directly up or to the right), it is selected. The last recursive call, RECURSIVE-ACTIVITY-SELECTOR ( s, f, 11, 11), returns
∅ . The resulting set of selected activities is { a 1, a 4, a 8, a 11}.
The procedure GREEDY-ACTIVITY-SELECTOR is an iterative
version of the procedure RECURSIVE-ACTIVITY-SELECTOR. It,
too, assumes that the input activities are ordered by monotonically
increasing finish time. It collects selected activities into a set A and returns this set when it is done.
GREEDY-ACTIVITY-SELECTOR (s, f, n)
1  A = {a1}
2  k = 1
3  for m = 2 to n
4      if s[m] ≥ f[k]     // is am in Sk?
5          A = A ∪ {am}   // yes, so choose it
6          k = m          // and continue from there
7  return A
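The iterative procedure transcribes to Python the same way (again a sketch with a dummy entry at index 0, so that activities occupy positions 1 through n):

```python
def greedy_activity_selector(s, f, n):
    """Iterative activity selection. Arrays s and f hold activities
    1..n sorted by finish time, with a dummy entry at position 0.
    Returns the set of indices of selected activities."""
    A = {1}
    k = 1                      # index of the most recent addition to A
    for m in range(2, n + 1):
        if s[m] >= f[k]:       # is a_m in S_k?
            A.add(m)           # yes, so choose it
            k = m              # and continue from there
    return A
```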
The procedure works as follows. The variable k indexes the most
recent addition to A, corresponding to the activity ak in the recursive version. Since the procedure considers the activities in order of
monotonically increasing finish time, fk is always the maximum finish
time of any activity in A. That is,

fk = max {fi : ai ∈ A}.    (15.3)
Lines 1–2 select activity a 1, initialize A to contain just this activity, and initialize k to index this activity. The for loop of lines 3–6 finds the earliest activity in Sk to finish. The loop considers each activity am in turn and adds am to A if it is compatible with all previously selected activities. Such an activity is the earliest in Sk to finish. To see whether
activity am is compatible with every activity currently in A, it suffices by equation (15.3) to check (in line 4) that its start time sm is not earlier
than the finish time fk of the activity most recently added to A. If activity am is compatible, then lines 5–6 add activity am to A and set k to m. The set A returned by the call GREEDY-ACTIVITY-
SELECTOR (s, f, n) is precisely the set returned by the initial call RECURSIVE-ACTIVITY-SELECTOR (s, f, 0, n).
Like the recursive version, GREEDY-ACTIVITY-SELECTOR
schedules a set of n activities in Θ( n) time, assuming that the activities were already sorted initially by their finish times.
Exercises
15.1-1
Give a dynamic-programming algorithm for the activity-selection
problem, based on recurrence (15.2). Have your algorithm compute the
sizes c[ i, j] as defined above and also produce the maximum-size subset of mutually compatible activities. Assume that the inputs have been
sorted as in equation (15.1). Compare the running time of your solution
to the running time of GREEDY-ACTIVITY-SELECTOR.
15.1-2
Suppose that instead of always selecting the first activity to finish, you
instead select the last activity to start that is compatible with all
previously selected activities. Describe how this approach is a greedy
algorithm, and prove that it yields an optimal solution.
15.1-3
Not just any greedy approach to the activity-selection problem produces
a maximum-size set of mutually compatible activities. Give an example
to show that the approach of selecting the activity of least duration
from among those that are compatible with previously selected activities
does not work. Do the same for the approaches of always selecting the
compatible activity that overlaps the fewest other remaining activities
and always selecting the compatible remaining activity with the earliest
start time.
15.1-4
You are given a set of activities to schedule among a large number of
lecture halls, where any activity can take place in any lecture hall. You
wish to schedule all the activities using as few lecture halls as possible.
Give an efficient greedy algorithm to determine which activity should
use which lecture hall.
(This problem is also known as the interval-graph coloring problem.
It is modeled by an interval graph whose vertices are the given activities
and whose edges connect incompatible activities. The smallest number
of colors required to color every vertex so that no two adjacent vertices
have the same color corresponds to finding the fewest lecture halls
needed to schedule all of the given activities.)
15.1-5
Consider a modification to the activity-selection problem in which each
activity ai has, in addition to a start and finish time, a value vi. The objective is no longer to maximize the number of activities scheduled,
but instead to maximize the total value of the activities scheduled. That
is, the goal is to choose a set A of compatible activities such that
Σai∈A vi is maximized. Give a polynomial-time algorithm for this
problem.
15.2 Elements of the greedy strategy
A greedy algorithm obtains an optimal solution to a problem by
making a sequence of choices. At each decision point, the algorithm
makes the choice that seems best at the moment. This heuristic strategy
does not always produce an optimal solution, but as in the activity-
selection problem, sometimes it does. This section discusses some of the
general properties of greedy methods.
The process that we followed in Section 15.1 to develop a greedy algorithm was a bit more involved than is typical. It consisted of the
following steps:
1. Determine the optimal substructure of the problem.
2. Develop a recursive solution. (For the activity-selection problem,
we formulated recurrence (15.2), but bypassed developing a
recursive algorithm based solely on this recurrence.)
3. Show that if you make the greedy choice, then only one
subproblem remains.
4. Prove that it is always safe to make the greedy choice. (Steps 3
and 4 can occur in either order.)
5. Develop a recursive algorithm that implements the greedy
strategy.
6. Convert the recursive algorithm to an iterative algorithm.
These steps highlighted in great detail the dynamic-programming
underpinnings of a greedy algorithm. For example, the first cut at the
activity-selection problem defined the subproblems Sij, where both i and j varied. We then found that if you always make the greedy choice, you
can restrict the subproblems to be of the form Sk.
An alternative approach is to fashion optimal substructure with a
greedy choice in mind, so that the choice leaves just one subproblem to
solve. In the activity-selection problem, start by dropping the second
subscript and defining subproblems of the form Sk. Then prove that a
greedy choice (the first activity am to finish in Sk), combined with an optimal solution to the remaining set Sm of compatible activities, yields
an optimal solution to Sk. More generally, you can design greedy
algorithms according to the following sequence of steps:
1. Cast the optimization problem as one in which you make a
choice and are left with one subproblem to solve.
2. Prove that there is always an optimal solution to the original
problem that makes the greedy choice, so that the greedy choice
is always safe.
3. Demonstrate optimal substructure by showing that, having made
the greedy choice, what remains is a subproblem with the
property that if you combine an optimal solution to the
subproblem with the greedy choice you have made, you arrive at
an optimal solution to the original problem.
Later sections of this chapter will use this more direct process.
Nevertheless, beneath every greedy algorithm, there is almost always a
more cumbersome dynamic-programming solution.
How can you tell whether a greedy algorithm will solve a particular
optimization problem? No way works all the time, but the greedy-choice
property and optimal substructure are the two key ingredients. If you
can demonstrate that the problem has these properties, then you are well
on the way to developing a greedy algorithm for it.
Greedy-choice property
The first key ingredient is the greedy-choice property: you can assemble
a globally optimal solution by making locally optimal (greedy) choices.
In other words, when you are considering which choice to make, you
make the choice that looks best in the current problem, without
considering results from subproblems.
Here is where greedy algorithms differ from dynamic programming.
In dynamic programming, you make a choice at each step, but the
choice usually depends on the solutions to subproblems. Consequently,
you typically solve dynamic-programming problems in a bottom-up
manner, progressing from smaller subproblems to larger subproblems.
(Alternatively, you can solve them top down with memoization. Of course,
even though the code works top down, you still must solve the
subproblems before making a choice.) In a greedy algorithm, you make
whatever choice seems best at the moment and then solve the
subproblem that remains. The choice made by a greedy algorithm may
depend on choices so far, but it cannot depend on any future choices or
on the solutions to subproblems. Thus, unlike dynamic programming,
which solves the subproblems before making the first choice, a greedy
algorithm makes its first choice before solving any subproblems. A
dynamic-programming algorithm proceeds bottom up, whereas a greedy
strategy usually progresses top down, making one greedy choice after
another, reducing each given problem instance to a smaller one.
Of course, you need to prove that a greedy choice at each step yields
a globally optimal solution. Typically, as in the case of Theorem 15.1,
the proof examines a globally optimal solution to some subproblem. It
then shows how to modify the solution to substitute the greedy choice
for some other choice, resulting in one similar, but smaller, subproblem.
You can usually make the greedy choice more efficiently than when
you have to consider a wider set of choices. For example, in the activity-
selection problem, assuming that the activities were already sorted in
monotonically increasing order by finish times, each activity needed to
be examined just once. By preprocessing the input or by using an
appropriate data structure (often a priority queue), you often can make
greedy choices quickly, thus yielding an efficient algorithm.
Optimal substructure
As we saw in Chapter 14, a problem exhibits optimal substructure if an optimal solution to the problem contains within it optimal solutions to
subproblems. This property is a key ingredient of assessing whether
dynamic programming applies, and it’s also essential for greedy
algorithms. As an example of optimal substructure, recall how Section
15.1 demonstrated that if an optimal solution to subproblem Sij
includes an activity ak, then it must also contain optimal solutions to
the subproblems Sik and Skj. Given this optimal substructure, we argued that if you know which activity to use as ak, you can construct
an optimal solution to Sij by selecting ak along with all activities in optimal solutions to the subproblems Sik and Skj. This observation of
optimal substructure gave rise to the recurrence (15.2) that describes the
value of an optimal solution.
You will usually use a more direct approach regarding optimal
substructure when applying it to greedy algorithms. As mentioned
above, you have the luxury of assuming that you arrived at a
subproblem by having made the greedy choice in the original problem.
All you really need to do is argue that an optimal solution to the
subproblem, combined with the greedy choice already made, yields an
optimal solution to the original problem. This scheme implicitly uses
induction on the subproblems to prove that making the greedy choice at
every step produces an optimal solution.
Greedy versus dynamic programming
Because both the greedy and dynamic-programming strategies exploit
optimal substructure, you might be tempted to generate a dynamic-
programming solution to a problem when a greedy solution suffices or,
conversely, you might mistakenly think that a greedy solution works
when in fact a dynamic-programming solution is required. To illustrate
the subtle differences between the two techniques, let’s investigate two
variants of a classical optimization problem.
The 0-1 knapsack problem is the following. A thief robbing a store
wants to take the most valuable load that can be carried in a knapsack
capable of carrying at most W pounds of loot. The thief can choose to
take any subset of n items in the store. The i th item is worth vi dollars and weighs wi pounds, where vi and wi are integers. Which items should the thief take? (We call this the 0-1 knapsack problem because for each
item, the thief must either take it or leave it behind. The thief cannot
take a fractional amount of an item or take an item more than once.)
In the fractional knapsack problem, the setup is the same, but the
thief can take fractions of items, rather than having to make a binary (0-
1) choice for each item. You can think of an item in the 0-1 knapsack
problem as being like a gold ingot and an item in the fractional
knapsack problem as more like gold dust.
Both knapsack problems exhibit the optimal-substructure property.
For the 0-1 problem, if the most valuable load weighing at most W
pounds includes item j, then the remaining load must be the most
valuable load weighing at most W − wj pounds that the thief can take
from the n − 1 original items excluding item j. For the comparable fractional problem, if the most valuable load weighing at most W
pounds includes weight w of item j, then the remaining load must be the most valuable load weighing at most W − w pounds that the thief can
take from the n − 1 original items plus wj − w pounds of item j.
Although the problems are similar, a greedy strategy works to solve
the fractional knapsack problem, but not the 0-1 problem. To solve the
fractional problem, first compute the value per pound vi/ wi for each item. Obeying a greedy strategy, the thief begins by taking as much as
possible of the item with the greatest value per pound. If the supply of that item is exhausted and the thief can still carry more, then the thief
takes as much as possible of the item with the next greatest value per
pound, and so forth, until reaching the weight limit W. Thus, by sorting
the items by value per pound, the greedy algorithm runs in O( n lg n) time. You are asked to prove that the fractional knapsack problem has
the greedy-choice property in Exercise 15.2-1.
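The fractional strategy just described can be sketched in Python as follows (the function name is ours, not the book's; sorting by value per pound dominates the running time, giving the O(n lg n) bound):

```python
def fractional_knapsack(values, weights, W):
    """Greedy fractional knapsack: take items in decreasing order of
    value per pound, taking a fraction of the last item if needed.
    Returns the maximum total value obtainable with capacity W."""
    # sort items by value per pound, greatest first
    items = sorted(zip(values, weights),
                   key=lambda vw: vw[0] / vw[1], reverse=True)
    total = 0.0
    for v, w in items:
        if W <= 0:
            break
        take = min(w, W)            # take as much of this item as fits
        total += v * (take / w)
        W -= take
    return total
```

On the instance of Figure 15.3 (items worth $60, $100, and $120 weighing 10, 20, and 30 pounds, with W = 50), this takes all of items 1 and 2 plus 20/30 of item 3, for a total value of $240.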
To see that this greedy strategy does not work for the 0-1 knapsack
problem, consider the problem instance illustrated in Figure 15.3(a).
This example has three items and a knapsack that can hold 50 pounds.
Item 1 weighs 10 pounds and is worth $60. Item 2 weighs 20 pounds
and is worth $100. Item 3 weighs 30 pounds and is worth $120. Thus,
the value per pound of item 1 is $6 per pound, which is greater than the
value per pound of either item 2 ($5 per pound) or item 3 ($4 per
pound). The greedy strategy, therefore, would take item 1 first. As you
can see from the case analysis in Figure 15.3(b), however, the optimal solution takes items 2 and 3, leaving item 1 behind. The two possible
solutions that take item 1 are both suboptimal.
For the comparable fractional problem, however, the greedy strategy,
which takes item 1 first, does yield an optimal solution, as shown in
Figure 15.3(c). Taking item 1 doesn’t work in the 0-1 problem, because the thief is unable to fill the knapsack to capacity, and the empty space
lowers the effective value per pound of the load. In the 0-1 problem,
when you consider whether to include an item in the knapsack, you
must compare the solution to the subproblem that includes the item
with the solution to the subproblem that excludes the item before you
can make the choice. The problem formulated in this way gives rise to
many overlapping subproblems—a hallmark of dynamic programming,
and indeed, as Exercise 15.2-2 asks you to show, you can use dynamic
programming to solve the 0-1 problem.
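A few lines of Python make the gap on the Figure 15.3 instance concrete: the greedy order collects only $160, while the O(nW) dynamic program that Exercise 15.2-2 asks for finds the optimal $220 (items 2 and 3). Both routines below are sketches of ours, shown only for the comparison:

```python
def knapsack_01(values, weights, W):
    """O(n W) dynamic program for the 0-1 knapsack problem.
    best[w] = most valuable load of weight at most w using the
    items considered so far."""
    best = [0] * (W + 1)
    for v, wt in zip(values, weights):
        # iterate weights downward so each item is used at most once
        for w in range(W, wt - 1, -1):
            best[w] = max(best[w], best[w - wt] + v)
    return best[W]

def greedy_01(values, weights, W):
    """The (suboptimal) greedy strategy applied to the 0-1 problem:
    take whole items in decreasing order of value per pound."""
    items = sorted(zip(values, weights),
                   key=lambda vw: vw[0] / vw[1], reverse=True)
    total = 0
    for v, wt in items:
        if wt <= W:
            total += v
            W -= wt
    return total
```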
Figure 15.3 An example showing that the greedy strategy does not work for the 0-1 knapsack problem. (a) The thief must select a subset of the three items shown whose weight must not exceed 50 pounds. (b) The optimal subset includes items 2 and 3. Any solution with item 1 is suboptimal, even though item 1 has the greatest value per pound. (c) For the fractional knapsack problem, taking the items in order of greatest value per pound yields an optimal solution.
Exercises
15.2-1
Prove that the fractional knapsack problem has the greedy-choice
property.
15.2-2
Give a dynamic-programming solution to the 0-1 knapsack problem
that runs in O( n W) time, where n is the number of items and W is the maximum weight of items that the thief can put in the knapsack.
15.2-3
Suppose that in a 0-1 knapsack problem, the order of the items when
sorted by increasing weight is the same as their order when sorted by
decreasing value. Give an efficient algorithm to find an optimal solution
to this variant of the knapsack problem, and argue that your algorithm
is correct.
15.2-4
Professor Gekko has always dreamed of inline skating across North
Dakota. The professor plans to cross the state on highway U.S. 2, which
runs from Grand Forks, on the eastern border with Minnesota, to
Williston, near the western border with Montana. The professor can
carry two liters of water and can skate m miles before running out of
water. (Because North Dakota is relatively flat, the professor does not
have to worry about drinking water at a greater rate on uphill sections
than on flat or downhill sections.) The professor will start in Grand
Forks with two full liters of water. The professor has an official North
Dakota state map, which shows all the places along U.S. 2 to refill water
and the distances between these locations.
The professor’s goal is to minimize the number of water stops along
the route across the state. Give an efficient method by which the
professor can determine which water stops to make. Prove that your
strategy yields an optimal solution, and give its running time.
15.2-5
Describe an efficient algorithm that, given a set { x 1, x 2, … , xn} of points on the real line, determines the smallest set of unit-length closed
intervals that contains all of the given points. Argue that your algorithm
is correct.
★ 15.2-6
Show how to solve the fractional knapsack problem in O( n) time.
15.2-7
You are given two sets A and B, each containing n positive integers. You can choose to reorder each set however you like. After reordering, let ai
be the i th element of set A, and let bi be the i th element of set B. You then receive a payoff of
. Give an algorithm that maximizes your
payoff. Prove that your algorithm maximizes the payoff, and state its
running time, omitting the time for reordering the sets.
15.3 Huffman codes
Huffman codes compress data well: savings of 20% to 90% are typical,
depending on the characteristics of the data being compressed. The data
arrive as a sequence of characters. Huffman’s greedy algorithm uses a
table giving how often each character occurs (its frequency) to build up
an optimal way of representing each character as a binary string.
Suppose that you have a 100,000-character data file that you wish to
store compactly and you know that the 6 distinct characters in the file
occur with the frequencies given by Figure 15.4. The character a occurs 45,000 times, the character b occurs 13,000 times, and so on.
You have many options for how to represent such a file of
information. Here, we consider the problem of designing a binary
character code (or code for short) in which each character is represented by a unique binary string, which we call a codeword. If you use a fixed-length code, you need ⌈lg n⌉ bits to represent n ≥ 2 characters. For 6
characters, therefore, you need 3 bits: a = 000, b = 001, c = 010, d =
011, e = 100, and f = 101. This method requires 300,000 bits to encode
the entire file. Can you do better?
Figure 15.4 A character-coding problem. A data file of 100,000 characters contains only the characters a–f, with the frequencies indicated. With each character represented by a 3-bit codeword, encoding the file requires 300,000 bits. With the variable-length code shown, the encoding requires only 224,000 bits.
A variable-length code can do considerably better than a fixed-length
code. The idea is simple: give frequent characters short codewords and
infrequent characters long codewords. Figure 15.4 shows such a code.
Here, the 1-bit string 0 represents a, and the 4-bit string 1100 represents
f. This code requires
(45 · 1 + 13 · 3 + 12 · 3 + 16 · 3 + 9 · 4 + 5 · 4) · 1,000 = 224,000 bits
to represent the file, a savings of approximately 25%. In fact, this is an
optimal character code for this file, as we shall see.
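As a quick sanity check (not part of the text), both encoding costs can be computed directly from the frequency table and codeword lengths of Figure 15.4:

```python
# Frequencies (in thousands) and variable-length codeword lengths from Figure 15.4.
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
var_len = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 4, "f": 4}

fixed_cost = sum(freq.values()) * 3 * 1000        # 3 bits for each of 100,000 characters
var_cost = sum(freq[c] * var_len[c] for c in freq) * 1000

print(fixed_cost, var_cost)   # 300000 224000
```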
Prefix-free codes
We consider here only codes in which no codeword is also a prefix of
some other codeword. Such codes are called prefix-free codes. Although
we won’t prove it here, a prefix-free code can always achieve the optimal
data compression among any character code, and so we suffer no loss of
generality by restricting our attention to prefix-free codes.
Encoding is always simple for any binary character code: just
concatenate the codewords representing each character of the file. For
example, with the variable-length prefix-free code of Figure 15.4, the 4-character file face has the encoding 1100 · 0 · 100 · 1101 =
110001001101, where “·” denotes concatenation.
Prefix-free codes are desirable because they simplify decoding. Since
no codeword is a prefix of any other, the codeword that begins an
encoded file is unambiguous. You can simply identify the initial
codeword, translate it back to the original character, and repeat the
decoding process on the remainder of the encoded file. In our example,
the string 100011001101 parses uniquely as 100 · 0 · 1100 · 1101, which
decodes to cafe.
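The parsing procedure just described is easy to sketch in code. The following snippet (an illustration, not the book's pseudocode) encodes and decodes with the variable-length prefix-free code of Figure 15.4:

```python
# Variable-length prefix-free code from Figure 15.4.
code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

def encode(text):
    # Encoding is simple concatenation of codewords.
    return "".join(code[ch] for ch in text)

def decode(bits):
    # Because no codeword is a prefix of another, the codeword that
    # begins the string is unambiguous: accumulate bits until they match.
    inverse = {w: ch for ch, w in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

print(encode("face"))          # 110001001101
print(decode("100011001101"))  # cafe
```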
Figure 15.5 Trees corresponding to the coding schemes in Figure 15.4. Each leaf is labeled with a character and its frequency of occurrence. Each internal node is labeled with the sum of the frequencies of the leaves in its subtree. All frequencies are in thousands. (a) The tree corresponding to the fixed-length code a = 000, b = 001, c = 010, d = 011, e = 100, f = 101. (b) The tree corresponding to the optimal prefix-free code a = 0, b = 101, c = 100, d = 111, e =
1101, f = 1100.
The decoding process needs a convenient representation for the
prefix-free code so that you can easily pick off the initial codeword. A
binary tree whose leaves are the given characters provides one such
representation. Interpret the binary codeword for a character as the
simple path from the root to that character, where 0 means “go to the
left child” and 1 means “go to the right child.” Figure 15.5 shows the trees for the two codes of our example. Note that these are not binary
search trees, since the leaves need not appear in sorted order and
internal nodes do not contain character keys.
An optimal code for a file is always represented by a full binary tree,
in which every nonleaf node has two children (see Exercise 15.3-2). The
fixed-length code in our example is not optimal since its tree, shown in
Figure 15.5(a), is not a full binary tree: it contains codewords beginning with 10, but none beginning with 11. Since we can now restrict our
attention to full binary trees, we can say that if C is the alphabet from
which the characters are drawn and all character frequencies are
positive, then the tree for an optimal prefix-free code has exactly | C |
leaves, one for each letter of the alphabet, and exactly | C | − 1 internal
nodes (see Exercise B.5-3 on page 1175).
Given a tree T corresponding to a prefix-free code, we can compute
the number of bits required to encode a file. For each character c in the
alphabet C, let the attribute c.freq denote the frequency of c in the file and let dT(c) denote the depth of c’s leaf in the tree. Note that dT(c) is also the length of the codeword for character c. The number of bits required to encode a file is thus

B(T) = ∑_{c ∈ C} c.freq · dT(c),   (15.4)

which we define as the cost of the tree T.
Constructing a Huffman code
Huffman invented a greedy algorithm that constructs an optimal prefix-
free code, called a Huffman code in his honor. In line with our
observations in Section 15.2, its proof of correctness relies on the greedy-choice property and optimal substructure. Rather than
demonstrating that these properties hold and then developing
pseudocode, we present the pseudocode first. Doing so will help clarify how the algorithm makes greedy choices.
The procedure HUFFMAN assumes that C is a set of n characters
and that each character c ∈ C is an object with an attribute c. freq giving its frequency. The algorithm builds the tree T corresponding to an
optimal code in a bottom-up manner. It begins with a set of | C | leaves
and performs a sequence of | C | − 1 “merging” operations to create the
final tree. The algorithm uses a min-priority queue Q, keyed on the freq
attribute, to identify the two least-frequent objects to merge together.
The result of merging two objects is a new object whose frequency is the
sum of the frequencies of the two objects that were merged.
HUFFMAN(C)
 1  n = |C|
 2  Q = C
 3  for i = 1 to n − 1
 4      allocate a new node z
 5      x = EXTRACT-MIN(Q)
 6      y = EXTRACT-MIN(Q)
 7      z.left = x
 8      z.right = y
 9      z.freq = x.freq + y.freq
10      INSERT(Q, z)
11  return EXTRACT-MIN(Q)  // the root of the tree is the only node left
For our example, Huffman’s algorithm proceeds as shown in Figure
15.6. Since the alphabet contains 6 letters, the initial queue size is n = 6,
and 5 merge steps build the tree. The final tree represents the optimal
prefix-free code. The codeword for a letter is the sequence of edge labels
on the simple path from the root to the letter.
Figure 15.6 The steps of Huffman’s algorithm for the frequencies given in Figure 15.4. Each part shows the contents of the queue sorted into increasing order by frequency. Each step merges the two trees with the lowest frequencies. Leaves are shown as rectangles containing a character and its frequency. Internal nodes are shown as circles containing the sum of the frequencies of their children. An edge connecting an internal node with its children is labeled 0 if it is an edge to a left child and 1 if it is an edge to a right child. The codeword for a letter is the sequence of labels on the edges connecting the root to the leaf for that letter. (a) The initial set of n = 6 nodes, one for each letter. (b)–(e) Intermediate stages. (f) The final tree.
The HUFFMAN procedure works as follows. Line 2 initializes the
min-priority queue Q with the characters in C. The for loop in lines 3–
10 repeatedly extracts the two nodes x and y of lowest frequency from
the queue and replaces them in the queue with a new node z
representing their merger. The frequency of z is computed as the sum of
the frequencies of x and y in line 9. The node z has x as its left child and y as its right child. (This order is arbitrary. Switching the left and right
child of any node yields a different code of the same cost.) After n − 1
mergers, line 11 returns the one node left in the queue, which is the root of the code tree.
The algorithm produces the same result without the variables x and
y, assigning the values returned by the EXTRACT-MIN calls directly
to z. left and z. right in lines 7 and 8, and changing line 9 to z. freq =
z. left. freq+ z. right. freq. We’ll use the node names x and y in the proof of correctness, however, so we leave them in.
The running time of Huffman’s algorithm depends on how the min-
priority queue Q is implemented. Let’s assume that it’s implemented as
a binary min-heap (see Chapter 6). For a set C of n characters, the BUILD-MIN-HEAP procedure discussed in Section 6.3 can initialize Q
in line 2 in O( n) time. The for loop in lines 3–10 executes exactly n − 1
times, and since each heap operation runs in O(lg n) time, the loop contributes O( n lg n) to the running time. Thus, the total running time of HUFFMAN on a set of n characters is O( n lg n).
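For readers who want to experiment, here is one possible Python rendering of the HUFFMAN procedure, using the standard heapq module as the min-priority queue. The helper names and nested-tuple tree representation are our own choices, and tie-breaking among equal frequencies means the codewords may differ from those in Figure 15.6, although the total cost of any optimal code is the same:

```python
import heapq
from itertools import count

def huffman(freq):
    """Build codewords for an optimal prefix-free code (one possible tree)."""
    tiebreak = count()  # keeps heap comparisons away from the tree payloads
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)                    # plays the role of BUILD-MIN-HEAP
    for _ in range(len(freq) - 1):         # |C| - 1 merging operations
        fx, _, x = heapq.heappop(heap)     # x = EXTRACT-MIN(Q)
        fy, _, y = heapq.heappop(heap)     # y = EXTRACT-MIN(Q)
        heapq.heappush(heap, (fx + fy, next(tiebreak), (x, y)))
    _, _, root = heap[0]

    codes = {}
    def walk(node, word):
        if isinstance(node, tuple):        # internal node: 0 = left, 1 = right
            walk(node[0], word + "0")
            walk(node[1], word + "1")
        else:
            codes[node] = word or "0"      # degenerate one-character alphabet
    walk(root, "")
    return codes

freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
codes = huffman(freq)
cost = sum(freq[c] * len(codes[c]) for c in freq)
print(cost)   # 224 (thousands of bits, matching Figure 15.4)
```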
Correctness of Huffman’s algorithm
To prove that the greedy algorithm HUFFMAN is correct, we’ll show
that the problem of determining an optimal prefix-free code exhibits the
greedy-choice and optimal-substructure properties. The next lemma
shows that the greedy-choice property holds.
Lemma 15.2 (Optimal prefix-free codes have the greedy-choice property)
Let C be an alphabet in which each character c ∈ C has frequency c. freq. Let x and y be two characters in C having the lowest frequencies.
Then there exists an optimal prefix-free code for C in which the
codewords for x and y have the same length and differ only in the last
bit.
Proof The idea of the proof is to take the tree T representing an arbitrary optimal prefix-free code and modify it to make a tree
representing another optimal prefix-free code such that the characters x
and y appear as sibling leaves of maximum depth in the new tree. In such a tree, the codewords for x and y have the same length and differ
only in the last bit.

Let a and b be any two characters that are sibling leaves of maximum
depth in T. Without loss of generality, assume that a. freq ≤ b. freq and x. freq ≤ y. freq. Since x. freq and y. freq are the two lowest leaf frequencies, in order, and a. freq and b. freq are two arbitrary frequencies, in order, we have x. freq ≤ a. freq and y. freq ≤ b. freq.
In the remainder of the proof, it is possible that we could have x. freq
= a. freq or y. freq = b. freq, but x. freq = b. freq implies that a. freq = b. freq
= x. freq = y. freq (see Exercise 15.3-1), and the lemma would be trivially true. Therefore, assume that x.freq ≠ b.freq, which means that x ≠ b.
Figure 15.7 An illustration of the key step in the proof of Lemma 15.2. In the optimal tree T, leaves a and b are two siblings of maximum depth. Leaves x and y are the two characters with the lowest frequencies. They appear in arbitrary positions in T. Assuming that x ≠ b, swapping leaves a and x produces tree T′, and then swapping leaves b and y produces tree T ″. Since each swap does not increase the cost, the resulting tree T ″ is also an optimal tree.
As Figure 15.7 shows, imagine exchanging the positions in T of a and x to produce a tree T′, and then exchanging the positions in T′ of b and y to produce a tree T″ in which x and y are sibling leaves of maximum depth. (Note that if x = b but y ≠ a, then tree T ″ does not have x and y as sibling leaves of maximum depth. Because we assume
that x ≠ b, this situation cannot occur.) By equation (15.4), the difference in cost between T and T′ is

B(T) − B(T′) = x.freq · dT(x) + a.freq · dT(a) − x.freq · dT(a) − a.freq · dT(x)
             = (a.freq − x.freq)(dT(a) − dT(x))
             ≥ 0,

because both a.freq − x.freq and dT(a) − dT(x) are nonnegative. More specifically, a.freq − x.freq is nonnegative because x is a minimum-frequency leaf, and dT(a) − dT(x) is nonnegative because a is a leaf of maximum depth in T. Similarly, exchanging y and b does not increase the cost, and so B(T′) − B(T″) is nonnegative. Therefore, B(T″) ≤ B(T′)
≤ B( T), and since T is optimal, we have B( T) ≤ B( T ″), which implies B( T
″) = B( T). Thus, T ″ is an optimal tree in which x and y appear as sibling leaves of maximum depth, from which the lemma follows.
▪
Lemma 15.2 implies that the process of building up an optimal tree
by mergers can, without loss of generality, begin with the greedy choice
of merging together those two characters of lowest frequency. Why is
this a greedy choice? We can view the cost of a single merger as being
the sum of the frequencies of the two items being merged. Exercise 15.3-
4 shows that the total cost of the tree constructed equals the sum of the
costs of its mergers. Of all possible mergers at each step, HUFFMAN
chooses the one that incurs the least cost.
The next lemma shows that the problem of constructing optimal
prefix-free codes has the optimal-substructure property.
Lemma 15.3 (Optimal prefix-free codes have the optimal-substructure
property)
Let C be a given alphabet with frequency c. freq defined for each character c ∈ C. Let x and y be two characters in C with minimum frequency. Let C′ be the alphabet C with the characters x and y removed and a new character z added, so that C′ = ( C − { x, y}) ∪ { z}. Define freq for all characters in C′ with the same values as in C, along with z. freq = x. freq + y. freq. Let T′ be any tree representing an optimal prefix-free code for alphabet C′. Then the tree T, obtained from T′ by replacing the leaf node for z with an internal node having x and y as children, represents an optimal prefix-free code for the alphabet C.
Proof We first show how to express the cost B( T) of tree T in terms of the cost B( T′) of tree T′, by considering the component costs in equation (15.4). For each character c ∈ C − { x, y}, we have that dT ( c)
= dT′ ( c), and hence c. freq · dT ( c) = c. freq · dT′ ( c). Since dT ( x) = dT
( y) = dT′ ( z) + 1, we have
x. freq · dT ( x) + y. freq · dT ( y) = ( x. freq + y. freq)( dT′ ( z) + 1)
= z. freq · dT′( z)+ ( x. freq + y. freq), from which we conclude that
B( T) = B( T′) + x. freq + y. freq or, equivalently,
B( T′) = B( T) − x. freq − y. freq.
We now prove the lemma by contradiction. Suppose that T does not
represent an optimal prefix-free code for C. Then there exists an optimal
tree T″ such that B( T″) < B( T). Without loss of generality (by Lemma 15.2), T″ has x and y as siblings. Let T″′ be the tree T″ with the common parent of x and y replaced by a leaf z with frequency z. freq = x. freq +
y. freq. Then
B( T‴) = B( T″) − x. freq − y. freq
< B( T) − x. freq − y. freq
= B( T′),
yielding a contradiction to the assumption that T′ represents an optimal
prefix-free code for C′. Thus, T must represent an optimal prefix-free code for the alphabet C.
▪
Theorem 15.4
Procedure HUFFMAN produces an optimal prefix-free code.
Proof Immediate from Lemmas 15.2 and 15.3.
Exercises
15.3-1
Explain why, in the proof of Lemma 15.2, if x. freq = b. freq, then we must have a. freq = b. freq = x. freq = y. freq.
15.3-2
Prove that a non-full binary tree cannot correspond to an optimal
prefix-free code.
15.3-3
What is an optimal Huffman code for the following set of frequencies,
based on the first 8 Fibonacci numbers?
a:1 b:1 c:2 d:3 e:5 f:8 g:13 h:21
Can you generalize your answer to find the optimal code when the
frequencies are the first n Fibonacci numbers?
15.3-4
Prove that the total cost B( T) of a full binary tree T for a code equals the sum, over all internal nodes, of the combined frequencies of the two
children of the node.
15.3-5
Given an optimal prefix-free code on a set C of n characters, you wish to transmit the code itself using as few bits as possible. Show how to
represent any optimal prefix-free code on C using only 2 n − 1 + n ⌈lg n⌉
bits. ( Hint: Use 2 n − 1 bits to specify the structure of the tree, as discovered by a walk of the tree.)
15.3-6
Generalize Huffman’s algorithm to ternary codewords (i.e., codewords
using the symbols 0, 1, and 2), and prove that it yields optimal ternary
codes.
15.3-7
A data file contains a sequence of 8-bit characters such that all 256
characters are about equally common: the maximum character
frequency is less than twice the minimum character frequency. Prove
that Huffman coding in this case is no more efficient than using an ordinary 8-bit fixed-length code.
15.3-8
Show that no lossless (invertible) compression scheme can guarantee
that for every input file, the corresponding output file is shorter. ( Hint:
Compare the number of possible files with the number of possible
encoded files.)
15.4 Offline caching
Computer systems can decrease the time to access data by storing a
subset of the main memory in the cache: a small but faster memory. A
cache organizes data into cache blocks typically comprising 32, 64, or
128 bytes. You can also think of main memory as a cache for disk-
resident data in a virtual-memory system. Here, the blocks are called
pages, and 4096 bytes is a typical size.
As a computer program executes, it makes a sequence of memory
requests. Say that there are n memory requests, to data in blocks b 1, b 2,
… , bn, in that order. The blocks in the access sequence might not be
distinct, and indeed, any given block is usually accessed multiple times.
For example, a program that accesses four distinct blocks p, q, r, s might make a sequence of requests to blocks s, q, s, q, q, s, p, p, r, s, s, q, p, r, q. The cache can hold up to some fixed number k of cache blocks. It starts out empty before the first request. Each request causes at most
one block to enter the cache and at most one block to be evicted from
the cache. Upon a request for block bi, any one of three scenarios may
occur:
1. Block bi is already in the cache, due to a previous request for the
same block. The cache remains unchanged. This situation is
known as a cache hit.
2. Block bi is not in the cache at that time, but the cache contains
fewer than k blocks. In this case, block bi is placed into the
cache, so that the cache contains one more block than it did before the request.
3. Block bi is not in the cache at that time and the cache is full: it
contains k blocks. Block bi is placed into the cache, but before
that happens, some other block in the cache must be evicted
from the cache in order to make room.
The latter two situations, in which the requested block is not already
in the cache, are called cache misses. The goal is to minimize the number
of cache misses or, equivalently, to maximize the number of cache hits,
over the entire sequence of n requests. A cache miss that occurs while
the cache holds fewer than k blocks—that is, as the cache is first being
filled up—is known as a compulsory miss, since no prior decision could
have kept the requested block in the cache. When a cache miss occurs
and the cache is full, ideally the choice of which block to evict should
allow for the smallest possible number of cache misses over the entire
sequence of future requests.
Typically, caching is an online problem. That is, the computer has to
decide which blocks to keep in the cache without knowing the future
requests. Here, however, let’s consider the offline version of this
problem, in which the computer knows in advance the entire sequence
of n requests and the cache size k, with a goal of minimizing the total number of cache misses.
To solve this offline problem, you can use a greedy strategy called
furthest-in-future, which chooses to evict the block in the cache whose
next access in the request sequence comes furthest in the future.
Intuitively, this strategy makes sense: if you’re not going to need
something for a while, why keep it around? We’ll show that the furthest-
in-future strategy is indeed optimal by showing that the offline caching
problem exhibits optimal substructure and that furthest-in-future has
the greedy-choice property.
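As a sketch of the strategy (with hypothetical helper names, and an unoptimized rescan to find each block's next use), furthest-in-future can be simulated as follows:

```python
def furthest_in_future(requests, k):
    """Count cache misses under the furthest-in-future eviction strategy.

    On a miss with a full cache, evict the resident block whose next
    request is furthest in the future (or that is never requested again).
    """
    cache = set()
    misses = 0
    for i, block in enumerate(requests):
        if block in cache:
            continue                          # cache hit: nothing changes
        misses += 1
        if len(cache) == k:                   # full: must evict before inserting
            def next_use(b):
                for j in range(i + 1, len(requests)):
                    if requests[j] == b:
                        return j
                return len(requests)          # never requested again
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses

# The 15-request example from the text, with a cache of k = 3 blocks.
reqs = list("sqsqqspprssqprq")
print(furthest_in_future(reqs, 3))   # 5 misses
```

The first four misses are compulsory (one per distinct block), so only one of the five misses results from an eviction decision.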
Now, you might be thinking that since the computer usually doesn’t
know the sequence of requests in advance, there is no point in studying
the offline problem. Actually, there is. In some situations, you do know
the sequence of requests in advance. For example, if you view the main
memory as the cache and the full set of data as residing on disk (or a solid-state drive), there are algorithms that plan out the entire set of
reads and writes in advance. Furthermore, we can use the number of
cache misses produced by an optimal algorithm as a baseline for
comparing how well online algorithms perform. We’ll do just that when we study online algorithms.
Offline caching can even model real-world problems. For example,
consider a scenario where you know in advance a fixed schedule of n
events at known locations. Events may occur at a location multiple
times, not necessarily consecutively. You are managing a group of k
agents, you need to ensure that you have one agent at each location
when an event occurs, and you want to minimize the number of times
that agents have to move. Here, the agents are like the blocks, the events
are like the requests, and moving an agent is akin to a cache miss.
Optimal substructure of offline caching
To show that the offline problem exhibits optimal substructure, let’s
define the subproblem ( C, i) as processing requests for blocks bi, bi+1,
… , bn with cache configuration C at the time that the request for block bi occurs, that is, C is a subset of the set of blocks such that | C | ≤ k. A solution to subproblem ( C, i) is a sequence of decisions that specifies which block to evict (if any) upon each request for blocks bi, bi+1, … , bn. An optimal solution to subproblem ( C, i) minimizes the number of cache misses.
Consider an optimal solution S to subproblem ( C, i), and let C′ be the contents of the cache after processing the request for block bi in solution S. Let S′ be the subsolution of S for the resulting subproblem ( C′, i + 1). If the request for bi results in a cache hit, then the cache remains unchanged, so that C′ = C. If the request for block bi results in a cache miss, then the contents of the cache change, so that C′ ≠ C. We claim that in either case, S′ is an optimal solution to subproblem ( C′, i +
1). Why? If S′ is not an optimal solution to subproblem ( C′, i + 1), then there exists another solution S″ to subproblem ( C′, i + 1) that makes fewer cache misses than S′. Combining S″ with the decision of S at the
request for block bi yields another solution that makes fewer cache
misses than S, which contradicts the assumption that S is an optimal solution to subproblem ( C, i).
To quantify a recursive solution, we need a little more notation. Let
RC, i be the set of all cache configurations that can immediately follow configuration C after processing a request for block bi. If the request results in a cache hit, then the cache remains unchanged, so that RC, i =
{ C }. If the request for bi results in a cache miss, then there are two possibilities. If the cache is not full (| C | < k), then the cache is filling up and the only choice is to insert bi into the cache, so that RC, i= { C ∪
{ bi}}. If the cache is full (| C | = k) upon a cache miss, then RC, i contains k potential configurations: one for each candidate block in C
that could be evicted and replaced by block bi. In this case, RC, i = {( C
− { x}) ∪ { bi} : x ∈ C }. For example, if C = { p, q, r}, k = 3, and block s is requested, then RC, i = {{ p, q, s},{ p, r, s},{ q, r, s}}.
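To make RC,i concrete, here is an illustrative Python helper (the names are ours, not the text's) that generates the successor configurations, together with a brute-force evaluation of the minimum number of misses for a subproblem. The recursion is exponential and is meant only to mirror the definitions, not to be efficient:

```python
from functools import lru_cache

def successors(cache, block, k):
    """R_{C,i}: configurations that can follow cache C when `block` is requested."""
    if block in cache:
        return [cache]                               # cache hit: unchanged
    if len(cache) < k:
        return [cache | {block}]                     # cache filling up
    return [(cache - {x}) | {block} for x in cache]  # evict any one resident block

def min_misses(requests, k):
    """Brute-force minimum misses over all eviction choices (exponential time)."""
    @lru_cache(maxsize=None)
    def miss(cache, i):
        if i == len(requests):
            return 0
        b = requests[i]
        penalty = 0 if b in cache else 1
        return penalty + min(
            miss(frozenset(c), i + 1) for c in successors(set(cache), b, k)
        )
    return miss(frozenset(), 0)

# The example from the text: C = {p, q, r}, k = 3, block s requested.
print(successors({"p", "q", "r"}, "s", 3))
print(min_misses(tuple("sqsqqspprssqprq"), 3))   # 5
```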
Let miss(C, i) denote the minimum number of cache misses in a solution for subproblem (C, i). Here is a recurrence for miss(C, i):

miss(C, i) = 0                                       if i = n + 1,
miss(C, i) = miss(C, i + 1)                          if bi ∈ C,
miss(C, i) = 1 + miss(C ∪ {bi}, i + 1)               if bi ∉ C and |C| < k,
miss(C, i) = 1 + min {miss(C′, i + 1) : C′ ∈ RC,i}   if bi ∉ C and |C| = k.

Greedy-choice property
To prove that the furthest-in-future strategy yields an optimal solution,
we need to show that optimal offline caching exhibits the greedy-choice
property. Combined with the optimal-substructure property, the greedy-
choice property will prove that furthest-in-future produces the
minimum possible number of cache misses.
Theorem 15.5 (Optimal offline caching has the greedy-choice property)
Consider a subproblem ( C, i) when the cache C contains k blocks, so that it is full, and a cache miss occurs. When block bi is requested, let z
= bm be the block in C whose next access is furthest in the future. (If some block in the cache will never again be referenced, then consider
any such block to be block z, and add a dummy request for block z =
bm = bn+1.) Then evicting block z upon a request for block bi is included in some optimal solution for the subproblem ( C, i).
Proof Let S be an optimal solution to subproblem ( C, i). If S evicts block z upon the request for block bi, then we are done, since we have
shown that some optimal solution includes evicting z.
So now suppose that optimal solution S evicts some other block x
when block bi is requested. We’ll construct another solution S′ to subproblem ( C, i) which, upon the request for bi, evicts block z instead of x and induces no more cache misses than S does, so that S′ is also optimal. Because different solutions may yield different cache
configurations, denote by CS, j the configuration of the cache under solution S just before the request for some block bj, and likewise for solution S′ and CS′, j. We’ll show how to construct S′ with the following properties:
1. For j = i + 1, … , m, let Dj = CS, j ∩ CS′, j. Then, | Dj | ≥ k − 1, so that the cache configurations CS, j and CS′, j differ by at most one block. If they differ, then CS, j = Dj ∪ { z} and CS′, j = Dj ∪
{ y} for some block y ≠ z.
2. For each request of blocks bi, … , bm−1, if solution S has a cache hit, then solution S′ also has a cache hit.
3. For all j > m, the cache configurations CS, j and CS′, j are identical.
4. Over the sequence of requests for blocks bi, … , bm, the number
of cache misses produced by solution S′ is at most the number of
cache misses produced by solution S.
We’ll prove inductively that these properties hold for each request.
1. We proceed by induction on j, for j = i +1, … , m. For the base case, the initial caches CS, i and CS′, i are identical. Upon the request for block bi, solution S evicts x and solution S′ evicts z.
Thus, cache configurations CS, i+1 and CS′, i+1 differ by just one block, CS, i+1 = Di+1 ∪ { z}, CS′, i+1 = Di+1 ∪ { x}, and x ≠ z.
The inductive step defines how solution S′ behaves upon a
request for block bj for i + 1 ≤ j ≤ m − 1. The inductive hypothesis is that property 1 holds when bj is requested. Because
z = bm is the block in CS, i whose next reference is furthest in the future, we know that bj ≠ z. We consider several scenarios:
If CS, j = CS′, j (so that | Dj | = k), then solution S′ makes the same decision upon the request for bj as S makes, so
that CS, j+1 = CS′, j+1.
If | Dj| = k − 1 and bj ∈ Dj, then both caches already contain block bj, and both solutions S and S′ have cache
hits. Therefore, CS, j+1 = CS, j and CS′, j+1 = CS′, j.
If | Dj | = k − 1 and bj ∉ Dj, then because CS, j = Dj ∪ { z}
and bj ≠ z, solution S has a cache miss. It evicts either block z or some block w ∈ Dj.
If solution S evicts block z, then CS, j+1 = Dj ∪ { bj}.
There are two cases, depending on whether bj = y:
If bj = y, then solution S′ has a cache hit, so
that CS′, j+1 = CS′, j = Dj ∪ { bj}. Thus, CS, j+1
= CS′, j +1.
If bj ≠ y, then solution S′ has a cache miss. It
evicts block y, so that CS′, j+1 = Dj ∪ { bj }, and again CS, j+1 = CS′, j+1.
If solution S evicts some block w ∈ Dj, then CS, j+1
= ( Dj − { w}) ∪ { bj, z}. Once again, there are two
cases, depending on whether bj = y:
If bj = y, then solution S′ has a cache hit, so
that CS′, j+1 = CS′, j = Dj ∪ { bj}. Since w ∈ Dj and w was not evicted by solution S′, we have
w ∈ CS′, j +1. Therefore, w ∉ Dj+1 and bj ∈
Dj+1, so that Dj+1 = ( Dj − { w}) ∪ { bj }. Thus, CS, j+1 = Dj+1 ∪ { z}, CS′, j+1 = Dj +1 ∪ { w}, and because w ≠ z, property 1 holds when
block bj+1 is requested. (In other words, block
w replaces block y in property 1.)
If bj ≠ y, then solution S′ has a cache miss. It
evicts block w, so that CS′, j +1 = ( Dj − { w}) ∪
{ bj, y}. Therefore, we have that Dj+1 = ( Dj −
{ w}) ∪ { bj } and so CS, j+1 = Dj+1 ∪ { z} and CS′, j+1 = Dj +1 ∪ { y}.
2. In the above discussion about maintaining property 1, solution S
may have a cache hit in only the first two cases, and solution S′