…, ki−1: no keys at all. Bear in mind, however, that subtrees also

contain dummy keys. We adopt the convention that a subtree

containing keys ki, …, ki−1 has no actual keys but does contain the single dummy key di−1. Symmetrically, if you select kj as the root, then kj’s right subtree contains the keys kj+1, …, kj. This right subtree contains no actual keys, but it does contain the dummy key dj.

Step 2: A recursive solution


To define the value of an optimal solution recursively, the subproblem domain is finding an optimal binary search tree containing the keys ki, …, kj, where i ≥ 1, j ≤ n, and j ≥ i − 1. (When j = i − 1, there is just the dummy key di−1, but no actual keys.) Let e[i, j] denote the expected cost of searching an optimal binary search tree containing the keys ki, …, kj.

Your goal is to compute e[1, n], the expected cost of searching an optimal binary search tree for all the actual and dummy keys.

The easy case occurs when j = i − 1. Then the subproblem consists of

just the dummy key di−1. The expected search cost is e[ i, i − 1] = qi−1.

When j ≥ i, you need to select a root kr from among ki, …, kj and then make an optimal binary search tree with keys ki, …, kr−1 as its left subtree and an optimal binary search tree with keys kr+1, …, kj as its right subtree. What happens to the expected search cost of a subtree when it becomes a subtree of a node? The depth of each node in the subtree increases by 1. By equation (14.11), the expected search cost of this subtree increases by the sum of all the probabilities in the subtree. For a subtree with keys ki, …, kj, denote this sum of probabilities as

w(i, j) = pi + pi+1 + ⋯ + pj + qi−1 + qi + ⋯ + qj.    (14.12)

Thus, if kr is the root of an optimal subtree containing keys ki, …, kj, we have

e[i, j] = pr + (e[i, r − 1] + w(i, r − 1)) + (e[r + 1, j] + w(r + 1, j)).

Noting that

w(i, j) = w(i, r − 1) + pr + w(r + 1, j),

we rewrite e[i, j] as

e[i, j] = e[i, r − 1] + e[r + 1, j] + w(i, j).    (14.13)

The recursive equation (14.13) assumes that you know which node kr to use as the root. Of course, you choose the root that gives the lowest expected search cost, giving the final recursive formulation:

e[i, j] = qi−1                                               if j = i − 1,
e[i, j] = min {e[i, r − 1] + e[r + 1, j] + w(i, j) : i ≤ r ≤ j}   if i ≤ j.    (14.14)

The e[i, j] values give the expected search costs in optimal binary search trees. To help keep track of the structure of optimal binary search trees, define root[i, j], for 1 ≤ i ≤ j ≤ n, to be the index r for which kr is the root of an optimal binary search tree containing keys ki, …, kj.

Although we’ll see how to compute the values of root[ i, j], the construction of an optimal binary search tree from these values is left as

Exercise 14.5-1.

Step 3: Computing the expected search cost of an optimal binary search

tree

At this point, you may have noticed some similarities between our

characterizations of optimal binary search trees and matrix-chain

multiplication. For both problem domains, the subproblems consist of

contiguous index subranges. A direct, recursive implementation of

equation (14.14) would be just as inefficient as a direct, recursive matrix-

chain multiplication algorithm. Instead, you can store the e[ i, j] values in a table e[1 : n + 1, 0 : n]. The first index needs to run to n + 1 rather than n because in order to have a subtree containing only the dummy

key dn, you need to compute and store e[ n + 1, n]. The second index needs to start from 0 because in order to have a subtree containing only

the dummy key d0, you need to compute and store e[1, 0]. Only the entries e[i, j] for which j ≥ i − 1 are filled in. The table root[i, j] records the root of the subtree containing keys ki, …, kj and uses only the entries for which 1 ≤ i ≤ j ≤ n.

One other table makes the dynamic-programming algorithm a little

faster. Instead of computing the value of w(i, j) from scratch every time you compute e[i, j], which would take Θ(j − i) additions, store these values in a table w[1 : n + 1, 0 : n]. For the base case, compute w[i, i − 1] = qi−1 for 1 ≤ i ≤ n + 1. For j ≥ i, compute

w[i, j] = w[i, j − 1] + pj + qj.    (14.15)

Thus, you can compute the Θ(n²) values of w[i, j] in Θ(1) time each.

The OPTIMAL-BST procedure on the next page takes as inputs the

probabilities p 1, …, pn and q 0, …, qn and the size n, and it returns the tables e and root. From the description above and the similarity to the MATRIX-CHAIN-ORDER procedure in Section 14.2, you should find

the operation of this procedure to be fairly straightforward. The for

loop of lines 2–4 initializes the values of e[i, i − 1] and w[i, i − 1]. Then the for loop of lines 5–14 uses the recurrences (14.14) and (14.15) to compute e[i, j] and w[i, j] for all 1 ≤ i ≤ j ≤ n. In the first iteration, when l = 1, the loop computes e[i, i] and w[i, i] for i = 1, 2, …, n. The second iteration, with l = 2, computes e[i, i + 1] and w[i, i + 1] for i = 1, 2, …, n

− 1, and so on. The innermost for loop, in lines 10–14, tries each

candidate index r to determine which key kr to use as the root of an optimal binary search tree containing keys ki, …, kj. This for loop saves the current value of the index r in root[ i, j] whenever it finds a better key to use as the root.

OPTIMAL-BST(p, q, n)
1   let e[1 : n + 1, 0 : n], w[1 : n + 1, 0 : n], and root[1 : n, 1 : n] be new tables
2   for i = 1 to n + 1                             // base cases
3       e[i, i − 1] = qi−1                         // equation (14.14)
4       w[i, i − 1] = qi−1
5   for l = 1 to n
6       for i = 1 to n − l + 1
7           j = i + l − 1
8           e[i, j] = ∞
9           w[i, j] = w[i, j − 1] + pj + qj        // equation (14.15)
10          for r = i to j                         // try all possible roots r
11              t = e[i, r − 1] + e[r + 1, j] + w[i, j]   // equation (14.14)
12              if t < e[i, j]                     // new minimum?
13                  e[i, j] = t
14                  root[i, j] = r
15  return e and root
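The pseudocode translates almost line for line into Python. The sketch below is an illustrative implementation (the function name and the list-of-lists tables are our choices, not the book's); p is passed with an unused placeholder at index 0 so that p[j] matches the book's 1-based pj.

```python
import math

def optimal_bst(p, q, n):
    # p[1..n]: probabilities of the actual keys (p[0] is an unused placeholder);
    # q[0..n]: probabilities of the dummy keys.
    e = [[0.0] * (n + 1) for _ in range(n + 2)]    # e[i][j], 1 <= i <= n+1, 0 <= j <= n
    w = [[0.0] * (n + 1) for _ in range(n + 2)]
    root = [[0] * (n + 1) for _ in range(n + 1)]   # root[i][j], 1 <= i <= j <= n
    for i in range(1, n + 2):                      # base cases: only the dummy key d_{i-1}
        e[i][i - 1] = q[i - 1]
        w[i][i - 1] = q[i - 1]
    for l in range(1, n + 1):                      # subproblem sizes, as in lines 5-14
        for i in range(1, n - l + 2):
            j = i + l - 1
            e[i][j] = math.inf
            w[i][j] = w[i][j - 1] + p[j] + q[j]    # equation (14.15)
            for r in range(i, j + 1):              # try each candidate root k_r
                t = e[i][r - 1] + e[r + 1][j] + w[i][j]
                if t < e[i][j]:
                    e[i][j] = t
                    root[i][j] = r
    return e, root
```

On the key distribution of Figure 14.9 (p = ⟨0.15, 0.10, 0.05, 0.10, 0.20⟩ and q = ⟨0.05, 0.10, 0.05, 0.05, 0.05, 0.10⟩), this returns e[1][5] = 2.75 and root[1][5] = 2, agreeing with Figure 14.10.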

Figure 14.10 shows the tables e[ i, j], w[ i, j], and root[ i, j] computed by

the procedure OPTIMAL-BST on the key distribution shown in Figure

14.9. As in the matrix-chain multiplication example of Figure 14.5, the

tables are rotated to make the diagonals run horizontally. OPTIMAL-

BST computes the rows from bottom to top and from left to right

within each row.

The OPTIMAL-BST procedure takes Θ(n³) time, just like MATRIX-CHAIN-ORDER. Its running time is O(n³), since its for loops are nested three deep and each loop index takes on at most n values. The loop indices in OPTIMAL-BST do not have exactly the same bounds as those in MATRIX-CHAIN-ORDER, but they are within at most 1 in all directions. Thus, like MATRIX-CHAIN-ORDER, the OPTIMAL-BST procedure takes Ω(n³) time.


Figure 14.10 The tables e[ i, j], w[ i, j], and root[ i, j] computed by OPTIMAL-BST on the key distribution shown in Figure 14.9. The tables are rotated so that the diagonals run horizontally.

Exercises

14.5-1

Write pseudocode for the procedure CONSTRUCT-OPTIMAL-

BST( root, n) which, given the table root[1 : n, 1 : n], outputs the

structure of an optimal binary search tree. For the example in Figure

14.10, your procedure should print out the structure

k 2 is the root

k 1 is the left child of k 2

d 0 is the left child of k 1

d 1 is the right child of k 1

k 5 is the right child of k 2

k 4 is the left child of k 5

k 3 is the left child of k 4

d 2 is the left child of k 3

d 3 is the right child of k 3

d 4 is the right child of k 4

d 5 is the right child of k 5

corresponding to the optimal binary search tree shown in Figure

14.9(b).
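One way to attack this exercise is a preorder walk of the root table: the root of keys ki, …, kj is the key indexed by root[i, j], its left subtree covers ki, …, kr−1, and an empty range (j < i) stands for the dummy key dj. A Python sketch (the dictionary-keyed root table is an illustrative choice):

```python
def construct_optimal_bst(root, n):
    # root: dict mapping (i, j) to the index r of the optimal root of keys k_i..k_j
    lines = []
    def walk(i, j, parent, side):
        if j < i:                          # empty range: the dummy key d_j
            lines.append(f"d{j} is the {side} child of k{parent}")
            return
        r = root[(i, j)]
        if parent is None:
            lines.append(f"k{r} is the root")
        else:
            lines.append(f"k{r} is the {side} child of k{parent}")
        walk(i, r - 1, r, "left")          # left subtree holds k_i..k_{r-1}
        walk(r + 1, j, r, "right")         # right subtree holds k_{r+1}..k_j
    walk(1, n, None, None)
    return lines
```

With the root values from Figure 14.10 (root[1, 5] = 2, root[1, 1] = 1, root[3, 5] = 5, root[3, 4] = 4, root[3, 3] = 3), the walk produces exactly the eleven lines listed above.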

14.5-2

Determine the cost and structure of an optimal binary search tree for a

set of n = 7 keys with the following probabilities:

i   |    0     1     2     3     4     5     6     7
pi  |        0.04  0.06  0.08  0.02  0.10  0.12  0.14
qi  |  0.06  0.06  0.06  0.06  0.05  0.05  0.05  0.05

14.5-3

Suppose that instead of maintaining the table w[ i, j], you computed the value of w( i, j) directly from equation (14.12) in line 9 of OPTIMAL-BST and used this computed value in line 11. How would this change

affect the asymptotic running time of OPTIMAL-BST?

14.5-4

Knuth [264] has shown that there are always roots of optimal subtrees such that root[i, j − 1] ≤ root[i, j] ≤ root[i + 1, j] for all 1 ≤ i < j ≤ n. Use this fact to modify the OPTIMAL-BST procedure to run in Θ(n²) time.

Problems

14-1 Longest simple path in a directed acyclic graph

You are given a directed acyclic graph G = ( V, E) with real-valued edge weights and two distinguished vertices s and t. The weight of a path is the sum of the weights of the edges in the path. Describe a dynamic-


programming approach for finding a longest weighted simple path from

s to t. What is the running time of your algorithm?
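Because the graph is acyclic, every path is simple, so one standard approach is to topologically sort G and then relax edges in topological order, maximizing instead of minimizing; the time is O(V + E). A Python sketch under illustrative data formats (a vertex list plus (u, v, weight) edge triples):

```python
import math
from collections import defaultdict, deque

def longest_simple_path_dag(vertices, edges, s, t):
    # edges: (u, v, weight) triples; G must be acyclic
    adj = defaultdict(list)
    indeg = {v: 0 for v in vertices}
    for u, v, w in edges:
        adj[u].append((v, w))
        indeg[v] += 1
    queue = deque(v for v in vertices if indeg[v] == 0)   # Kahn's topological sort
    topo = []
    while queue:
        u = queue.popleft()
        topo.append(u)
        for v, _ in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    dist = {v: -math.inf for v in vertices}
    pred = {}
    dist[s] = 0
    for u in topo:                     # relax in topological order, maximizing
        if dist[u] == -math.inf:
            continue
        for v, w in adj[u]:
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
    if dist[t] == -math.inf:
        return None, []                # no s-to-t path exists
    path = [t]
    while path[-1] != s:
        path.append(pred[path[-1]])
    path.reverse()
    return dist[t], path
```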

14-2 Longest palindrome subsequence

A palindrome is a nonempty string over some alphabet that reads the

same forward and backward. Examples of palindromes are all strings of

length 1, civic, racecar, and aibohphobia (fear of palindromes).

Give an efficient algorithm to find the longest palindrome that is a

subsequence of a given input string. For example, given the input

character, your algorithm should return carac. What is the running

time of your algorithm?
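One route is an interval dynamic program over substrings s[i..j], followed by a reconstruction pass; it runs in Θ(n²) time and space. A Python sketch (function name ours):

```python
def longest_palindrome_subsequence(s):
    n = len(s)
    if n == 0:
        return ""
    L = [[0] * n for _ in range(n)]    # L[i][j]: LPS length of s[i..j]
    for i in range(n):
        L[i][i] = 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            if s[i] == s[j]:
                L[i][j] = (L[i + 1][j - 1] + 2) if length > 2 else 2
            else:
                L[i][j] = max(L[i + 1][j], L[i][j - 1])
    # Reconstruct one optimal palindrome from the table.
    i, j = 0, n - 1
    left, right = [], []
    while i <= j:
        if i == j:
            left.append(s[i])          # middle character of an odd-length answer
            break
        if s[i] == s[j]:
            left.append(s[i])
            right.append(s[j])
            i += 1
            j -= 1
        elif L[i + 1][j] >= L[i][j - 1]:
            i += 1
        else:
            j -= 1
    return "".join(left) + "".join(reversed(right))
```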

14-3 Bitonic euclidean traveling-salesperson problem

In the euclidean traveling-salesperson problem, you are given a set of n

points in the plane, and your goal is to find the shortest closed tour that

connects all n points.

Figure 14.11 Seven points in the plane, shown on a unit grid. (a) The shortest closed tour, with length approximately 24.89. This tour is not bitonic. (b) The shortest bitonic tour for the same set of points. Its length is approximately 25.58.

Figure 14.11(a) shows the solution to a 7-point problem. The general problem is NP-hard, and its solution is therefore believed to require

more than polynomial time (see Chapter 34).

J. L. Bentley has suggested simplifying the problem by considering

only bitonic tours, that is, tours that start at the leftmost point, go strictly rightward to the rightmost point, and then go strictly leftward

back to the starting point. Figure 14.11(b) shows the shortest bitonic


tour of the same 7 points. In this case, a polynomial-time algorithm is

possible.

Describe an O(n²)-time algorithm for determining an optimal

bitonic tour. You may assume that no two points have the same x-

coordinate and that all operations on real numbers take unit time.

( Hint: Scan left to right, maintaining optimal possibilities for the two

parts of the tour.)

14-4 Printing neatly

Consider the problem of neatly printing a paragraph with a

monospaced font (all characters having the same width). The input text

is a sequence of n words of lengths l 1, l 2, …, ln, measured in characters, which are to be printed neatly on a number of lines that hold a

maximum of M characters each. No word exceeds the line length, so

that li ≤ M for i = 1, 2, …, n. The criterion of “neatness” is as follows. If a given line contains words i through j, where i ≤ j, and exactly one space appears between words, then the number of extra space characters at the end of the line is

M − j + i − (li + li+1 + ⋯ + lj),

which must be nonnegative

so that the words fit on the line. The goal is to minimize the sum, over

all lines except the last, of the cubes of the numbers of extra space

characters at the ends of lines. Give a dynamic-programming algorithm

to print a paragraph of n words neatly. Analyze the running time and

space requirements of your algorithm.
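One possible dynamic program lets c[j] be the cheapest cost of printing words 1 through j, trying every feasible start i of the last line; the inner loop stops as soon as the line overflows. A Python sketch (names are illustrative):

```python
import math

def print_neatly(lengths, M):
    # lengths[k-1] = l_k; returns (total cost, list of (i, j) word ranges per line)
    n = len(lengths)
    c = [0.0] * (n + 1)        # c[j]: min cost of printing words 1..j
    brk = [0] * (n + 1)        # brk[j]: first word on the line that ends with word j
    for j in range(1, n + 1):
        c[j] = math.inf
        width = -1
        for i in range(j, 0, -1):
            width += lengths[i - 1] + 1                # word i plus one separating space
            if width > M:
                break
            cost = 0 if j == n else (M - width) ** 3   # last line costs nothing
            if c[i - 1] + cost < c[j]:
                c[j] = c[i - 1] + cost
                brk[j] = i
    lines, j = [], n
    while j >= 1:              # recover the line breaks
        lines.append((brk[j], j))
        j = brk[j] - 1
    lines.reverse()
    return c[n], lines
```

For word lengths ⟨3, 2, 2, 5⟩ and M = 6 (an invented example), the best layout is lines (1, 1), (2, 3), (4, 4) with cost 3³ + 1³ = 28.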

14-5 Edit distance

In order to transform a source string of text x[1 : m] to a target string y[1 : n], you can perform various transformation operations. The goal is, given x and y, to produce a series of transformations that changes x to y. An array z—assumed to be large enough to hold all the characters it

needs—holds the intermediate results. Initially, z is empty, and at

termination, you should have z[ j] = y[ j] for j = 1, 2, …, n. The procedure for solving this problem maintains current indices i into x and j into z, and the operations are allowed to alter z and these indices. Initially, i = j

= 1. Every character in x must be examined during the transformation,

which means that at the end of the sequence of transformation operations, i = m + 1.

You may choose from among six transformation operations, each of

which has a constant cost that depends on the operation:

Copy a character from x to z by setting z[ j] = x[ i] and then incrementing both i and j. This operation examines x[ i] and has cost QC.

Replace a character from x by another character c, by setting z[ j] = c, and then incrementing both i and j. This operation examines x[ i] and has cost QR.

Delete a character from x by incrementing i but leaving j alone. This operation examines x[ i] and has cost QD.

Insert the character c into z by setting z[ j] = c and then incrementing j, but leaving i alone. This operation examines no characters of x and has cost QI.

Twiddle (i.e., exchange) the next two characters by copying them from x

to z but in the opposite order: setting z[ j] = x[ i + 1] and z[ j + 1] = x[ i], and then setting i = i + 2 and j = j + 2. This operation examines x[ i]

and x[ i + 1] and has cost QT.

Kill the remainder of x by setting i = m + 1. This operation examines all characters in x that have not yet been examined. This operation, if

performed, must be the final operation. It has cost QK.

Figure 14.12 gives one way to transform the source string

algorithm to the target string altruistic. Several other sequences

of transformation operations can transform algorithm to

altruistic.

Assume that QC < QD + QI and QR < QD + QI, since otherwise, the copy and replace operations would not be used. The cost of a given

sequence of transformation operations is the sum of the costs of the

individual operations in the sequence. For the sequence above, the cost

of transforming algorithm to altruistic is 3 QC + QR + QD +

4 QI + QT + QK.


a. Given two sequences x[1 : m] and y[1 : n] and the costs of the transformation operations, the edit distance from x to y is the cost of the least expensive operation sequence that transforms x to y.

Describe a dynamic-programming algorithm that finds the edit

distance from x[1 : m] to y[1 : n] and prints an optimal operation sequence. Analyze the running time and space requirements of your

algorithm.
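A sketch of the table for part (a), restricted to copy, replace, delete, and insert (twiddle and kill are omitted here to keep the sketch short; the default cost values are invented, chosen to satisfy QC < QD + QI and QR < QD + QI):

```python
import math

def edit_distance(x, y, qc=1, qr=2, qd=3, qi=3):
    # d[i][j]: cheapest way to turn x[:i] into y[:j]
    m, n = len(x), len(y)
    d = [[math.inf] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if d[i][j] == math.inf:
                continue
            if i < m and j < n and x[i] == y[j]:
                d[i + 1][j + 1] = min(d[i + 1][j + 1], d[i][j] + qc)   # copy
            if i < m and j < n:
                d[i + 1][j + 1] = min(d[i + 1][j + 1], d[i][j] + qr)   # replace
            if i < m:
                d[i + 1][j] = min(d[i + 1][j], d[i][j] + qd)           # delete
            if j < n:
                d[i][j + 1] = min(d[i][j + 1], d[i][j] + qi)           # insert
    return d[m][n]
```

Handling twiddle adds a transition from d[i][j] to d[i + 2][j + 2] when x[i] = y[j + 1] and x[i + 1] = y[j]; handling kill means taking the minimum of d[i][n] + QK over all i at the end. Recording which operation achieved each minimum yields the optimal operation sequence.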

Figure 14.12 A sequence of operations that transforms the source algorithm to the target string altruistic. The underlined characters are x[ i] and z[ j] after the operation.

The edit-distance problem generalizes the problem of aligning two

DNA sequences (see, for example, Setubal and Meidanis [405, Section

3.2]). There are several methods for measuring the similarity of two

DNA sequences by aligning them. One such method to align two

sequences x and y consists of inserting spaces at arbitrary locations in the two sequences (including at either end) so that the resulting

sequences x′ and y′ have the same length but do not have a space in the same position (i.e., for no position j are both x′[ j] and y′[ j] a space).

Then we assign a “score” to each position. Position j receives a score as

follows:

+1 if x′[ j] = y′[ j] and neither is a space,

−1 if x′[ j] ≠ y′[ j] and neither is a space,

−2 if either x′[ j] or y′[ j] is a space.

The score for the alignment is the sum of the scores of the individual positions. For example, given the sequences x = GATCGGCAT and y =

CAATGTGAATC, one alignment is

G ATCG GCAT

CAAT GTGAATC

-*++*+*+-++*

A + under a position indicates a score of +1 for that position, a -

indicates a score of −1, and a * indicates a score of −2, so that this

alignment has a total score of 6 · 1 − 2 · 1 − 4 · 2 = −4.

b. Explain how to cast the problem of finding an optimal alignment as

an edit-distance problem using a subset of the transformation

operations copy, replace, delete, insert, twiddle, and kill.

14-6 Planning a company party

Professor Blutarsky is consulting for the president of a corporation that

is planning a company party. The company has a hierarchical structure,

that is, the supervisor relation forms a tree rooted at the president. The

human resources department has ranked each employee with a

conviviality rating, which is a real number. In order to make the party

fun for all attendees, the president does not want both an employee and

his or her immediate supervisor to attend.

Professor Blutarsky is given the tree that describes the structure of

the corporation, using the left-child, right-sibling representation

described in Section 10.3. Each node of the tree holds, in addition to the pointers, the name of an employee and that employee’s conviviality

ranking. Describe an algorithm to make up a guest list that maximizes

the sum of the conviviality ratings of the guests. Analyze the running

time of your algorithm.
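This is a tree DP with two states per employee: the best total conviviality for that employee's subtree if the employee attends, and if not. A Python sketch over an illustrative child-list representation (the book's left-child, right-sibling pointers would be traversed the same way); each node is visited once, so the time is Θ(n):

```python
def best_guest_list(rating, children, root):
    # rating[v]: conviviality of v; children[v]: list of v's direct reports
    def dfs(v):
        attend, skip = rating[v], 0.0
        for c in children.get(v, []):
            c_attend, c_skip = dfs(c)
            attend += c_skip                 # if v attends, reports must not
            skip += max(c_attend, c_skip)    # otherwise each report chooses freely
        return attend, skip
    return max(dfs(root))
```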

14-7 Viterbi algorithm

Dynamic programming on a directed graph can play a part in speech

recognition. A directed graph G = ( V, E) with labeled edges forms a formal model of a person speaking a restricted language. Each edge ( u,

v) ∈ E is labeled with a sound σ( u, v) from a finite set Σ of sounds. Each

directed path in the graph starting from a distinguished vertex v 0 ∈ V

corresponds to a possible sequence of sounds produced by the model,

with the label of a path being the concatenation of the labels of the

edges on that path.

a. Describe an efficient algorithm that, given an edge-labeled directed

graph G with distinguished vertex v 0 and a sequence s = 〈 σ 1, σ 2, …, σk〉 of sounds from Σ, returns a path in G that begins at v 0 and has s as its label, if any such path exists. Otherwise, the algorithm should

return NO-SUCH-PATH. Analyze the running time of your

algorithm. ( Hint: You may find concepts from Chapter 20 useful.) Now suppose that every edge ( u, v) ∈ E has an associated nonnegative probability p( u, v) of being traversed, so that the corresponding sound is produced. The sum of the probabilities of the edges leaving any vertex

equals 1. The probability of a path is defined to be the product of the

probabilities of its edges. Think of the probability of a path beginning at

vertex v 0 as the probability that a “random walk” beginning at v 0

follows the specified path, where the edge leaving a vertex u is taken randomly, according to the probabilities of the available edges leaving u.

b. Extend your answer to part (a) so that if a path is returned, it is a

most probable path starting at vertex v 0 and having label s. Analyze the running time of your algorithm.
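For part (a), one approach processes s one sound at a time, maintaining the set of vertices reachable from v0 by a path whose label is the prefix read so far, with back-pointers for reconstruction; the time is O(kE). A Python sketch (the edge-triple format is an illustrative choice, and None stands in for NO-SUCH-PATH). Part (b) follows the same table, keeping for each (vertex, prefix-length) pair the most probable way to reach it rather than an arbitrary one.

```python
from collections import defaultdict

def find_sound_path(edges, v0, s):
    # edges: (u, v, sound) triples; returns a vertex path from v0 whose edge
    # labels spell s, or None in place of NO-SUCH-PATH
    adj = defaultdict(list)
    for u, v, sound in edges:
        adj[u].append((v, sound))
    parent = {(v0, 0): None}     # parent[(v, k)]: predecessor after matching s[:k]
    frontier = {v0}
    for k, sound in enumerate(s):
        nxt = set()
        for u in frontier:
            for v, lbl in adj[u]:
                if lbl == sound and (v, k + 1) not in parent:
                    parent[(v, k + 1)] = u
                    nxt.add(v)
        if not nxt:
            return None
        frontier = nxt
    v, k, path = next(iter(frontier)), len(s), []
    while v is not None:         # follow back-pointers to rebuild one path
        path.append(v)
        v = parent[(v, k)]
        k -= 1
    path.reverse()
    return path
```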

14-8 Image compression by seam carving

Suppose that you are given a color picture consisting of an m× n array

A[1 : m, 1 : n] of pixels, where each pixel specifies a triple of red, green, and blue (RGB) intensities. You want to compress this picture slightly,

by removing one pixel from each of the m rows, so that the whole

picture becomes one pixel narrower. To avoid incongruous visual effects,

however, the pixels removed in two adjacent rows must lie in either the

same column or adjacent columns. In this way, the pixels removed form

a “seam” from the top row to the bottom row, where successive pixels in

the seam are adjacent vertically or diagonally.

a. Show that the number of such possible seams grows at least exponentially in m, assuming that n > 1.

b. Suppose now that along with each pixel A[ i, j], you are given a real-valued disruption measure d[ i, j], indicating how disruptive it would be to remove pixel A[ i, j]. Intuitively, the lower a pixel’s disruption measure, the more similar the pixel is to its neighbors. Define the

disruption measure of a seam as the sum of the disruption measures

of its pixels.

Give an algorithm to find a seam with the lowest disruption measure.

How efficient is your algorithm?
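Part (b) admits a row-by-row DP: the cheapest seam ending at pixel (i, j) extends the cheapest seam ending at one of the up-to-three pixels above it. The table has mn entries, each filled in O(1) time, so the algorithm runs in Θ(mn) time. A Python sketch (the (total, column-per-row) return format is an illustrative choice):

```python
def min_disruption_seam(d):
    # d[i][j]: disruption measure of removing pixel (i, j)
    m, n = len(d), len(d[0])
    cost = [d[0][:]] + [[0.0] * n for _ in range(m - 1)]
    for i in range(1, m):
        for j in range(n):
            above = min(cost[i - 1][k] for k in (j - 1, j, j + 1) if 0 <= k < n)
            cost[i][j] = d[i][j] + above
    j = min(range(n), key=lambda k: cost[m - 1][k])
    total, seam = cost[m - 1][j], [j]
    for i in range(m - 1, 0, -1):          # backtrack upward through the table
        j = min((k for k in (j - 1, j, j + 1) if 0 <= k < n),
                key=lambda k: cost[i - 1][k])
        seam.append(j)
    seam.reverse()
    return total, seam
```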

14-9 Breaking a string

A certain string-processing programming language allows you to break

a string into two pieces. Because this operation copies the string, it costs

n time units to break a string of n characters into two pieces. Suppose that you want to break a string into many pieces. The order in which the

breaks occur can affect the total amount of time used. For example,

suppose that you want to break a 20-character string after characters 2,

8, and 10 (numbering the characters in ascending order from the left-

hand end, starting from 1). If you program the breaks to occur in left-

to-right order, then the first break costs 20 time units, the second break

costs 18 time units (breaking the string from characters 3 to 20 at

character 8), and the third break costs 12 time units, totaling 50 time

units. If you program the breaks to occur in right-to-left order, however,

then the first break costs 20 time units, the second break costs 10 time

units, and the third break costs 8 time units, totaling 38 time units. In

yet another order, you could break first at 8 (costing 20), then break the

left piece at 2 (costing another 8), and finally the right piece at 10

(costing 12), for a total cost of 40.

Design an algorithm that, given the numbers of characters after

which to break, determines a least-cost way to sequence those breaks.

More formally, given an array L[1 : m] containing the break points for a string of n characters, compute the lowest cost for a sequence of breaks,

along with a sequence of breaks that achieves this cost.
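The structure mirrors matrix-chain multiplication: treat 0, the break points, and n as boundaries, and let cost[i][j] be the cheapest way to perform every break strictly inside (b[i], b[j]); whichever break comes first in that piece costs the piece's full length. With m break points this runs in O(m³) time. A Python sketch that returns only the cost (recording the minimizing k alongside each entry recovers the break sequence):

```python
def min_break_cost(n, breaks):
    b = [0] + sorted(breaks) + [n]      # piece boundaries
    m = len(b)
    # cost[i][j]: cheapest way to make all breaks strictly between b[i] and b[j]
    cost = [[0.0] * m for _ in range(m)]
    for span in range(2, m):            # spans containing at least one break
        for i in range(m - span):
            j = i + span
            cost[i][j] = min(cost[i][k] + cost[k][j] for k in range(i + 1, j))
            cost[i][j] += b[j] - b[i]   # first break of this piece costs its length
    return cost[0][m - 1]
```

For the 20-character example with breaks at 2, 8, and 10, this returns 38, matching the best order described above.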

14-10 Planning an investment strategy

Your knowledge of algorithms helps you obtain an exciting job with a

hot startup, along with a $10,000 signing bonus. You decide to invest

this money with the goal of maximizing your return at the end of 10

years. You decide to use your investment manager, G. I. Luvcache, to

manage your signing bonus. The company that Luvcache works with

requires you to observe the following rules. It offers n different

investments, numbered 1 through n. In each year j, investment i provides a return rate of rij. In other words, if you invest d dollars in investment i in year j, then at the end of year j, you have drij dollars. The return rates are guaranteed, that is, you are given all the return rates for the next 10

years for each investment. You make investment decisions only once per

year. At the end of each year, you can leave the money made in the

previous year in the same investments, or you can shift money to other

investments, by either shifting money between existing investments or

moving money to a new investment. If you do not move your money

between two consecutive years, you pay a fee of f 1 dollars, whereas if

you switch your money, you pay a fee of f 2 dollars, where f 2 > f 1. You pay the fee once per year at the end of the year, and it is the same

amount, f 2, whether you move money in and out of only one

investment, or in and out of many investments.

a. The problem, as stated, allows you to invest your money in multiple

investments in each year. Prove that there exists an optimal investment

strategy that, in each year, puts all the money into a single investment.

(Recall that an optimal investment strategy maximizes the amount of

money after 10 years and is not concerned with any other objectives,

such as minimizing risk.)

b. Prove that the problem of planning your optimal investment strategy

exhibits optimal substructure.

c. Design an algorithm that plans your optimal investment strategy.

What is the running time of your algorithm?
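Using part (a), it suffices to track, for each investment i, the most money you can hold at year's end with everything in investment i; each year you either stay (fee f1) or move everything from the best current holding (fee f2). A Python sketch under the simplifying assumption, ours rather than the problem's, that no fee is charged after the final year:

```python
def best_return(rates, f1, f2, initial):
    # rates[j][i]: return rate of investment i in year j (0-indexed years)
    n = len(rates[0])
    v = [initial * rates[0][i] for i in range(n)]   # all money in one investment
    for j in range(1, len(rates)):
        best_switch = max(v) - f2                   # move everything from the best holding
        v = [max(v[i] - f1, best_switch) * rates[j][i] for i in range(n)]
    return max(v)
```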


d. Suppose that Luvcache’s company imposes the additional restriction

that, at any point, you can have no more than $15,000 in any one

investment. Show that the problem of maximizing your income at the

end of 10 years no longer exhibits optimal substructure.

14-11 Inventory planning

The Rinky Dink Company makes machines that resurface ice rinks. The

demand for such products varies from month to month, and so the

company needs to develop a strategy to plan its manufacturing given

the fluctuating, but predictable, demand. The company wishes to design

a plan for the next n months. For each month i, the company knows the

demand di, that is, the number of machines that it will sell. Let D = d1 + d2 + ⋯ + dn be the total demand over the next n months. The company keeps a full-time staff who provide labor to manufacture up to m machines per

month. If the company needs to make more than m machines in a given

month, it can hire additional, part-time labor, at a cost that works out

to c dollars per machine. Furthermore, if the company is holding any

unsold machines at the end of a month, it must pay inventory costs. The

company can hold up to D machines, with the cost for holding j

machines given as a function h( j) for j = 1, 2, …, D that monotonically increases with j.

Give an algorithm that calculates a plan for the company that

minimizes its costs while fulfilling all the demand. The running time

should be polynomial in n and D.
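One polynomial-time plan is a DP whose state is (month, machines held at month's end), giving O(nD²) time for demand bounded by D per month. A Python sketch (names illustrative; h is passed as a function, with h(0) taken to be 0):

```python
def min_inventory_cost(demands, m, c, D, h):
    # f[j]: cheapest way to end the current month holding j machines
    f = {0: 0.0}
    for d in demands:
        g = {}
        for held, cost in f.items():
            low = max(0, d - held)                   # must cover this month's demand
            for x in range(low, d - held + D + 1):   # produce x machines this month
                j = held + x - d                     # machines left over
                if j > D:
                    break
                total = cost + max(0, x - m) * c     # part-time labor beyond m machines
                total += h(j) if j > 0 else 0        # inventory holding cost
                if j not in g or total < g[j]:
                    g[j] = total
        f = g
    return min(f.values())
```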

14-12 Signing free-agent baseball players

Suppose that you are the general manager for a major-league baseball

team. During the off-season, you need to sign some free-agent players

for your team. The team owner has given you a budget of $ X to spend

on free agents. You are allowed to spend less than $ X, but the owner will fire you if you spend any more than $ X.

You are considering N different positions, and for each position, P

free-agent players who play that position are available.10 Because you do not want to overload your roster with too many players at any

position, for each position you may sign at most one free agent who

plays that position. (If you do not sign any players at a particular position, then you plan to stick with the players you already have at that

position.)

To determine how valuable a player is going to be, you decide to use

a sabermetric statistic11 known as “WAR,” or “wins above

replacement.” A player with a higher WAR is more valuable than a

player with a lower WAR. It is not necessarily more expensive to sign a

player with a higher WAR than a player with a lower WAR, because

factors other than a player’s value determine how much it costs to sign

them.

For each available free-agent player p, you have three pieces of

information:

the player’s position,

p.cost, the amount of money it costs to sign the player, and

p.war, the player’s WAR.

Devise an algorithm that maximizes the total WAR of the players

you sign while spending no more than $ X. You may assume that each

player signs for a multiple of $100,000. Your algorithm should output

the total WAR of the players you sign, the total amount of money you

spend, and a list of which players you sign. Analyze the running time

and space requirement of your algorithm.
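This is a grouped (multiple-choice) knapsack: measure the budget in $100,000 units and process one group of choices per position. With N positions of P players each, the time is O(NP · X/100,000) and the space is O(X/100,000). A Python sketch that returns only the best total WAR (recovering the money spent and the list of signed players just needs parent pointers per table entry):

```python
from collections import defaultdict

def max_total_war(players, budget, unit=100_000):
    # players: (position, cost, war) triples; at most one signing per position
    groups = defaultdict(list)
    for pos, cost, war in players:
        groups[pos].append((cost // unit, war))
    B = budget // unit
    best = [0.0] * (B + 1)          # best[b]: max WAR spending at most b units
    for group in groups.values():
        new = best[:]               # either skip this position ...
        for c, w in group:          # ... or sign exactly one of its players
            for b in range(c, B + 1):
                if best[b - c] + w > new[b]:
                    new[b] = best[b - c] + w
        best = new
    return best[B]
```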

Chapter notes

Bellman [44] began the systematic study of dynamic programming in 1955, publishing a book about it in 1957. The word “programming,”

both here and in linear programming, refers to using a tabular solution

method. Although optimization techniques incorporating elements of

dynamic programming were known earlier, Bellman provided the area

with a solid mathematical basis.

Galil and Park [172] classify dynamic-programming algorithms according to the size of the table and the number of other table entries each entry depends on. They call a dynamic-programming algorithm tD/eD if its table size is O(n^t) and each entry depends on O(n^e) other entries. For example, the matrix-chain multiplication algorithm in Section 14.2 is 2D/1D, and the longest-common-subsequence algorithm in Section 14.4 is 2D/0D.

The MATRIX-CHAIN-ORDER algorithm on page 378 is by

Muraoka and Kuck [339]. Hu and Shing [230, 231] give an O( n lg n)-

time algorithm for the matrix-chain multiplication problem.

The O( mn)-time algorithm for the longest-common-subsequence

problem appears to be a folk algorithm. Knuth [95] posed the question of whether subquadratic algorithms for the LCS problem exist. Masek

and Paterson [316] answered this question in the affirmative by giving an algorithm that runs in O(mn/lg n) time, where n ≤ m and the sequences are drawn from a set of bounded size. For the special case in

which no element appears more than once in an input sequence,

Szymanski [425] shows how to solve the problem in O(( n + m) lg( n +

m)) time. Many of these results extend to the problem of computing

string edit distances (Problem 14-5).

An early paper on variable-length binary encodings by Gilbert and

Moore [181], which had applications to constructing optimal binary search trees for the case in which all probabilities pi are 0, contains an

O( n 3)-time algorithm. Aho, Hopcroft, and Ullman [5] present the algorithm from Section 14.5. Splay trees [418], which modify the tree in response to the search queries, come within a constant factor of the

optimal bounds without being initialized with the frequencies. Exercise

14.5-4 is due to Knuth [264]. Hu and Tucker [232] devised an algorithm for the case in which all probabilities pi are 0 that uses O( n 2) time and O( n) space. Subsequently, Knuth [261] reduced the time to O( n lg n).

Problem 14-8 is due to Avidan and Shamir [30], who have posted on

the web a wonderful video illustrating this image-compression

technique.

1 If pieces are required to be cut in order of monotonically increasing size, there are fewer ways

to consider. For n = 4, only 5 such ways are possible: parts (a), (b), (c), (e), and (h) in Figure

14.2. The number of ways is called the partition function, which is approximately equal to e^(π√(2n/3)) / (4√3 n). This quantity is less than 2^(n−1), but still much greater than any polynomial in n. We won’t pursue this line of inquiry further, however.

2 The technical term “memoization” is not a misspelling of “memorization.” The word

“memoization” comes from “memo,” since the technique consists of recording a value to be looked up later.

3 None of the three methods from Sections 4.1 and 4.2 can be used directly, because they apply only to square matrices.

4 The term counts all pairs in which i < j. Because i and j may be equal, we need to add in the n term.

5 We use the term “unweighted” to distinguish this problem from that of finding shortest paths with weighted edges, which we shall see in Chapters 22 and 23. You can use the breadth-first search technique of Chapter 20 to solve the unweighted problem.

6 It may seem strange that dynamic programming relies on subproblems being both independent and overlapping. Although these requirements may sound contradictory, they describe two different notions, rather than two points on the same axis. Two subproblems of the same problem are independent if they do not share resources. Two subproblems are overlapping if they are really the same subproblem that occurs as a subproblem of different problems.

7 This approach presupposes that you know the set of all possible subproblem parameters and that you have established the relationship between table positions and subproblems. Another, more general, approach is to memoize by using hashing with the subproblem parameters as keys.

8 If the subject of the text is ancient Rome, you might want naumachia to appear near the root.

9 Yes, naumachia has a Latvian counterpart: nomačija.

10 Although there are nine positions on a baseball team, N is not necessarily equal to 9 because some general managers have particular ways of thinking about positions. For example, a general manager might consider right-handed pitchers and left-handed pitchers to be separate

“positions,” as well as starting pitchers, long relief pitchers (relief pitchers who can pitch several innings), and short relief pitchers (relief pitchers who normally pitch at most only one inning).

11 Sabermetrics is the application of statistical analysis to baseball records. It provides several ways to compare the relative values of individual players.

15 Greedy Algorithms

Algorithms for optimization problems typically go through a sequence

of steps, with a set of choices at each step. For many optimization

problems, using dynamic programming to determine the best choices is

overkill, and simpler, more efficient algorithms will do. A greedy

algorithm always makes the choice that looks best at the moment. That

is, it makes a locally optimal choice in the hope that this choice leads to

a globally optimal solution. This chapter explores optimization

problems for which greedy algorithms provide optimal solutions. Before

reading this chapter, you should read about dynamic programming in

Chapter 14, particularly Section 14.3.

Greedy algorithms do not always yield optimal solutions, but for

many problems they do. We first examine, in Section 15.1, a simple but nontrivial problem, the activity-selection problem, for which a greedy

algorithm efficiently computes an optimal solution. We’ll arrive at the

greedy algorithm by first considering a dynamic-programming approach

and then showing that an optimal solution can result from always

making greedy choices. Section 15.2 reviews the basic elements of the greedy approach, giving a direct approach for proving greedy

algorithms correct. Section 15.3 presents an important application of greedy techniques: designing data-compression (Huffman) codes.

Finally, Section 15.4 shows that in order to decide which blocks to replace when a miss occurs in a cache, the “furthest-in-future” strategy

is optimal if the sequence of block accesses is known in advance.

The greedy method is quite powerful and works well for a wide range

of problems. Later chapters will present many algorithms that you can

view as applications of the greedy method, including minimum-

spanning-tree algorithms (Chapter 21), Dijkstra’s algorithm for shortest paths from a single source (Section 22.3), and a greedy set-covering heuristic (Section 35.3). Minimum-spanning-tree algorithms furnish a classic example of the greedy method. Although you can read this

chapter and Chapter 21 independently of each other, you might find it useful to read them together.

15.1 An activity-selection problem

Our first example is the problem of scheduling several competing

activities that require exclusive use of a common resource, with a goal of

selecting a maximum-size set of mutually compatible activities. Imagine

that you are in charge of scheduling a conference room. You are

presented with a set S = { a 1, a 2, … , an} of n proposed activities that wish to reserve the conference room, and the room can serve only one

activity at a time. Each activity ai has a start time si and a finish time fi, where 0 ≤ si < fi < ∞. If selected, activity ai takes place during the half-open time interval [ si, fi). Activities ai and aj are compatible if the intervals [ si, fi) and [ sj, fj) do not overlap. That is, ai and aj are compatible if si ≥ fj or sj ≥ fi. (Assume that if your staff needs time to change over the room from one activity to the next, the changeover time

is built into the intervals.) In the activity-selection problem, your goal is

to select a maximum-size subset of mutually compatible activities.

Assume that the activities are sorted in monotonically increasing order

of finish time:

(We’ll see later the advantage that this assumption provides.) For

example, consider the set of activities in Figure 15.1. The subset { a 3, a 9, a 11} consists of mutually compatible activities. It is not a maximum subset, however, since the subset { a 1, a 4, a 8, a 11} is larger. In fact, { a 1,

a 4, a 8, a 11} is a largest subset of mutually compatible activities, and another largest subset is { a 2, a 4, a 9, a 11}.

We’ll see how to solve this problem, proceeding in several steps. First

we’ll explore a dynamic-programming solution, in which you consider

several choices when determining which subproblems to use in an

optimal solution. We’ll then observe that you need to consider only one

choice—the greedy choice—and that when you make the greedy choice,

only one subproblem remains. Based on these observations, we’ll

develop a recursive greedy algorithm to solve the activity-selection

problem. Finally, we’ll complete the process of developing a greedy

solution by converting the recursive algorithm to an iterative one.

Although the steps we go through in this section are slightly more

involved than is typical when developing a greedy algorithm, they

illustrate the relationship between greedy algorithms and dynamic

programming.

Figure 15.1 A set { a 1, a 2, … , a 11} of activities. Activity ai has start time si and finish time fi.

The optimal substructure of the activity-selection problem

Let’s verify that the activity-selection problem exhibits optimal

substructure. Denote by Sij the set of activities that start after activity ai finishes and that finish before activity aj starts. Suppose that you want

to find a maximum set of mutually compatible activities in Sij, and

suppose further that such a maximum set is Aij, which includes some

activity ak. By including ak in an optimal solution, you are left with two subproblems: finding mutually compatible activities in the set Sik

(activities that start after activity ai finishes and that finish before activity ak starts) and finding mutually compatible activities in the set

Skj (activities that start after activity ak finishes and that finish before

activity aj starts). Let Aik = Aij ∩ Sik and Akj = Aij ∩ Skj, so that Aik contains the activities in Aij that finish before ak starts and Akj contains the activities in Aij that start after ak finishes. Thus, we have Aij = Aik ∪ { ak} ∪ Akj, and so the maximum-size set Aij of mutually compatible activities in Sij consists of | Aij | = | Aik| + | Akj | + 1 activities.

The usual cut-and-paste argument shows that an optimal solution

Aij must also include optimal solutions to the two subproblems for Sik

and Skj. If you could find a set A′kj of mutually compatible activities in Skj with | A′kj | > | Akj |, then you could use A′kj, rather than Akj, in a solution to the subproblem for Sij. You would have constructed a set Aik ∪ { ak} ∪ A′kj of | Aik | + | A′kj | + 1 > | Aij | mutually compatible activities, which contradicts the assumption that Aij is an optimal solution. A symmetric argument applies to the activities in Sik.

This way of characterizing optimal substructure suggests that you

can solve the activity-selection problem by dynamic programming. Let’s

denote the size of an optimal solution for the set Sij by c[ i, j]. Then, the dynamic-programming approach gives the recurrence

c[ i, j] = c[ i, k] + c[ k, j] + 1.

Of course, if you do not know that an optimal solution for the set Sij

includes activity ak, you must examine all activities in Sij to find which one to choose, so that

c[ i, j] = 0                                          if Sij = ∅,
c[ i, j] = max { c[ i, k] + c[ k, j] + 1 : ak ∈ Sij}  if Sij ≠ ∅.    (15.2)

You can then develop a recursive algorithm and memoize it, or you can

work bottom-up and fill in table entries as you go along. But you would

be overlooking another important characteristic of the activity-selection

problem that you can use to great advantage.
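To make the bookkeeping concrete, here is one way the memoized approach based on recurrence (15.2) could look in Python. This is a sketch, not code from the text: the function name is ours, and fictitious activities a0 (finishing at time 0) and a_{n+1} (starting at time ∞) bound the subproblems so that S_{0,n+1} is the whole set.

```python
import math
from functools import lru_cache

def dp_activity_selection(s, f):
    """Size of a maximum set of mutually compatible activities,
    computed by memoizing recurrence (15.2). s and f list start and
    finish times, already sorted by finish time (0-indexed input)."""
    n = len(s)
    # Pad with fictitious activities: a_0 finishes at time 0 and
    # a_{n+1} starts at time infinity, so S_{0,n+1} is all of S.
    s = [0] + list(s) + [math.inf]
    f = [0] + list(f) + [math.inf]

    @lru_cache(maxsize=None)
    def c(i, j):
        # c(i, j) = size of an optimal solution for subproblem S_ij
        best = 0
        for k in range(i + 1, j):
            if s[k] >= f[i] and f[k] <= s[j]:  # a_k lies in S_ij
                best = max(best, c(i, k) + c(k, j) + 1)
        return best

    return c(0, n + 1)
```

Because each pair (i, j) is solved once and each solve scans O(n) candidate roots, this runs in O(n³) time, far slower than the greedy algorithm developed next.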

Making the greedy choice

What if you could choose an activity to add to an optimal solution without having to first solve all the subproblems? That could save you

from having to consider all the choices inherent in recurrence (15.2). In

fact, for the activity-selection problem, you need to consider only one

choice: the greedy choice.

What is the greedy choice for the activity-selection problem?

Intuition suggests that you should choose an activity that leaves the

resource available for as many other activities as possible. Of the

activities you end up choosing, one of them must be the first one to

finish. Intuition says, therefore, choose the activity in S with the earliest

finish time, since that leaves the resource available for as many of the

activities that follow it as possible. (If more than one activity in S has

the earliest finish time, then choose any such activity.) In other words,

since the activities are sorted in monotonically increasing order by finish

time, the greedy choice is activity a 1. Choosing the first activity to finish

is not the only way to think of making a greedy choice for this problem.

Exercise 15.1-3 asks you to explore other possibilities.

Once you make the greedy choice, you have only one remaining

subproblem to solve: finding activities that start after a 1 finishes. Why

don’t you have to consider activities that finish before a 1 starts? Because

s 1 < f 1, and because f 1 is the earliest finish time of any activity, no activity can have a finish time less than or equal to s 1. Thus, all activities that are compatible with activity a 1 must start after a 1 finishes.

Furthermore, we have already established that the activity-selection

problem exhibits optimal substructure. Let Sk = { ai ∈ S : si ≥ fk} be the set of activities that start after activity ak finishes. If you make the greedy choice of activity a 1, then S 1 remains as the only subproblem to solve. 1 Optimal substructure says that if a 1 belongs to an optimal solution, then an optimal solution to the original problem consists of

activity a 1 and all the activities in an optimal solution to the

subproblem S 1.

One big question remains: Is this intuition correct? Is the greedy

choice—in which you choose the first activity to finish—always part of

some optimal solution? The following theorem shows that it is.

Theorem 15.1

Consider any nonempty subproblem Sk, and let am be an activity in Sk with the earliest finish time. Then am is included in some maximum-size

subset of mutually compatible activities of Sk.

Proof Let Ak be a maximum-size subset of mutually compatible activities in Sk, and let aj be the activity in Ak with the earliest finish time. If aj = am, we are done, since we have shown that am belongs to some maximum-size subset of mutually compatible activities of Sk. If aj ≠ am, let the set A′k = (Ak − { aj}) ∪ { am} be Ak but substituting am for aj. The activities in A′k are compatible, which follows because the activities in Ak are compatible, aj is the first activity in Ak to finish, and fm ≤ fj. Since | A′k | = | Ak |, we conclude that A′k is a maximum-size subset of mutually compatible activities of Sk, and it includes am.

Although you might be able to solve the activity-selection problem

with dynamic programming, Theorem 15.1 says that you don’t need to.

Instead, you can repeatedly choose the activity that finishes first, keep

only the activities compatible with this activity, and repeat until no

activities remain. Moreover, because you always choose the activity with

the earliest finish time, the finish times of the activities that you choose

must strictly increase. You can consider each activity just once overall,

in monotonically increasing order of finish times.

An algorithm to solve the activity-selection problem does not need

to work bottom-up, like a table-based dynamic-programming

algorithm. Instead, it can work top-down, choosing an activity to put

into the optimal solution that it constructs and then solving the

subproblem of choosing activities from those that are compatible with

those already chosen. Greedy algorithms typically have this top-down

design: make a choice and then solve a subproblem, rather than the

bottom-up technique of solving subproblems before making a choice.

A recursive greedy algorithm

Now that you know you can bypass the dynamic-programming

approach and instead use a top-down, greedy algorithm, let’s see a

straightforward, recursive procedure to solve the activity-selection

problem. The procedure RECURSIVE-ACTIVITY-SELECTOR on

the following page takes the start and finish times of the activities,

represented as arrays s and f, 2 the index k that defines the subproblem Sk it is to solve, and the size n of the original problem. It returns a maximum-size set of mutually compatible activities in Sk. The

procedure assumes that the n input activities are already ordered by

monotonically increasing finish time, according to equation (15.1). If

not, you can first sort them into this order in O( n lg n) time, breaking ties arbitrarily. In order to start, add the fictitious activity a 0 with f 0 =

0, so that subproblem S 0 is the entire set of activities S. The initial call, which solves the entire problem, is RECURSIVE-ACTIVITY-SELECTOR ( s, f, 0, n).

RECURSIVE-ACTIVITY-SELECTOR(s, f, k, n)

1  m = k + 1
2  while m ≤ n and s[m] < f[k]   // find the first activity in Sk to finish
3      m = m + 1
4  if m ≤ n
5      return {am} ∪ RECURSIVE-ACTIVITY-SELECTOR(s, f, m, n)
6  else return ∅

Figure 15.2 shows how the algorithm operates on the activities in

Figure 15.1. In a given recursive call RECURSIVE-ACTIVITY-

SELECTOR ( s, f, k, n), the while loop of lines 2–3 looks for the first activity in Sk to finish. The loop examines ak+1, ak+2, … , an, until it finds the first activity am that is compatible with ak, which means that sm ≥ fk. If the loop terminates because it finds such an activity, line 5

returns the union of { am} and the maximum-size subset of Sm returned by the recursive call RECURSIVE-ACTIVITY-SELECTOR ( s, f, m, n). Alternatively, the loop may terminate because m > n, in which case the procedure has examined all activities in Sk without finding one that

is compatible with ak. In this case, Sk = ∅ , and so line 6 returns ∅ .

Assuming that the activities have already been sorted by finish times,

the running time of the call RECURSIVE-ACTIVITY-SELECTOR ( s,

f, 0, n) is Θ( n). To see why, observe that over all recursive calls, each activity is examined exactly once in the while loop test of line 2. In

particular, activity ai is examined in the last call made in which k < i.
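The procedure translates almost line for line into Python. A sketch under our own conventions: the arrays are padded so that activity indices run 1 to n, with the fictitious activity a0 at index 0 (so f[0] = 0), and the returned list holds the indices of the selected activities.

```python
def recursive_activity_selector(s, f, k, n):
    """Python transcription of RECURSIVE-ACTIVITY-SELECTOR.
    s and f are padded so that indices run 1..n, with the fictitious
    activity a_0 at index 0 (f[0] = 0). Returns selected indices."""
    m = k + 1
    # Find the first activity in S_k to finish.
    while m <= n and s[m] < f[k]:
        m += 1
    if m <= n:
        return [m] + recursive_activity_selector(s, f, m, n)
    return []
```

As in the text, the initial call that solves the whole problem is recursive_activity_selector(s, f, 0, n), and each activity is examined exactly once across all recursive calls.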

An iterative greedy algorithm

The recursive procedure can be converted to an iterative one because the

procedure RECURSIVE-ACTIVITY-SELECTOR is almost “tail

recursive” (see Problem 7-5): it ends with a recursive call to itself

followed by a union operation. It is usually a straightforward task to

transform a tail-recursive procedure to an iterative form. In fact, some

compilers for certain programming languages perform this task

automatically.

Figure 15.2 The operation of RECURSIVE-ACTIVITY-SELECTOR on the 11 activities from

Figure 15.1. Activities considered in each recursive call appear between horizontal lines. The fictitious activity a 0 finishes at time 0, and the initial call RECURSIVE-ACTIVITY-SELECTOR ( s, f, 0, 11) selects activity a 1. In each recursive call, the activities that have already been selected are blue, and the activity shown in tan is being considered. If the starting time of an activity occurs before the finish time of the most recently added activity (the arrow between them points left), it is rejected. Otherwise (the arrow points directly up or to the right), it is selected. The last recursive call, RECURSIVE-ACTIVITY-SELECTOR ( s, f, 11, 11), returns

∅ . The resulting set of selected activities is { a 1, a 4, a 8, a 11}.

The procedure GREEDY-ACTIVITY-SELECTOR is an iterative

version of the procedure RECURSIVE-ACTIVITY-SELECTOR. It,

too, assumes that the input activities are ordered by monotonically

increasing finish time. It collects selected activities into a set A and returns this set when it is done.

GREEDY-ACTIVITY-SELECTOR(s, f, n)

1  A = {a1}
2  k = 1
3  for m = 2 to n
4      if s[m] ≥ f[k]      // is am in Sk?
5          A = A ∪ {am}    // yes, so choose it
6          k = m           // and continue from there
7  return A

The procedure works as follows. The variable k indexes the most

recent addition to A, corresponding to the activity ak in the recursive version. Since the procedure considers the activities in order of

monotonically increasing finish time, fk is always the maximum finish

time of any activity in A. That is,

fk = max { fi : ai ∈ A}.    (15.3)

Lines 1–2 select activity a 1, initialize A to contain just this activity, and initialize k to index this activity. The for loop of lines 3–6 finds the earliest activity in Sk to finish. The loop considers each activity am in turn and adds am to A if it is compatible with all previously selected activities. Such an activity is the earliest in Sk to finish. To see whether

activity am is compatible with every activity currently in A, it suffices by equation (15.3) to check (in line 4) that its start time sm is not earlier

than the finish time fk of the activity most recently added to A. If activity am is compatible, then lines 5–6 add activity am to A and set k to m. The set A returned by the call GREEDY-ACTIVITY-

SELECTOR ( s, f, n) is precisely the set returned by the initial call RECURSIVE-ACTIVITY-SELECTOR ( s, f, 0, n).

Like the recursive version, GREEDY-ACTIVITY-SELECTOR

schedules a set of n activities in Θ( n) time, assuming that the activities were already sorted initially by their finish times.
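The iterative procedure is just as direct in Python. A minimal sketch under the same assumed conventions as before: arrays padded so activity indices run 1 to n, activities sorted by monotonically increasing finish time, and the returned list holding selected indices.

```python
def greedy_activity_selector(s, f):
    """Python transcription of GREEDY-ACTIVITY-SELECTOR. s and f are
    padded with a dummy entry at index 0 so activity indices run 1..n;
    activities are assumed sorted by increasing finish time."""
    n = len(s) - 1
    A = [1]                    # select a_1
    k = 1                      # index of the most recent addition to A
    for m in range(2, n + 1):
        if s[m] >= f[k]:       # is a_m in S_k?
            A.append(m)        # yes, so choose it
            k = m              # and continue from there
    return A
```

A single pass over the sorted activities makes the Θ(n) running time plain: line 4 performs one comparison per activity.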

Exercises

15.1-1

Give a dynamic-programming algorithm for the activity-selection

problem, based on recurrence (15.2). Have your algorithm compute the

sizes c[ i, j] as defined above and also produce the maximum-size subset of mutually compatible activities. Assume that the inputs have been

sorted as in equation (15.1). Compare the running time of your solution

to the running time of GREEDY-ACTIVITY-SELECTOR.

15.1-2

Suppose that instead of always selecting the first activity to finish, you

instead select the last activity to start that is compatible with all

previously selected activities. Describe how this approach is a greedy

algorithm, and prove that it yields an optimal solution.

15.1-3

Not just any greedy approach to the activity-selection problem produces

a maximum-size set of mutually compatible activities. Give an example

to show that the approach of selecting the activity of least duration

from among those that are compatible with previously selected activities

does not work. Do the same for the approaches of always selecting the

compatible activity that overlaps the fewest other remaining activities

and always selecting the compatible remaining activity with the earliest

start time.

15.1-4

You are given a set of activities to schedule among a large number of

lecture halls, where any activity can take place in any lecture hall. You

wish to schedule all the activities using as few lecture halls as possible.

Give an efficient greedy algorithm to determine which activity should

use which lecture hall.

(This problem is also known as the interval-graph coloring problem.

It is modeled by an interval graph whose vertices are the given activities

and whose edges connect incompatible activities. The smallest number

of colors required to color every vertex so that no two adjacent vertices

have the same color corresponds to finding the fewest lecture halls

needed to schedule all of the given activities.)

15.1-5

Consider a modification to the activity-selection problem in which each

activity ai has, in addition to a start and finish time, a value vi. The objective is no longer to maximize the number of activities scheduled,

but instead to maximize the total value of the activities scheduled. That

is, the goal is to choose a set A of compatible activities such that ∑ai∈A vi is maximized. Give a polynomial-time algorithm for this

problem.

15.2 Elements of the greedy strategy

A greedy algorithm obtains an optimal solution to a problem by

making a sequence of choices. At each decision point, the algorithm

makes the choice that seems best at the moment. This heuristic strategy

does not always produce an optimal solution, but as in the activity-

selection problem, sometimes it does. This section discusses some of the

general properties of greedy methods.

The process that we followed in Section 15.1 to develop a greedy algorithm was a bit more involved than is typical. It consisted of the

following steps:

1. Determine the optimal substructure of the problem.

2. Develop a recursive solution. (For the activity-selection problem,

we formulated recurrence (15.2), but bypassed developing a

recursive algorithm based solely on this recurrence.)

3. Show that if you make the greedy choice, then only one

subproblem remains.

4. Prove that it is always safe to make the greedy choice. (Steps 3

and 4 can occur in either order.)

5. Develop a recursive algorithm that implements the greedy

strategy.

6. Convert the recursive algorithm to an iterative algorithm.

These steps highlighted in great detail the dynamic-programming

underpinnings of a greedy algorithm. For example, the first cut at the

activity-selection problem defined the subproblems Sij, where both i and j varied. We then found that if you always make the greedy choice, you

can restrict the subproblems to be of the form Sk.

An alternative approach is to fashion optimal substructure with a

greedy choice in mind, so that the choice leaves just one subproblem to

solve. In the activity-selection problem, start by dropping the second

subscript and defining subproblems of the form Sk. Then prove that a

greedy choice (the first activity am to finish in Sk), combined with an optimal solution to the remaining set Sm of compatible activities, yields

an optimal solution to Sk. More generally, you can design greedy

algorithms according to the following sequence of steps:

1. Cast the optimization problem as one in which you make a

choice and are left with one subproblem to solve.

2. Prove that there is always an optimal solution to the original

problem that makes the greedy choice, so that the greedy choice

is always safe.

3. Demonstrate optimal substructure by showing that, having made

the greedy choice, what remains is a subproblem with the

property that if you combine an optimal solution to the

subproblem with the greedy choice you have made, you arrive at

an optimal solution to the original problem.

Later sections of this chapter will use this more direct process.

Nevertheless, beneath every greedy algorithm, there is almost always a

more cumbersome dynamic-programming solution.

How can you tell whether a greedy algorithm will solve a particular

optimization problem? No way works all the time, but the greedy-choice

property and optimal substructure are the two key ingredients. If you

can demonstrate that the problem has these properties, then you are well

on the way to developing a greedy algorithm for it.

Greedy-choice property

The first key ingredient is the greedy-choice property: you can assemble

a globally optimal solution by making locally optimal (greedy) choices.

In other words, when you are considering which choice to make, you

make the choice that looks best in the current problem, without

considering results from subproblems.

Here is where greedy algorithms differ from dynamic programming.

In dynamic programming, you make a choice at each step, but the

choice usually depends on the solutions to subproblems. Consequently,

you typically solve dynamic-programming problems in a bottom-up

manner, progressing from smaller subproblems to larger subproblems.

(Alternatively, you can solve them top down with memoization. Of course,

even though the code works top down, you still must solve the

subproblems before making a choice.) In a greedy algorithm, you make

whatever choice seems best at the moment and then solve the

subproblem that remains. The choice made by a greedy algorithm may

depend on choices so far, but it cannot depend on any future choices or

on the solutions to subproblems. Thus, unlike dynamic programming,

which solves the subproblems before making the first choice, a greedy

algorithm makes its first choice before solving any subproblems. A

dynamic-programming algorithm proceeds bottom up, whereas a greedy

strategy usually progresses top down, making one greedy choice after

another, reducing each given problem instance to a smaller one.

Of course, you need to prove that a greedy choice at each step yields

a globally optimal solution. Typically, as in the case of Theorem 15.1,

the proof examines a globally optimal solution to some subproblem. It

then shows how to modify the solution to substitute the greedy choice

for some other choice, resulting in one similar, but smaller, subproblem.

You can usually make the greedy choice more efficiently than when

you have to consider a wider set of choices. For example, in the activity-

selection problem, assuming that the activities were already sorted in

monotonically increasing order by finish times, each activity needed to

be examined just once. By preprocessing the input or by using an

appropriate data structure (often a priority queue), you often can make

greedy choices quickly, thus yielding an efficient algorithm.

Optimal substructure

As we saw in Chapter 14, a problem exhibits optimal substructure if an optimal solution to the problem contains within it optimal solutions to

subproblems. This property is a key ingredient of assessing whether

dynamic programming applies, and it’s also essential for greedy

algorithms. As an example of optimal substructure, recall how Section

15.1 demonstrated that if an optimal solution to subproblem Sij

includes an activity ak, then it must also contain optimal solutions to

the subproblems Sik and Skj. Given this optimal substructure, we argued that if you know which activity to use as ak, you can construct

an optimal solution to Sij by selecting ak along with all activities in optimal solutions to the subproblems Sik and Skj. This observation of

optimal substructure gave rise to the recurrence (15.2) that describes the

value of an optimal solution.

You will usually use a more direct approach regarding optimal

substructure when applying it to greedy algorithms. As mentioned

above, you have the luxury of assuming that you arrived at a

subproblem by having made the greedy choice in the original problem.

All you really need to do is argue that an optimal solution to the

subproblem, combined with the greedy choice already made, yields an

optimal solution to the original problem. This scheme implicitly uses

induction on the subproblems to prove that making the greedy choice at

every step produces an optimal solution.

Greedy versus dynamic programming

Because both the greedy and dynamic-programming strategies exploit

optimal substructure, you might be tempted to generate a dynamic-

programming solution to a problem when a greedy solution suffices or,

conversely, you might mistakenly think that a greedy solution works

when in fact a dynamic-programming solution is required. To illustrate

the subtle differences between the two techniques, let’s investigate two

variants of a classical optimization problem.

The 0-1 knapsack problem is the following. A thief robbing a store

wants to take the most valuable load that can be carried in a knapsack

capable of carrying at most W pounds of loot. The thief can choose to

take any subset of n items in the store. The i th item is worth vi dollars and weighs wi pounds, where vi and wi are integers. Which items should the thief take? (We call this the 0-1 knapsack problem because for each

item, the thief must either take it or leave it behind. The thief cannot

take a fractional amount of an item or take an item more than once.)

In the fractional knapsack problem, the setup is the same, but the

thief can take fractions of items, rather than having to make a binary (0-

1) choice for each item. You can think of an item in the 0-1 knapsack

problem as being like a gold ingot and an item in the fractional

knapsack problem as more like gold dust.

Both knapsack problems exhibit the optimal-substructure property.

For the 0-1 problem, if the most valuable load weighing at most W

pounds includes item j, then the remaining load must be the most

valuable load weighing at most Wwj pounds that the thief can take

from the n − 1 original items excluding item j. For the comparable fractional problem, if if the most valuable load weighing at most W

pounds includes weight w of item j, then the remaining load must be the most valuable load weighing at most Ww pounds that the thief can

take from the n − 1 original items plus wjw pounds of item j.

Although the problems are similar, a greedy strategy works to solve

the fractional knapsack problem, but not the 0-1 problem. To solve the

fractional problem, first compute the value per pound vi/ wi for each item. Obeying a greedy strategy, the thief begins by taking as much as

possible of the item with the greatest value per pound. If the supply of that item is exhausted and the thief can still carry more, then the thief

takes as much as possible of the item with the next greatest value per

pound, and so forth, until reaching the weight limit W. Thus, by sorting

the items by value per pound, the greedy algorithm runs in O( n lg n) time. You are asked to prove that the fractional knapsack problem has

the greedy-choice property in Exercise 15.2-1.
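The greedy strategy just described can be sketched in a few lines of Python. The function name and return convention are our own; the code sorts items by value per pound and takes each item whole until only a fraction of the next one fits.

```python
def fractional_knapsack(values, weights, W):
    """Greedy fractional knapsack: take items in order of decreasing
    value per pound, splitting the last item taken if necessary.
    Returns the maximum achievable total value."""
    # Sort item indices by value per pound, greatest first.
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    total, remaining = 0.0, W
    for i in order:
        if remaining == 0:
            break
        take = min(weights[i], remaining)   # whole item, or the fraction that fits
        total += values[i] * take / weights[i]
        remaining -= take
    return total
```

On the instance of Figure 15.3 (items worth $60, $100, $120 weighing 10, 20, and 30 pounds, W = 50), the loop takes items 1 and 2 whole and 20 pounds of item 3, for a total of $240.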

To see that this greedy strategy does not work for the 0-1 knapsack

problem, consider the problem instance illustrated in Figure 15.3(a).

This example has three items and a knapsack that can hold 50 pounds.

Item 1 weighs 10 pounds and is worth $60. Item 2 weighs 20 pounds

and is worth $100. Item 3 weighs 30 pounds and is worth $120. Thus,

the value per pound of item 1 is $6 per pound, which is greater than the

value per pound of either item 2 ($5 per pound) or item 3 ($4 per

pound). The greedy strategy, therefore, would take item 1 first. As you

can see from the case analysis in Figure 15.3(b), however, the optimal solution takes items 2 and 3, leaving item 1 behind. The two possible

solutions that take item 1 are both suboptimal.

For the comparable fractional problem, however, the greedy strategy,

which takes item 1 first, does yield an optimal solution, as shown in

Figure 15.3(c). Taking item 1 doesn’t work in the 0-1 problem, because the thief is unable to fill the knapsack to capacity, and the empty space

lowers the effective value per pound of the load. In the 0-1 problem,

when you consider whether to include an item in the knapsack, you

must compare the solution to the subproblem that includes the item

with the solution to the subproblem that excludes the item before you

can make the choice. The problem formulated in this way gives rise to

many overlapping subproblems—a hallmark of dynamic programming,

and indeed, as Exercise 15.2-2 asks you to show, you can use dynamic

programming to solve the 0-1 problem.

Figure 15.3 An example showing that the greedy strategy does not work for the 0-1 knapsack problem. (a) The thief must select a subset of the three items shown whose weight must not exceed 50 pounds. (b) The optimal subset includes items 2 and 3. Any solution with item 1 is suboptimal, even though item 1 has the greatest value per pound. (c) For the fractional knapsack problem, taking the items in order of greatest value per pound yields an optimal solution.

Exercises

15.2-1

Prove that the fractional knapsack problem has the greedy-choice

property.

15.2-2

Give a dynamic-programming solution to the 0-1 knapsack problem

that runs in O( n W) time, where n is the number of items and W is the maximum weight of items that the thief can put in the knapsack.

15.2-3

Suppose that in a 0-1 knapsack problem, the order of the items when

sorted by increasing weight is the same as their order when sorted by

decreasing value. Give an efficient algorithm to find an optimal solution

to this variant of the knapsack problem, and argue that your algorithm

is correct.

15.2-4

Professor Gekko has always dreamed of inline skating across North

Dakota. The professor plans to cross the state on highway U.S. 2, which

runs from Grand Forks, on the eastern border with Minnesota, to

Williston, near the western border with Montana. The professor can

carry two liters of water and can skate m miles before running out of

water. (Because North Dakota is relatively flat, the professor does not

have to worry about drinking water at a greater rate on uphill sections

than on flat or downhill sections.) The professor will start in Grand

Forks with two full liters of water. The professor has an official North

Dakota state map, which shows all the places along U.S. 2 to refill water

and the distances between these locations.

The professor’s goal is to minimize the number of water stops along

the route across the state. Give an efficient method by which the

professor can determine which water stops to make. Prove that your

strategy yields an optimal solution, and give its running time.

15.2-5

Describe an efficient algorithm that, given a set {x1, x2, … , xn} of points on the real line, determines the smallest set of unit-length closed

intervals that contains all of the given points. Argue that your algorithm

is correct.

15.2-6

Show how to solve the fractional knapsack problem in O( n) time.

15.2-7

You are given two sets A and B, each containing n positive integers. You can choose to reorder each set however you like. After reordering, let ai

be the ith element of set A, and let bi be the ith element of set B. You then receive a payoff of ∏_{i=1}^{n} ai^bi. Give an algorithm that maximizes your

payoff. Prove that your algorithm maximizes the payoff, and state its

running time, omitting the time for reordering the sets.

15.3 Huffman codes

Huffman codes compress data well: savings of 20% to 90% are typical,

depending on the characteristics of the data being compressed. The data

arrive as a sequence of characters. Huffman’s greedy algorithm uses a

table giving how often each character occurs (its frequency) to build up

an optimal way of representing each character as a binary string.

Suppose that you have a 100,000-character data file that you wish to

store compactly and you know that the 6 distinct characters in the file

occur with the frequencies given by Figure 15.4. The character a occurs 45,000 times, the character b occurs 13,000 times, and so on.

You have many options for how to represent such a file of

information. Here, we consider the problem of designing a binary

character code (or code for short) in which each character is represented by a unique binary string, which we call a codeword. If you use a fixed-length code, you need ⌈lg n⌉ bits to represent n ≥ 2 characters. For 6

characters, therefore, you need 3 bits: a = 000, b = 001, c = 010, d =

011, e = 100, and f = 101. This method requires 300,000 bits to encode

the entire file. Can you do better?

Figure 15.4 A character-coding problem. A data file of 100,000 characters contains only the characters a–f, with the frequencies indicated. With each character represented by a 3-bit codeword, encoding the file requires 300,000 bits. With the variable-length code shown, the encoding requires only 224,000 bits.

A variable-length code can do considerably better than a fixed-length

code. The idea is simple: give frequent characters short codewords and

infrequent characters long codewords. Figure 15.4 shows such a code.

Here, the 1-bit string 0 represents a, and the 4-bit string 1100 represents

f. This code requires

(45 · 1 + 13 · 3 + 12 · 3 + 16 · 3 + 9 · 4 + 5 · 4) · 1,000 = 224,000 bits

to represent the file, a savings of approximately 25%. In fact, this is an

optimal character code for this file, as we shall see.
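The arithmetic above is easy to check directly. A minimal sketch (frequencies in thousands and codeword lengths taken from Figure 15.4; the helper name `encoded_kbits` is ours):

```python
# Frequencies in thousands (Figure 15.4) and codeword lengths for the
# fixed-length and variable-length codes.
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
fixed_len = {c: 3 for c in freq}                        # 3 bits per character
var_len = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 4, "f": 4}

def encoded_kbits(freq, lengths):
    """Total encoded file size, in thousands of bits."""
    return sum(freq[c] * lengths[c] for c in freq)

print(encoded_kbits(freq, fixed_len))  # 300, i.e., 300,000 bits
print(encoded_kbits(freq, var_len))    # 224, i.e., 224,000 bits
```

The savings is (300 − 224)/300 ≈ 25%, as claimed.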

Prefix-free codes

We consider here only codes in which no codeword is also a prefix of

some other codeword. Such codes are called prefix-free codes. Although

we won’t prove it here, a prefix-free code can always achieve the optimal

data compression achievable by any character code, and so we suffer no loss of

generality by restricting our attention to prefix-free codes.

Encoding is always simple for any binary character code: just

concatenate the codewords representing each character of the file. For

example, with the variable-length prefix-free code of Figure 15.4, the 4-character file face has the encoding 1100 · 0 · 100 · 1101 =

110001001101, where “·” denotes concatenation.

Prefix-free codes are desirable because they simplify decoding. Since

no codeword is a prefix of any other, the codeword that begins an

encoded file is unambiguous. You can simply identify the initial

codeword, translate it back to the original character, and repeat the

decoding process on the remainder of the encoded file. In our example,

the string 100011001101 parses uniquely as 100 · 0 · 1100 · 1101, which

decodes to cafe.
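This greedy parse takes only a few lines of code. A sketch using the variable-length code of Figure 15.4 (the function name `decode` is ours, not the book's):

```python
# Variable-length prefix-free code of Figure 15.4.
code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}
decode_table = {w: c for c, w in code.items()}

def decode(bits):
    """Repeatedly strip the codeword that begins the remaining input.
    Because the code is prefix-free, the first match is unambiguous."""
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in decode_table:
            out.append(decode_table[current])
            current = ""
    assert current == "", "input ended in the middle of a codeword"
    return "".join(out)

print(decode("100011001101"))  # cafe
```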

Figure 15.5 Trees corresponding to the coding schemes in Figure 15.4. Each leaf is labeled with a character and its frequency of occurrence. Each internal node is labeled with the sum of the frequencies of the leaves in its subtree. All frequencies are in thousands. (a) The tree corresponding to the fixed-length code a = 000, b = 001, c = 010, d = 011, e = 100, f = 101. (b) The tree corresponding to the optimal prefix-free code a = 0, b = 101, c = 100, d = 111, e = 1101, f = 1100.

The decoding process needs a convenient representation for the

prefix-free code so that you can easily pick off the initial codeword. A

binary tree whose leaves are the given characters provides one such

representation. Interpret the binary codeword for a character as the

simple path from the root to that character, where 0 means “go to the

left child” and 1 means “go to the right child.” Figure 15.5 shows the trees for the two codes of our example. Note that these are not binary

search trees, since the leaves need not appear in sorted order and

internal nodes do not contain character keys.

An optimal code for a file is always represented by a full binary tree,

in which every nonleaf node has two children (see Exercise 15.3-2). The

fixed-length code in our example is not optimal since its tree, shown in

Figure 15.5(a), is not a full binary tree: it contains codewords beginning with 10, but none beginning with 11. Since we can now restrict our

attention to full binary trees, we can say that if C is the alphabet from

which the characters are drawn and all character frequencies are

positive, then the tree for an optimal prefix-free code has exactly | C |

leaves, one for each letter of the alphabet, and exactly | C | − 1 internal

nodes (see Exercise B.5-3 on page 1175).

Given a tree T corresponding to a prefix-free code, we can compute

the number of bits required to encode a file. For each character c in the

alphabet C, let the attribute c.freq denote the frequency of c in the file and let dT(c) denote the depth of c's leaf in the tree. Note that dT(c) is also the length of the codeword for character c. The number of bits required to encode a file is thus

B(T) = Σ_{c ∈ C} c.freq · dT(c),      (15.4)

which we define as the cost of the tree T.
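The cost can be computed by a walk of the code tree. A sketch, with assumptions flagged: the nested-tuple representation and the function name `cost` are ours; the tree shown is the optimal tree of Figure 15.5(b):

```python
# Leaves are (char, freq) pairs; internal nodes are (left, right) pairs.
def cost(node, depth=0):
    """Sum of freq * depth over all leaves: the bits needed to encode the file."""
    if isinstance(node[0], str):               # leaf: (char, freq)
        return node[1] * depth
    left, right = node
    return cost(left, depth + 1) + cost(right, depth + 1)

# The optimal tree of Figure 15.5(b); frequencies in thousands.
tree = (("a", 45),
        ((("c", 12), ("b", 13)),
         ((("f", 5), ("e", 9)), ("d", 16))))
print(cost(tree))  # 224 (thousands of bits)
```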

Constructing a Huffman code

Huffman invented a greedy algorithm that constructs an optimal prefix-

free code, called a Huffman code in his honor. In line with our

observations in Section 15.2, its proof of correctness relies on the greedy-choice property and optimal substructure. Rather than

demonstrating that these properties hold and then developing

pseudocode, we present the pseudocode first. Doing so will help clarify how the algorithm makes greedy choices.

The procedure HUFFMAN assumes that C is a set of n characters

and that each character c ∈ C is an object with an attribute c.freq giving its frequency. The algorithm builds the tree T corresponding to an

optimal code in a bottom-up manner. It begins with a set of | C | leaves

and performs a sequence of | C | − 1 “merging” operations to create the

final tree. The algorithm uses a min-priority queue Q, keyed on the freq

attribute, to identify the two least-frequent objects to merge together.

The result of merging two objects is a new object whose frequency is the

sum of the frequencies of the two objects that were merged.

HUFFMAN(C)

1   n = |C|
2   Q = C
3   for i = 1 to n − 1
4       allocate a new node z
5       x = EXTRACT-MIN(Q)
6       y = EXTRACT-MIN(Q)
7       z.left = x
8       z.right = y
9       z.freq = x.freq + y.freq
10      INSERT(Q, z)
11  return EXTRACT-MIN(Q)   // the root of the tree is the only node left

For our example, Huffman’s algorithm proceeds as shown in Figure

15.6. Since the alphabet contains 6 letters, the initial queue size is n = 6,

and 5 merge steps build the tree. The final tree represents the optimal

prefix-free code. The codeword for a letter is the sequence of edge labels

on the simple path from the root to the letter.

Figure 15.6 The steps of Huffman’s algorithm for the frequencies given in Figure 15.4. Each part shows the contents of the queue sorted into increasing order by frequency. Each step merges the two trees with the lowest frequencies. Leaves are shown as rectangles containing a character and its frequency. Internal nodes are shown as circles containing the sum of the frequencies of their children. An edge connecting an internal node with its children is labeled 0 if it is an edge to a left child and 1 if it is an edge to a right child. The codeword for a letter is the sequence of labels on the edges connecting the root to the leaf for that letter. (a) The initial set of n = 6 nodes, one for each letter. (b)–(e) Intermediate stages. (f) The final tree.

The HUFFMAN procedure works as follows. Line 2 initializes the

min-priority queue Q with the characters in C. The for loop in lines 3–

10 repeatedly extracts the two nodes x and y of lowest frequency from

the queue and replaces them in the queue with a new node z

representing their merger. The frequency of z is computed as the sum of

the frequencies of x and y in line 9. The node z has x as its left child and y as its right child. (This order is arbitrary. Switching the left and right

child of any node yields a different code of the same cost.) After n − 1

mergers, line 11 returns the one node left in the queue, which is the root of the code tree.

The algorithm produces the same result without the variables x and

y, assigning the values returned by the EXTRACT-MIN calls directly

to z. left and z. right in lines 7 and 8, and changing line 9 to z. freq =

z. left. freq+ z. right. freq. We’ll use the node names x and y in the proof of correctness, however, so we leave them in.

The running time of Huffman’s algorithm depends on how the min-

priority queue Q is implemented. Let’s assume that it’s implemented as

a binary min-heap (see Chapter 6). For a set C of n characters, the BUILD-MIN-HEAP procedure discussed in Section 6.3 can initialize Q

in line 2 in O( n) time. The for loop in lines 3–10 executes exactly n − 1

times, and since each heap operation runs in O(lg n) time, the loop contributes O( n lg n) to the running time. Thus, the total running time of HUFFMAN on a set of n characters is O( n lg n).
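The procedure translates directly into runnable form. A sketch assuming Python's heapq as the min-priority queue (the tie-breaking counter is our implementation detail, keeping heap comparisons off the node objects; it is not part of the book's pseudocode):

```python
import heapq
from itertools import count

def huffman(freqs):
    """Build a Huffman tree for {char: freq}. Leaves are (char, freq)
    tuples; internal nodes are (left, right, freq) tuples."""
    tiebreak = count()                          # breaks frequency ties in the heap
    q = [(f, next(tiebreak), (c, f)) for c, f in freqs.items()]
    heapq.heapify(q)                            # O(n), like BUILD-MIN-HEAP
    for _ in range(len(freqs) - 1):             # n - 1 merging operations
        fx, _, x = heapq.heappop(q)             # the two least-frequent nodes
        fy, _, y = heapq.heappop(q)
        heapq.heappush(q, (fx + fy, next(tiebreak), (x, y, fx + fy)))
    return q[0][2]                              # the root is the only node left

def codewords(node, prefix=""):
    """Read codewords off the tree: 0 = left edge, 1 = right edge."""
    if len(node) == 2:                          # leaf: (char, freq)
        return {node[0]: prefix or "0"}
    left, right, _ = node
    return {**codewords(left, prefix + "0"), **codewords(right, prefix + "1")}

freqs = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
cw = codewords(huffman(freqs))
print(sum(freqs[c] * len(w) for c, w in cw.items()))  # 224 (thousands of bits)
```

The exact codewords depend on how ties are broken, but every Huffman tree for these frequencies has the optimal cost of 224,000 bits.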

Correctness of Huffman’s algorithm

To prove that the greedy algorithm HUFFMAN is correct, we’ll show

that the problem of determining an optimal prefix-free code exhibits the

greedy-choice and optimal-substructure properties. The next lemma

shows that the greedy-choice property holds.

Lemma 15.2 (Optimal prefix-free codes have the greedy-choice property)

Let C be an alphabet in which each character c ∈ C has frequency c.freq. Let x and y be two characters in C having the lowest frequencies.

Then there exists an optimal prefix-free code for C in which the

codewords for x and y have the same length and differ only in the last

bit.

Proof The idea of the proof is to take the tree T representing an arbitrary optimal prefix-free code and modify it to make a tree

representing another optimal prefix-free code such that the characters x

and y appear as sibling leaves of maximum depth in the new tree. In such a tree, the codewords for x and y have the same length and differ

only in the last bit.

Let a and b be any two characters that are sibling leaves of maximum

depth in T. Without loss of generality, assume that a.freq ≤ b.freq and x.freq ≤ y.freq. Since x.freq and y.freq are the two lowest leaf frequencies, in order, and a.freq and b.freq are two arbitrary frequencies, in order, we have x.freq ≤ a.freq and y.freq ≤ b.freq.

In the remainder of the proof, it is possible that we could have x. freq

= a. freq or y. freq = b. freq, but x. freq = b. freq implies that a. freq = b. freq

= x.freq = y.freq (see Exercise 15.3-1), and the lemma would be trivially true. Therefore, assume that x.freq ≠ b.freq, which means that x ≠ b.

Figure 15.7 An illustration of the key step in the proof of Lemma 15.2. In the optimal tree T, leaves a and b are two siblings of maximum depth. Leaves x and y are the two characters with the lowest frequencies. They appear in arbitrary positions in T. Assuming that x ≠ b, swapping leaves a and x produces tree T′, and then swapping leaves b and y produces tree T″. Since each swap does not increase the cost, the resulting tree T″ is also an optimal tree.

As Figure 15.7 shows, imagine exchanging the positions in T of a and x to produce a tree T′, and then exchanging the positions in T′ of b and y to produce a tree T″ in which x and y are sibling leaves of maximum depth. (Note that if x = b but y ≠ a, then tree T″ does not have x and y as sibling leaves of maximum depth. Because we assume

that x ≠ b, this situation cannot occur.) By equation (15.4), the difference in cost between T and T′ is

B(T) − B(T′) = Σ_{c ∈ C} c.freq · dT(c) − Σ_{c ∈ C} c.freq · dT′(c)
             = x.freq · dT(x) + a.freq · dT(a) − x.freq · dT′(x) − a.freq · dT′(a)
             = x.freq · dT(x) + a.freq · dT(a) − x.freq · dT(a) − a.freq · dT(x)
             = (a.freq − x.freq)(dT(a) − dT(x))
             ≥ 0,

because both a.freq − x.freq and dT(a) − dT(x) are nonnegative. More specifically, a.freq − x.freq is nonnegative because x is a minimum-frequency leaf, and dT(a) − dT(x) is nonnegative because a is a leaf of maximum depth in T. Similarly, exchanging y and b does not increase the cost, and so B(T′) − B(T″) is nonnegative. Therefore, B(T″) ≤ B(T′) ≤ B(T), and since T is optimal, we have B(T) ≤ B(T″), which implies B(T″) = B(T). Thus, T″ is an optimal tree in which x and y appear as sibling leaves of maximum depth, from which the lemma follows.

Lemma 15.2 implies that the process of building up an optimal tree

by mergers can, without loss of generality, begin with the greedy choice

of merging together those two characters of lowest frequency. Why is

this a greedy choice? We can view the cost of a single merger as being

the sum of the frequencies of the two items being merged. Exercise 15.3-

4 shows that the total cost of the tree constructed equals the sum of the

costs of its mergers. Of all possible mergers at each step, HUFFMAN

chooses the one that incurs the least cost.

The next lemma shows that the problem of constructing optimal

prefix-free codes has the optimal-substructure property.

Lemma 15.3 (Optimal prefix-free codes have the optimal-substructure

property)

Let C be a given alphabet with frequency c.freq defined for each character c ∈ C. Let x and y be two characters in C with minimum frequency. Let C′ be the alphabet C with the characters x and y removed and a new character z added, so that C′ = (C − {x, y}) ∪ {z}. Define freq for all characters in C′ with the same values as in C, along with z.freq = x.freq + y.freq. Let T′ be any tree representing an optimal prefix-free code for alphabet C′. Then the tree T, obtained from T′ by replacing the leaf node for z with an internal node having x and y as children, represents an optimal prefix-free code for the alphabet C.

Proof We first show how to express the cost B(T) of tree T in terms of the cost B(T′) of tree T′, by considering the component costs in equation (15.4). For each character c ∈ C − {x, y}, we have that dT(c) = dT′(c), and hence c.freq · dT(c) = c.freq · dT′(c). Since dT(x) = dT(y) = dT′(z) + 1, we have

x.freq · dT(x) + y.freq · dT(y) = (x.freq + y.freq)(dT′(z) + 1)
                                = z.freq · dT′(z) + (x.freq + y.freq),

from which we conclude that

B(T) = B(T′) + x.freq + y.freq

or, equivalently,

B(T′) = B(T) − x.freq − y.freq.

We now prove the lemma by contradiction. Suppose that T does not

represent an optimal prefix-free code for C. Then there exists an optimal

tree T″ such that B( T″) < B( T). Without loss of generality (by Lemma 15.2), T″ has x and y as siblings. Let T″′ be the tree T″ with the common parent of x and y replaced by a leaf z with frequency z. freq = x. freq +

y. freq. Then

B(T‴) = B(T″) − x.freq − y.freq
      < B(T) − x.freq − y.freq
      = B(T′),

yielding a contradiction to the assumption that T′ represents an optimal

prefix-free code for C′. Thus, T must represent an optimal prefix-free code for the alphabet C.

Theorem 15.4

Procedure HUFFMAN produces an optimal prefix-free code.

Proof Immediate from Lemmas 15.2 and 15.3.

Exercises

15.3-1

Explain why, in the proof of Lemma 15.2, if x. freq = b. freq, then we must have a. freq = b. freq = x. freq = y. freq.

15.3-2

Prove that a non-full binary tree cannot correspond to an optimal

prefix-free code.

15.3-3

What is an optimal Huffman code for the following set of frequencies,

based on the first 8 Fibonacci numbers?

a:1 b:1 c:2 d:3 e:5 f:8 g:13 h:21

Can you generalize your answer to find the optimal code when the

frequencies are the first n Fibonacci numbers?

15.3-4

Prove that the total cost B( T) of a full binary tree T for a code equals the sum, over all internal nodes, of the combined frequencies of the two

children of the node.

15.3-5

Given an optimal prefix-free code on a set C of n characters, you wish to transmit the code itself using as few bits as possible. Show how to

represent any optimal prefix-free code on C using only 2n − 1 + n⌈lg n⌉ bits. (Hint: Use 2n − 1 bits to specify the structure of the tree, as discovered by a walk of the tree.)

15.3-6

Generalize Huffman’s algorithm to ternary codewords (i.e., codewords

using the symbols 0, 1, and 2), and prove that it yields optimal ternary

codes.

15.3-7

A data file contains a sequence of 8-bit characters such that all 256

characters are about equally common: the maximum character

frequency is less than twice the minimum character frequency. Prove

that Huffman coding in this case is no more efficient than using an ordinary 8-bit fixed-length code.

15.3-8

Show that no lossless (invertible) compression scheme can guarantee

that for every input file, the corresponding output file is shorter. ( Hint:

Compare the number of possible files with the number of possible

encoded files.)

15.4 Offline caching

Computer systems can decrease the time to access data by storing a

subset of the main memory in the cache: a small but faster memory. A

cache organizes data into cache blocks typically comprising 32, 64, or

128 bytes. You can also think of main memory as a cache for disk-

resident data in a virtual-memory system. Here, the blocks are called

pages, and 4096 bytes is a typical size.

As a computer program executes, it makes a sequence of memory

requests. Say that there are n memory requests, to data in blocks b1, b2,

… , bn, in that order. The blocks in the access sequence might not be

distinct, and indeed, any given block is usually accessed multiple times.

For example, a program that accesses four distinct blocks p, q, r, s might make a sequence of requests to blocks s, q, s, q, q, s, p, p, r, s, s, q, p, r, q. The cache can hold up to some fixed number k of cache blocks. It starts out empty before the first request. Each request causes at most

one block to enter the cache and at most one block to be evicted from

the cache. Upon a request for block bi, any one of three scenarios may

occur:

1. Block bi is already in the cache, due to a previous request for the

same block. The cache remains unchanged. This situation is

known as a cache hit.

2. Block bi is not in the cache at that time, but the cache contains

fewer than k blocks. In this case, block bi is placed into the

cache, so that the cache contains one more block than it did before the request.

3. Block bi is not in the cache at that time and the cache is full: it

contains k blocks. Block bi is placed into the cache, but before

that happens, some other block in the cache must be evicted

from the cache in order to make room.

The latter two situations, in which the requested block is not already

in the cache, are called cache misses. The goal is to minimize the number

of cache misses or, equivalently, to maximize the number of cache hits,

over the entire sequence of n requests. A cache miss that occurs while

the cache holds fewer than k blocks—that is, as the cache is first being

filled up—is known as a compulsory miss, since no prior decision could

have kept the requested block in the cache. When a cache miss occurs

and the cache is full, ideally the choice of which block to evict should

allow for the smallest possible number of cache misses over the entire

sequence of future requests.

Typically, caching is an online problem. That is, the computer has to

decide which blocks to keep in the cache without knowing the future

requests. Here, however, let’s consider the offline version of this

problem, in which the computer knows in advance the entire sequence

of n requests and the cache size k, with a goal of minimizing the total number of cache misses.

To solve this offline problem, you can use a greedy strategy called

furthest-in-future, which chooses to evict the block in the cache whose

next access in the request sequence comes furthest in the future.

Intuitively, this strategy makes sense: if you’re not going to need

something for a while, why keep it around? We’ll show that the furthest-

in-future strategy is indeed optimal by showing that the offline caching

problem exhibits optimal substructure and that furthest-in-future has

the greedy-choice property.
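The strategy is easy to simulate once the whole request sequence is known. A sketch (the function name `furthest_in_future` is ours), run on the example sequence for blocks p, q, r, s given earlier with k = 3:

```python
def furthest_in_future(requests, k):
    """Count cache misses when evicting the block whose next use is furthest away."""
    cache, misses = set(), 0
    for i, block in enumerate(requests):
        if block in cache:                      # cache hit: nothing changes
            continue
        misses += 1                             # compulsory or capacity miss
        if len(cache) == k:                     # full: must evict
            future = requests[i + 1:]
            def next_use(b):                    # position of b's next request,
                return future.index(b) if b in future else len(future)  # or "never"
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses

# The request sequence s, q, s, q, q, s, p, p, r, s, s, q, p, r, q with k = 3.
reqs = list("sqsqqspprssqprq")
print(furthest_in_future(reqs, 3))  # 5 misses
```

On this sequence the rule evicts p when r is first requested (p's next use is furthest away) and evicts s when p is requested again (s is never used again), for 5 misses in total.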

Now, you might be thinking that since the computer usually doesn’t

know the sequence of requests in advance, there is no point in studying

the offline problem. Actually, there is. In some situations, you do know

the sequence of requests in advance. For example, if you view the main

memory as the cache and the full set of data as residing on disk (or a solid-state drive), there are algorithms that plan out the entire set of

reads and writes in advance. Furthermore, we can use the number of

cache misses produced by an optimal algorithm as a baseline for

comparing how well online algorithms perform. We’ll do just that in

Section 27.3.

Offline caching can even model real-world problems. For example,

consider a scenario where you know in advance a fixed schedule of n

events at known locations. Events may occur at a location multiple

times, not necessarily consecutively. You are managing a group of k

agents, you need to ensure that you have one agent at each location

when an event occurs, and you want to minimize the number of times

that agents have to move. Here, the agents are like the blocks, the events

are like the requests, and moving an agent is akin to a cache miss.

Optimal substructure of offline caching

To show that the offline problem exhibits optimal substructure, let’s

define the subproblem ( C, i) as processing requests for blocks bi, bi+1,

… , bn with cache configuration C at the time that the request for block bi occurs, that is, C is a subset of the set of blocks such that | C | ≤ k. A solution to subproblem ( C, i) is a sequence of decisions that specifies which block to evict (if any) upon each request for blocks bi, bi+1, … , bn. An optimal solution to subproblem ( C, i) minimizes the number of cache misses.

Consider an optimal solution S to subproblem ( C, i), and let C′ be the contents of the cache after processing the request for block bi in solution S. Let S′ be the subsolution of S for the resulting subproblem ( C′, i + 1). If the request for bi results in a cache hit, then the cache remains unchanged, so that C′ = C. If the request for block bi results in a cache miss, then the contents of the cache change, so that C′ ≠ C. We claim that in either case, S′ is an optimal solution to subproblem ( C′, i +

1). Why? If S′ is not an optimal solution to subproblem ( C′, i + 1), then there exists another solution S″ to subproblem ( C′, i + 1) that makes fewer cache misses than S′. Combining S″ with the decision of S at the

request for block bi yields another solution that makes fewer cache

misses than S, which contradicts the assumption that S is an optimal solution to subproblem ( C, i).

To quantify a recursive solution, we need a little more notation. Let

RC,i be the set of all cache configurations that can immediately follow configuration C after processing a request for block bi. If the request results in a cache hit, then the cache remains unchanged, so that RC,i = {C}. If the request for bi results in a cache miss, then there are two possibilities. If the cache is not full (|C| < k), then the cache is filling up and the only choice is to insert bi into the cache, so that RC,i = {C ∪ {bi}}. If the cache is full (|C| = k) upon a cache miss, then RC,i contains k potential configurations: one for each candidate block in C that could be evicted and replaced by block bi. In this case, RC,i = {(C − {x}) ∪ {bi} : x ∈ C}. For example, if C = {p, q, r}, k = 3, and block s is requested, then RC,i = {{p, q, s}, {p, r, s}, {q, r, s}}.
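The three cases for RC,i translate directly into a short sketch (the helper name `successors` is ours):

```python
def successors(cache, k, block):
    """The set R_{C,i}: configurations that can follow `cache` on a request for `block`."""
    if block in cache:                    # cache hit: configuration unchanged
        return [set(cache)]
    if len(cache) < k:                    # not full: the only choice is to insert
        return [cache | {block}]
    return [(cache - {x}) | {block} for x in cache]   # one per candidate eviction

print(sorted(sorted(s) for s in successors({"p", "q", "r"}, 3, "s")))
# [['p', 'q', 's'], ['p', 'r', 's'], ['q', 'r', 's']]
```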

Let miss(C, i) denote the minimum number of cache misses in a solution for subproblem (C, i), with miss(C, n + 1) = 0, since no requests remain. Here is a recurrence for miss(C, i):

miss(C, i) = miss(C, i + 1)                            if bi ∈ C (a cache hit),
miss(C, i) = 1 + min {miss(C′, i + 1) : C′ ∈ RC,i}     if bi ∉ C (a cache miss).

Greedy-choice property

To prove that the furthest-in-future strategy yields an optimal solution,

we need to show that optimal offline caching exhibits the greedy-choice

property. Combined with the optimal-substructure property, the greedy-

choice property will prove that furthest-in-future produces the

minimum possible number of cache misses.

Theorem 15.5 (Optimal offline caching has the greedy-choice property)

Consider a subproblem ( C, i) when the cache C contains k blocks, so that it is full, and a cache miss occurs. When block bi is requested, let z

= bm be the block in C whose next access is furthest in the future. (If some block in the cache will never again be referenced, then consider

any such block to be block z, and add a dummy request for block z =

bm = bn+1.) Then evicting block z upon a request for block bi is included in some optimal solution for the subproblem ( C, i).

Proof Let S be an optimal solution to subproblem ( C, i). If S evicts block z upon the request for block bi, then we are done, since we have

shown that some optimal solution includes evicting z.

So now suppose that optimal solution S evicts some other block x

when block bi is requested. We’ll construct another solution S′ to subproblem ( C, i) which, upon the request for bi, evicts block z instead of x and induces no more cache misses than S does, so that S′ is also optimal. Because different solutions may yield different cache

configurations, denote by CS, j the configuration of the cache under solution S just before the request for some block bj, and likewise for solution S′ and CS′, j. We’ll show how to construct S′ with the following properties:

1. For j = i + 1, … , m, let Dj = CS,j ∩ CS′,j. Then |Dj| ≥ k − 1, so that the cache configurations CS,j and CS′,j differ by at most one block. If they differ, then CS,j = Dj ∪ {z} and CS′,j = Dj ∪ {y} for some block y ≠ z.

2. For each request of blocks bi, … , bm−1, if solution S has a cache hit, then solution S′ also has a cache hit.

3. For all j > m, the cache configurations CS, j and CS′, j are identical.

4. Over the sequence of requests for blocks bi, … , bm, the number

of cache misses produced by solution S′ is at most the number of

cache misses produced by solution S.

We’ll prove inductively that these properties hold for each request.

1. We proceed by induction on j, for j = i +1, … , m. For the base case, the initial caches CS, i and CS′, i are identical. Upon the request for block bi, solution S evicts x and solution S′ evicts z.

Thus, cache configurations CS, i+1 and CS′, i+1 differ by just one block, CS, i+1 = Di+1 ∪ { z}, CS′, i+1 = Di+1 ∪ { x}, and xz.

The inductive step defines how solution S′ behaves upon a

request for block bj for i + 1 ≤ j ≤ m − 1. The inductive hypothesis is that property 1 holds when bj is requested. Because z = bm is the block in CS,i whose next reference is furthest in the future, we know that bj ≠ z. We consider several scenarios:

If CS, j = CS′, j (so that | Dj | = k), then solution S′ makes the same decision upon the request for bj as S makes, so

that CS, j+1 = CS′, j+1.

If |Dj| = k − 1 and bj ∈ Dj, then both caches already contain block bj, and both solutions S and S′ have cache hits. Therefore, CS,j+1 = CS,j and CS′,j+1 = CS′,j.

If |Dj| = k − 1 and bj ∉ Dj, then because CS,j = Dj ∪ {z} and bj ≠ z, solution S has a cache miss. It evicts either block z or some block w ∈ Dj.

If solution S evicts block z, then CS, j+1 = Dj ∪ { bj}.

There are two cases, depending on whether bj = y:

If bj = y, then solution S′ has a cache hit, so

that CS′, j+1 = CS′, j = Dj ∪ { bj}. Thus, CS, j+1

= CS′, j +1.

If bj ≠ y, then solution S′ has a cache miss. It

evicts block y, so that CS′, j+1 = Dj ∪ { bj }, and again CS, j+1 = CS′, j+1.

If solution S evicts some block w ∈ Dj, then CS,j+1

= ( Dj − { w}) ∪ { bj, z}. Once again, there are two

cases, depending on whether bj = y:

If bj = y, then solution S′ has a cache hit, so

that CS′,j+1 = CS′,j = Dj ∪ {bj}. Since w ∈ Dj and w was not evicted by solution S′, we have w ∈ CS′,j+1. Therefore, w ∉ Dj+1 and bj ∈ Dj+1, so that Dj+1 = (Dj − {w}) ∪ {bj}. Thus, CS,j+1 = Dj+1 ∪ {z}, CS′,j+1 = Dj+1 ∪ {w}, and because w ≠ z, property 1 holds when

block bj+1 is requested. (In other words, block

w replaces block y in property 1.)

If bj ≠ y, then solution S′ has a cache miss. It

evicts block w, so that CS′, j +1 = ( Dj − { w}) ∪

{bj, y}. Therefore, we have that Dj+1 = (Dj − {w}) ∪ {bj}, and so CS,j+1 = Dj+1 ∪ {z} and CS′,j+1 = Dj+1 ∪ {y}.

2. In the above discussion about maintaining property 1, solution S

may have a cache hit in only the first two cases, and solution S