races, the P-MAT-VEC-WRONG procedure on the next page is a faulty parallel implementation of matrix-vector multiplication that achieves a span of Θ(lg n) by parallelizing the inner for loop. This procedure is incorrect, unfortunately, due to determinacy races when updating yi in line 3, which executes in parallel for all n values of j.

Index variables of parallel for loops, such as i in line 1 and j in line 2, do not cause races between iterations. Conceptually, each iteration of the loop creates an independent variable to hold the index of that iteration during that iteration's execution of the loop body. Even if two parallel iterations both access the same index variable, they really are accessing different variable instances, hence different memory locations, and no race occurs.

P-MAT-VEC-WRONG(A, x, y, n)
1  parallel for i = 1 to n
2      parallel for j = 1 to n
3          yi = yi + aij · xj    // determinacy race
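To make the contrast concrete, here is a small Python sketch (my own illustration, not from the text) of the race-free alternative: parallelize only the outer loop, so that each task owns a distinct entry y[i] and no two tasks ever write the same memory location.

```python
from concurrent.futures import ThreadPoolExecutor

def p_mat_vec(A, x, y, n):
    """Compute y = y + A*x, parallelizing only the outer loop."""
    def row(i):
        # This task is the sole writer of y[i], so there is no
        # determinacy race, unlike parallelizing the inner j loop.
        for j in range(n):
            y[i] += A[i][j] * x[j]
    with ThreadPoolExecutor() as pool:
        list(pool.map(row, range(n)))
```

This version gives up the inner loop's parallelism (its span is Θ(n) rather than Θ(lg n)) in exchange for correctness.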

A parallel algorithm with races can sometimes be deterministic. As an example, two parallel threads might store the same value into a shared variable, and it wouldn't matter which stored the value first. For simplicity, however, we generally prefer code without determinacy races, even if the races are benign. And good parallel programmers frown on code with determinacy races that cause nondeterministic behavior, if deterministic code that performs comparably is an option.

But nondeterministic code does have its place. For example, you can't implement a parallel hash table, a highly practical data structure, without writing code containing determinacy races. Much research has centered around how to extend the fork-join model to incorporate limited "structured" nondeterminism while avoiding the full measure of complications that arise when nondeterminism is completely unrestricted.

A chess lesson

To illustrate the power of work/span analysis, this section closes with a true story that occurred during the development of one of the first world-class parallel chess-playing programs [106] many years ago. The timings below have been simplified for exposition.

The chess program was developed and tested on a 32-processor computer, but it was designed to run on a supercomputer with 512 processors. Since supercomputer availability was limited and expensive, the developers ran benchmarks on the small computer and extrapolated performance to the large computer.

At one point, the developers incorporated an optimization into the program that reduced its running time on an important benchmark on the small machine from T32 = 65 seconds to T′32 = 40 seconds. Yet, the developers used the work and span performance measures to conclude that the optimized version, which was faster on 32 processors, would actually be slower than the original version on the 512 processors of the large machine. As a result, they abandoned the "optimization."

Here is their work/span analysis. The original version of the program had work T1 = 2048 seconds and span T∞ = 1 second. Let's treat inequality (26.4) on page 760 as the equation TP = T1/P + T∞, which we can use as an approximation to the running time on P processors. Then indeed we have T32 = 2048/32 + 1 = 65. With the optimization, the work becomes T′1 = 1024 seconds, and the span becomes T′∞ = 8 seconds. Our approximation gives T′32 = 1024/32 + 8 = 40.

The relative speeds of the two versions switch when we estimate their running times on 512 processors, however. The first version has a running time of T512 = 2048/512 + 1 = 5 seconds, and the second version runs in T′512 = 1024/512 + 8 = 10 seconds. The optimization that speeds up the program on 32 processors makes the program run for twice as long on 512 processors! The optimized version's span of 8, which is not the dominant term in the running time on 32 processors, becomes the dominant term on 512 processors, nullifying the advantage from using more processors. The optimization does not scale up.
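The extrapolation is easy to reproduce. A few lines of Python (an illustration of the arithmetic, not code from the book) evaluate the approximation TP = T1/P + T∞ for both versions:

```python
def predicted_time(work, span, processors):
    """Approximate the running time by T_P = T_1/P + T_infinity."""
    return work / processors + span

# Original version: T1 = 2048 s, Tinf = 1 s.
# Optimized version: T1 = 1024 s, Tinf = 8 s.
# predicted_time(2048, 1, 32)   -> 65.0   predicted_time(1024, 8, 32)  -> 40.0
# predicted_time(2048, 1, 512)  -> 5.0    predicted_time(1024, 8, 512) -> 10.0
```

On 512 processors the 8-second span term dominates the optimized version's 2-second work term, which is exactly the crossover the developers spotted.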

The moral of the story is that work/span analysis, and measurements of work and span, can be superior to measured running times alone in extrapolating an algorithm's scalability.

Exercises

26.1-1
What does a trace for the execution of a serial algorithm look like?

26.1-2
Suppose that line 4 of P-FIB spawns P-FIB(n − 2), rather than calling it as is done in the pseudocode. How would the trace of P-FIB(4) in Figure 26.2 change? What is the impact on the asymptotic work, span, and parallelism?


26.1-3
Draw the trace that results from executing P-FIB(5). Assuming that each strand in the computation takes unit time, what are the work, span, and parallelism of the computation? Show how to schedule the trace on 3 processors using greedy scheduling by labeling each strand with the time step in which it is executed.

26.1-4
Prove that a greedy scheduler achieves the following time bound, which is slightly stronger than the bound proved in Theorem 26.1:

TP ≤ (T1 − T∞)/P + T∞.    (26.5)

26.1-5
Construct a trace for which one execution by a greedy scheduler can take nearly twice the time of another execution by a greedy scheduler on the same number of processors. Describe how the two executions would proceed.

26.1-6
Professor Karan measures her deterministic task-parallel algorithm on 4, 10, and 64 processors of an ideal parallel computer using a greedy scheduler. She claims that the three runs yielded T4 = 80 seconds, T10 = 42 seconds, and T64 = 10 seconds. Argue that the professor is either lying or incompetent. (Hint: Use the work law (26.2), the span law (26.3), and inequality (26.5) from Exercise 26.1-4.)

26.1-7
Give a parallel algorithm to multiply an n × n matrix by an n-vector that achieves Θ(n²/lg n) parallelism while maintaining Θ(n²) work.

26.1-8
Analyze the work, span, and parallelism of the procedure P-TRANSPOSE, which transposes an n × n matrix A in place.

P-TRANSPOSE(A, n)
1  parallel for j = 2 to n
2      parallel for i = 1 to j − 1
3          exchange aij with aji

26.1-9
Suppose that instead of a parallel for loop in line 2, the P-TRANSPOSE procedure in Exercise 26.1-8 had an ordinary for loop. Analyze the work, span, and parallelism of the resulting algorithm.

26.1-10
For what number of processors do the two versions of the chess program run equally fast, assuming that TP = T1/P + T∞?

26.2 Parallel matrix multiplication

In this section, we'll explore how to parallelize the three matrix-multiplication algorithms from Sections 4.1 and 4.2. We'll see that each algorithm can be parallelized in a straightforward fashion using either parallel loops or recursive spawning. We'll analyze them using work/span analysis, and we'll see that each parallel algorithm attains the same performance on one processor as its corresponding serial algorithm, while scaling up to large numbers of processors.

A parallel algorithm for matrix multiplication using parallel loops

The first algorithm we'll study is P-MATRIX-MULTIPLY, which simply parallelizes the two outer loops in the procedure MATRIX-MULTIPLY on page 81.

P-MATRIX-MULTIPLY(A, B, C, n)
1  parallel for i = 1 to n          // compute entries in each of n rows
2      parallel for j = 1 to n      // compute n entries in row i
3          for k = 1 to n
4              cij = cij + aik · bkj    // add in another term of equation (4.1)
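One way to realize the two parallel for loops in ordinary code is to hand every (i, j) pair to a thread pool, since each task owns a distinct entry cij and the tasks never write the same location. This Python sketch (the names and the pooling strategy are my own illustration, not the book's code) mirrors that structure:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def p_matrix_multiply(A, B, C, n):
    """Compute C = C + A*B; the two parallel loops become one pool of tasks."""
    def entry(ij):
        i, j = ij
        # The inner k loop stays serial, exactly as in the pseudocode.
        for k in range(n):
            C[i][j] += A[i][k] * B[k][j]
    with ThreadPoolExecutor() as pool:
        list(pool.map(entry, product(range(n), range(n))))
```

Each task is the sole writer of its C[i][j], so the code is deterministic regardless of how the scheduler interleaves the tasks.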

Let's analyze P-MATRIX-MULTIPLY. Since the serial projection of the algorithm is just MATRIX-MULTIPLY, the work is the same as the running time of MATRIX-MULTIPLY: T1(n) = Θ(n³). The span is T∞(n) = Θ(n), because it follows a path down the tree of recursion for the parallel for loop starting in line 1, then down the tree of recursion for the parallel for loop starting in line 2, and then executes all n iterations of the ordinary for loop starting in line 3, resulting in a total span of Θ(lg n) + Θ(lg n) + Θ(n) = Θ(n). Thus the parallelism is Θ(n³)/Θ(n) = Θ(n²). (Exercise 26.2-3 asks you to parallelize the inner loop to obtain a parallelism of Θ(n³/lg n), which you cannot do straightforwardly using parallel for, because you would create races.)

A parallel divide-and-conquer algorithm for matrix multiplication

Section 4.1 shows how to multiply n × n matrices serially in Θ(n³) time using a divide-and-conquer strategy. Let's see how to parallelize that algorithm using recursive spawning instead of calls.

The serial MATRIX-MULTIPLY-RECURSIVE procedure on page 83 takes as input three n × n matrices A, B, and C and performs the matrix calculation C = C + A · B by recursively performing eight multiplications of n/2 × n/2 submatrices of A and B. The P-MATRIX-MULTIPLY-RECURSIVE procedure on the following page implements the same divide-and-conquer strategy, but it uses spawning to perform the eight multiplications in parallel. To avoid determinacy races in updating the elements of C, it creates a temporary matrix D to store four of the submatrix products. At the end, it adds C and D together to produce the final result. (Problem 26-2 asks you to eliminate the temporary matrix D at the expense of some parallelism.)

Lines 2–3 of P-MATRIX-MULTIPLY-RECURSIVE handle the base case of multiplying 1 × 1 matrices. The remainder of the procedure deals with the recursive case. Line 4 allocates a temporary matrix D, and lines 5–7 zero it. Line 8 partitions each of the four matrices A, B, C, and D into n/2 × n/2 submatrices. (As with MATRIX-MULTIPLY-RECURSIVE on page 83, we're glossing over the subtle issue of how to use index calculations to represent submatrix sections of a matrix.) The spawned recursive call in line 9 sets C11 = C11 + A11 · B11, so that C11 accumulates the first of the two terms in equation (4.5) on page 82. Similarly, lines 10–12 cause each of C12, C21, and C22 in parallel to accumulate the first of the two terms in equations (4.6)–(4.8), respectively. Line 13 sets the submatrix D11 to the submatrix product A12 · B21, so that D11 equals the second of the two terms in equation (4.5). Lines 14–16 set each of D12, D21, and D22 in parallel to the second of the two terms in equations (4.6)–(4.8), respectively. The sync statement in line 17 ensures that all the spawned submatrix products in lines 9–16 have been computed, after which the doubly nested parallel for loops in lines 18–20 add the elements of D to the corresponding elements of C.

P-MATRIX-MULTIPLY-RECURSIVE(A, B, C, n)
 1  if n == 1                          // just one element in each matrix?
 2      c11 = c11 + a11 · b11
 3      return
 4  let D be a new n × n matrix        // temporary matrix
 5  parallel for i = 1 to n            // set D = 0
 6      parallel for j = 1 to n
 7          dij = 0
 8  partition A, B, C, and D into n/2 × n/2 submatrices A11, A12, A21, A22; B11, B12, B21, B22; C11, C12, C21, C22; and D11, D12, D21, D22, respectively
 9  spawn P-MATRIX-MULTIPLY-RECURSIVE(A11, B11, C11, n/2)
10  spawn P-MATRIX-MULTIPLY-RECURSIVE(A11, B12, C12, n/2)
11  spawn P-MATRIX-MULTIPLY-RECURSIVE(A21, B11, C21, n/2)
12  spawn P-MATRIX-MULTIPLY-RECURSIVE(A21, B12, C22, n/2)
13  spawn P-MATRIX-MULTIPLY-RECURSIVE(A12, B21, D11, n/2)
14  spawn P-MATRIX-MULTIPLY-RECURSIVE(A12, B22, D12, n/2)
15  spawn P-MATRIX-MULTIPLY-RECURSIVE(A22, B21, D21, n/2)
16  spawn P-MATRIX-MULTIPLY-RECURSIVE(A22, B22, D22, n/2)
17  sync                               // wait for spawned submatrix products
18  parallel for i = 1 to n            // update C = C + D
19      parallel for j = 1 to n
20          cij = cij + dij
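The divide-and-conquer structure can be mirrored in Python. This is a sketch rather than the book's code: spawn and sync become plain recursive calls (the serial projection), submatrices are addressed by row/column offsets instead of the partitioning of line 8, and n is assumed to be an exact power of 2.

```python
def pmm_rec(A, B, C, n, ai=0, aj=0, bi=0, bj=0, ci=0, cj=0):
    """C[ci:ci+n][cj:cj+n] += A[ai:ai+n][aj:aj+n] * B[bi:bi+n][bj:bj+n]."""
    if n == 1:                                   # base case: 1 x 1 matrices
        C[ci][cj] += A[ai][aj] * B[bi][bj]
        return
    h = n // 2
    D = [[0] * n for _ in range(n)]              # temporary matrix, zeroed
    # First terms accumulate into C (spawned in parallel in the pseudocode).
    pmm_rec(A, B, C, h, ai,     aj,     bi,     bj,     ci,     cj)      # C11 += A11*B11
    pmm_rec(A, B, C, h, ai,     aj,     bi,     bj + h, ci,     cj + h)  # C12 += A11*B12
    pmm_rec(A, B, C, h, ai + h, aj,     bi,     bj,     ci + h, cj)      # C21 += A21*B11
    pmm_rec(A, B, C, h, ai + h, aj,     bi,     bj + h, ci + h, cj + h)  # C22 += A21*B12
    # Second terms go into the temporary D, avoiding races on C.
    pmm_rec(A, B, D, h, ai,     aj + h, bi + h, bj,     0, 0)            # D11 = A12*B21
    pmm_rec(A, B, D, h, ai,     aj + h, bi + h, bj + h, 0, h)            # D12 = A12*B22
    pmm_rec(A, B, D, h, ai + h, aj + h, bi + h, bj,     h, 0)            # D21 = A22*B21
    pmm_rec(A, B, D, h, ai + h, aj + h, bi + h, bj + h, h, h)            # D22 = A22*B22
    for i in range(n):                           # C = C + D (lines 18-20)
        for j in range(n):
            C[ci + i][cj + j] += D[i][j]
```

The eight recursive products and the final addition correspond line by line to lines 9–16 and 18–20 of the pseudocode.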

Let's analyze the P-MATRIX-MULTIPLY-RECURSIVE procedure. We start by analyzing the work M1(n), echoing the serial running-time analysis of its progenitor MATRIX-MULTIPLY-RECURSIVE. The recursive case allocates and zeros the temporary matrix D in Θ(n²) time, partitions in Θ(1) time, performs eight recursive multiplications of n/2 × n/2 matrices, and finishes up with the Θ(n²) work from adding two n × n matrices. Thus the work outside the spawned recursive calls is Θ(n²), and the recurrence for the work M1(n) becomes

M1(n) = 8M1(n/2) + Θ(n²)
      = Θ(n³)

by case 1 of the master theorem (Theorem 4.1). Not surprisingly, the work of this parallel algorithm is asymptotically the same as the running time of the procedure MATRIX-MULTIPLY on page 81, with its triply nested loops.

Let's determine the span M∞(n) of P-MATRIX-MULTIPLY-RECURSIVE. Because the eight parallel recursive spawns all execute on matrices of the same size, the maximum span for any recursive spawn is just the span of a single one of them, or M∞(n/2). The span for the doubly nested parallel for loops in lines 5–7 is Θ(lg n) because each loop control adds Θ(lg n) to the constant span of line 7. Similarly, the doubly nested parallel for loops in lines 18–20 add another Θ(lg n). Matrix partitioning by index calculation has Θ(1) span, which is dominated by the Θ(lg n) span of the nested loops. We obtain the recurrence

M∞(n) = M∞(n/2) + Θ(lg n).    (26.6)

Since this recurrence falls under case 2 of the master theorem with k = 1, the solution is M∞(n) = Θ(lg² n).

The parallelism of P-MATRIX-MULTIPLY-RECURSIVE is M1(n)/M∞(n) = Θ(n³/lg² n), which is huge. (Problem 26-2 asks you to simplify this parallel algorithm at the expense of just a little less parallelism.)

Parallelizing Strassen’s method

To parallelize Strassen's algorithm, we can follow the same general outline as on pages 86–87, but use spawning. You may find it helpful to compare each step below with the corresponding step there. We'll analyze costs as we go along to develop recurrences T1(n) and T∞(n) for the overall work and span, respectively.

1. If n = 1, the matrices each contain a single element. Perform a single scalar multiplication and a single scalar addition, and return. Otherwise, partition the input matrices A and B and output matrix C into n/2 × n/2 submatrices, as in equation (4.2) on page 82. This step takes Θ(1) work and Θ(1) span by index calculation.

2. Create n/2 × n/2 matrices S1, S2, … , S10, each of which is the sum or difference of two submatrices from step 1. Create and zero the entries of seven n/2 × n/2 matrices P1, P2, … , P7 to hold seven n/2 × n/2 matrix products. All 17 matrices can be created, and the Pi initialized, with doubly nested parallel for loops using Θ(n²) work and Θ(lg n) span.

3. Using the submatrices from step 1 and the matrices S1, S2, … , S10 created in step 2, recursively spawn computations of each of the seven n/2 × n/2 matrix products P1, P2, … , P7, taking 7T1(n/2) work and T∞(n/2) span.

4. Update the four submatrices C11, C12, C21, C22 of the result matrix C by adding or subtracting various Pi matrices. Using doubly nested parallel for loops, computing all four submatrices takes Θ(n²) work and Θ(lg n) span.
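The four steps can be sketched in Python. This is an illustration under several assumptions: recursive calls are plain calls rather than spawns, n is a power of 2, it computes C = A · B rather than C = C + A · B, and the standard Strassen sum/difference and product combinations are written inline rather than as the named matrices S1–S10 and P1–P7.

```python
def _add(X, Y): return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
def _sub(X, Y): return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def _split(X):
    """Return the four n/2 x n/2 quadrants X11, X12, X21, X22."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])

def strassen(A, B):
    n = len(A)
    if n == 1:                                   # step 1: base case
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = _split(A)
    B11, B12, B21, B22 = _split(B)
    # Step 3: the seven products (these are the spawned recursive calls).
    P1 = strassen(A11, _sub(B12, B22))
    P2 = strassen(_add(A11, A12), B22)
    P3 = strassen(_add(A21, A22), B11)
    P4 = strassen(A22, _sub(B21, B11))
    P5 = strassen(_add(A11, A22), _add(B11, B22))
    P6 = strassen(_sub(A12, A22), _add(B21, B22))
    P7 = strassen(_sub(A11, A21), _add(B11, B12))
    # Step 4: combine into the four output quadrants.
    C11 = _add(_sub(_add(P5, P4), P2), P6)
    C12 = _add(P1, P2)
    C21 = _add(P3, P4)
    C22 = _sub(_sub(_add(P5, P1), P3), P7)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

Only the seven recursive products in step 3 are spawned in the parallel version; the Θ(n²)-work additions and subtractions are handled with parallel for loops of Θ(lg n) span.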

Let's analyze this algorithm. Since the serial projection is the same as the original serial algorithm, the work is just the running time of the serial projection, namely, Θ(n^lg 7). As we did with P-MATRIX-MULTIPLY-RECURSIVE, we can devise a recurrence for the span. In this case, seven recursive calls execute in parallel, but since they all operate on matrices of the same size, we obtain the same recurrence (26.6) as we did for P-MATRIX-MULTIPLY-RECURSIVE, with solution Θ(lg² n). Thus the parallel version of Strassen's method has parallelism Θ(n^lg 7/lg² n), which is large. Although the parallelism is slightly less than that of P-MATRIX-MULTIPLY-RECURSIVE, that's just because the work is also less.

Exercises

26.2-1
Draw the trace for computing P-MATRIX-MULTIPLY on 2 × 2 matrices, labeling how the vertices in your diagram correspond to strands in the execution of the algorithm. Assuming that each strand executes in unit time, analyze the work, span, and parallelism of this computation.

26.2-2
Repeat Exercise 26.2-1 for P-MATRIX-MULTIPLY-RECURSIVE.

26.2-3
Give pseudocode for a parallel algorithm that multiplies two n × n matrices with work Θ(n³) but span only Θ(lg n). Analyze your algorithm.

26.2-4
Give pseudocode for an efficient parallel algorithm that multiplies a p × q matrix by a q × r matrix. Your algorithm should be highly parallel even if any of p, q, and r equal 1. Analyze your algorithm.

26.2-5
Give pseudocode for an efficient parallel version of the Floyd-Warshall algorithm (see Section 23.2), which computes shortest paths between all pairs of vertices in an edge-weighted graph. Analyze your algorithm.

26.3 Parallel merge sort

We first saw serial merge sort in Section 2.3.1, and in Section 2.3.2 we analyzed its running time and showed it to be Θ(n lg n). Because merge sort already uses the divide-and-conquer method, it seems like a terrific candidate for implementation using fork-join parallelism.

The procedure P-MERGE-SORT modifies merge sort to spawn its recursive calls. Like its serial counterpart MERGE-SORT on page 39, the P-MERGE-SORT procedure sorts the subarray A[p : r]. After the sync statement in line 8 ensures that the two recursive spawns in lines 5 and 7 have finished, P-MERGE-SORT calls the P-MERGE procedure, a parallel merging algorithm, which is on page 779, but you don't need to bother looking at it right now.

P-MERGE-SORT(A, p, r)
 1  if p ≥ r                    // zero or one element?
 2      return
 3  q = ⌊(p + r)/2⌋             // midpoint of A[p : r]
 4  // Recursively sort A[p : q] in parallel.
 5  spawn P-MERGE-SORT(A, p, q)
 6  // Recursively sort A[q + 1 : r] in parallel.
 7  spawn P-MERGE-SORT(A, q + 1, r)
 8  sync                        // wait for spawns
 9  // Merge A[p : q] and A[q + 1 : r] into A[p : r].
10  P-MERGE(A, p, q, r)

First, let's use work/span analysis to get some intuition for why we need a parallel merge procedure. After all, it may seem as though there should be plenty of parallelism just by parallelizing MERGE-SORT without worrying about parallelizing the merge. But what would happen if the call to P-MERGE in line 10 of P-MERGE-SORT were replaced by a call to the serial MERGE procedure on page 36? Let's call the pseudocode so modified P-NAIVE-MERGE-SORT.
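Here is what P-NAIVE-MERGE-SORT boils down to in Python once spawn and sync are erased, that is, its serial projection (a hypothetical rendering of mine, with the serial two-finger merge inlined):

```python
def p_naive_merge_sort(A, p, r):
    """Sort A[p : r] in place; spawn/sync are modeled by plain calls."""
    if p >= r:                       # zero or one element?
        return
    q = (p + r) // 2
    p_naive_merge_sort(A, p, q)      # spawned in the parallel pseudocode
    p_naive_merge_sort(A, q + 1, r)  # spawned in the parallel pseudocode
    # Serial MERGE of A[p : q] and A[q+1 : r]: this is the bottleneck.
    merged, i, j = [], p, q + 1
    while i <= q and j <= r:
        if A[i] <= A[j]:
            merged.append(A[i]); i += 1
        else:
            merged.append(A[j]); j += 1
    merged.extend(A[i:q + 1])
    merged.extend(A[j:r + 1])
    A[p:r + 1] = merged
```

The merge loop runs in Θ(n) serial time, which is exactly the span term the analysis below blames for the poor parallelism.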

Let T1(n) be the (worst-case) work of P-NAIVE-MERGE-SORT on an n-element subarray, where n = r − p + 1 is the number of elements in A[p : r], and let T∞(n) be the span. Because MERGE is serial with running time Θ(n), both its work and span are Θ(n). Since the serial projection of P-NAIVE-MERGE-SORT is exactly MERGE-SORT, its work is T1(n) = Θ(n lg n). The two recursive calls in lines 5 and 7 run in parallel, and so its span is given by the recurrence

T∞(n) = T∞(n/2) + Θ(n)
      = Θ(n),

by case 3 of the master theorem. Thus the parallelism of P-NAIVE-MERGE-SORT is T1(n)/T∞(n) = Θ(lg n), which is an unimpressive amount of parallelism. To sort a million elements, for example, since lg 10⁶ ≈ 20, it might achieve linear speedup on a few processors, but it would not scale up to dozens of processors.
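The arithmetic behind that claim is easy to check; this tiny Python computation (illustrative only) confirms that lg 10⁶ is just under 20:

```python
import math

# Parallelism Theta(lg n) for n = 10**6 gives a speedup ceiling of about 20,
# no matter how many processors are available.
parallelism_ceiling = math.log2(10**6)
```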

The parallelism bottleneck in P-NAIVE-MERGE-SORT is plainly the MERGE procedure. If we asymptotically reduce the span of merging, the master theorem dictates that the span of parallel merge sort will also get smaller. When you look at the pseudocode for MERGE, it may seem that merging is inherently serial, but it's not. We can fashion a parallel merging algorithm. The goal is to reduce the span of parallel merging asymptotically, but if we want an efficient parallel algorithm, we must ensure that the Θ(n) bound on work doesn't increase.

Figure 26.6 depicts the divide-and-conquer strategy that we'll use in P-MERGE. The heart of the algorithm is a recursive auxiliary procedure P-MERGE-AUX that merges two sorted subarrays of an array A into a subarray of another array B in parallel. Specifically, P-MERGE-AUX merges A[p1 : r1] and A[p2 : r2] into the subarray B[p3 : r3], where r3 = p3 + (r1 − p1 + 1) + (r2 − p2 + 1) − 1 = p3 + (r1 − p1) + (r2 − p2) + 1.

The key idea of the recursive merging algorithm in P-MERGE-AUX is to split each of the two sorted subarrays of A around a pivot x, such that all the elements in the lower part of each subarray are at most x and all the elements in the upper part of each subarray are at least x. The procedure can then recurse in parallel on two subtasks: merging the two lower parts, and merging the two upper parts. The trick is to find a pivot x so that the recursion is not too lopsided. We don't want a situation such as that in QUICKSORT on page 183, where bad partitioning elements lead to a dramatic loss of asymptotic efficiency. We could opt to partition around a random element, as RANDOMIZED-QUICKSORT on page 192 does, but because the input subarrays are sorted, P-MERGE-AUX can quickly determine a pivot that always works well.

Specifically, the recursive merging algorithm picks the pivot x as the middle element of the larger of the two input subarrays, which we can assume without loss of generality is A[p1 : r1], since otherwise, the two subarrays can just switch roles. That is, x = A[q1], where q1 = ⌊(p1 + r1)/2⌋. Because A[p1 : r1] is sorted, x is a median of the subarray elements: every element in A[p1 : q1 − 1] is no more than x, and every element in A[q1 + 1 : r1] is no less than x. Then the algorithm finds the "split point" q2 in the smaller subarray A[p2 : r2] such that all the elements in A[p2 : q2 − 1] (if any) are at most x and all the elements in A[q2 : r2] (if any) are at least x. Intuitively, the subarray A[p2 : r2] would still be sorted if x were inserted between A[q2 − 1] and A[q2] (although the algorithm doesn't do that). Since A[p2 : r2] is sorted, a minor variant of binary search (see Exercise 2.3-6) with x as the search key can find the split point q2 in Θ(lg n) time in the worst case. As we'll see when we get to the analysis, even if x splits A[p2 : r2] badly (x is either smaller than all the subarray elements or larger), we'll still have at least 1/4 of the elements in each of the two recursive merges. Thus the larger of the recursive merges operates on at most 3/4 of the elements, and the recursion is guaranteed to bottom out after Θ(lg n) recursive calls.

Figure 26.6 The idea behind P-MERGE-AUX, which merges two sorted subarrays A[p1 : r1] and A[p2 : r2] into the subarray B[p3 : r3] in parallel. Letting x = A[q1] (shown in yellow) be a median of A[p1 : r1] and q2 be a place in A[p2 : r2] such that x would fall between A[q2 − 1] and A[q2], every element in the subarrays A[p1 : q1 − 1] and A[p2 : q2 − 1] (shown in orange) is at most x, and every element in the subarrays A[q1 + 1 : r1] and A[q2 : r2] (shown in blue) is at least x. To merge, compute the index q3 where x belongs in B[p3 : r3], copy x into B[q3], and then recursively merge A[p1 : q1 − 1] with A[p2 : q2 − 1] into B[p3 : q3 − 1] and A[q1 + 1 : r1] with A[q2 : r2] into B[q3 + 1 : r3].

Now let's put these ideas into pseudocode. We start with the serial procedure FIND-SPLIT-POINT(A, p, r, x) on the next page, which takes as input a sorted subarray A[p : r] and a key x. The procedure returns a split point of A[p : r]: an index q in the range p ≤ q ≤ r + 1 such that all the elements in A[p : q − 1] (if any) are at most x and all the elements in A[q : r] (if any) are at least x.

The FIND-SPLIT-POINT procedure uses binary search to find the split point. Lines 1 and 2 establish the range of indices for the search. Each time through the while loop, line 5 compares the middle element of the range with the search key x, and lines 6 and 7 narrow the search range to either the lower half or the upper half of the subarray, depending on the result of the test. In the end, after the range has been narrowed to a single index, line 8 returns that index as the split point.

FIND-SPLIT-POINT(A, p, r, x)
1  low = p                     // low end of search range
2  high = r + 1                // high end of search range
3  while low < high            // more than one element?
4      mid = ⌊(low + high)/2⌋  // midpoint of range
5      if x ≤ A[mid]           // is answer q ≤ mid?
6          high = mid          // narrow search to A[low : mid]
7      else low = mid + 1      // narrow search to A[mid + 1 : high]
8  return low
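A direct Python transcription of the procedure (0-based indices; a sketch of mine, not the book's code) behaves like the standard leftmost binary search:

```python
def find_split_point(A, p, r, x):
    """Return the smallest index q in [p, r+1] with A[q] >= x
    (so every element of A[p : q-1] is <= x)."""
    low, high = p, r + 1
    while low < high:
        mid = (low + high) // 2
        if x <= A[mid]:
            high = mid          # answer is at or below mid
        else:
            low = mid + 1       # answer is above mid
    return low
```

Note that the returned index can be r + 1 when x exceeds every element, matching the range p ≤ q ≤ r + 1 in the text.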

Because FIND-SPLIT-POINT contains no parallelism, its span is just its serial running time, which is also its work. On a subarray A[p : r] of size n = r − p + 1, each iteration of the while loop halves the search range, which means that the loop terminates after Θ(lg n) iterations. Since each iteration takes constant time, the algorithm runs in Θ(lg n) (worst-case) time. Thus the procedure has work and span Θ(lg n).

Let's now look at the pseudocode for the parallel merging procedure P-MERGE on the next page. Most of the pseudocode is devoted to the recursive procedure P-MERGE-AUX. The procedure P-MERGE itself is just a "wrapper" that sets up for P-MERGE-AUX. It allocates a new array B[p : r] to hold the output of P-MERGE-AUX in line 1. It then calls P-MERGE-AUX in line 2, passing the indices of the two subarrays to be merged and providing B as the output destination of the merged result, starting at index p. After P-MERGE-AUX returns, lines 3–4 perform a parallel copy of the output B[p : r] into the subarray A[p : r], which is where P-MERGE-SORT expects it.

The P-MERGE-AUX procedure is the interesting part of the algorithm. Let's start by understanding the parameters of this recursive parallel procedure. The input array A and the four indices p1, r1, p2, r2 specify the subarrays A[p1 : r1] and A[p2 : r2] to be merged. The array B and the index p3 indicate that the merged result should be stored into B[p3 : r3], where r3 = p3 + (r1 − p1) + (r2 − p2) + 1, as we saw earlier. The end index r3 of the output subarray is not needed by the pseudocode, but it helps conceptually to name the end index, as in the comment in line 13.

The procedure begins by checking the base case of the recursion and doing some bookkeeping to simplify the rest of the pseudocode. Lines 1 and 2 test whether the two subarrays are both empty, in which case the procedure returns. Line 3 checks whether the first subarray contains fewer elements than the second subarray. Since the number of elements in the first subarray is r1 − p1 + 1 and the number in the second subarray is r2 − p2 + 1, the test omits the two "+1"s. If the first subarray is the smaller of the two, lines 4 and 5 switch the roles of the subarrays so that A[p1 : r1] refers to the larger subarray for the balance of the procedure.

P-MERGE(A, p, q, r)
1  let B[p : r] be a new array              // allocate scratch array
2  P-MERGE-AUX(A, p, q, q + 1, r, B, p)    // merge from A into B
3  parallel for i = p to r                  // copy B back to A in parallel
4      A[i] = B[i]

P-MERGE-AUX(A, p1, r1, p2, r2, B, p3)
 1  if p1 > r1 and p2 > r2                   // are both subarrays empty?
 2      return
 3  if r1 − p1 < r2 − p2                     // second subarray bigger?
 4      exchange p1 with p2                  // swap subarray roles
 5      exchange r1 with r2
 6  q1 = ⌊(p1 + r1)/2⌋                       // midpoint of A[p1 : r1]
 7  x = A[q1]                                // median of A[p1 : r1] is pivot x
 8  q2 = FIND-SPLIT-POINT(A, p2, r2, x)      // split A[p2 : r2] around x
 9  q3 = p3 + (q1 − p1) + (q2 − p2)          // where x belongs in B
10  B[q3] = x                                // … put it there
11  // Recursively merge A[p1 : q1 − 1] and A[p2 : q2 − 1] into B[p3 : q3 − 1].
12  spawn P-MERGE-AUX(A, p1, q1 − 1, p2, q2 − 1, B, p3)
13  // Recursively merge A[q1 + 1 : r1] and A[q2 : r2] into B[q3 + 1 : r3].
14  spawn P-MERGE-AUX(A, q1 + 1, r1, q2, r2, B, q3 + 1)
15  sync                                     // wait for spawns
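Here is a compact Python sketch of the procedure (0-based indices; an illustration of mine, not the book's code): spawn and sync are modeled by plain recursive calls, and the standard library's bisect_left plays the role of FIND-SPLIT-POINT.

```python
from bisect import bisect_left

def p_merge_aux(A, p1, r1, p2, r2, B, p3):
    """Merge sorted A[p1 : r1] and A[p2 : r2] into B starting at p3."""
    if p1 > r1 and p2 > r2:              # both subarrays empty?
        return
    if r1 - p1 < r2 - p2:                # make A[p1 : r1] the larger one
        p1, p2 = p2, p1
        r1, r2 = r2, r1
    q1 = (p1 + r1) // 2                  # midpoint of the larger subarray
    x = A[q1]                            # pivot: a median of A[p1 : r1]
    q2 = bisect_left(A, x, p2, r2 + 1)   # split point in the smaller subarray
    q3 = p3 + (q1 - p1) + (q2 - p2)      # where x belongs in B
    B[q3] = x
    # These two calls are spawned in parallel in the pseudocode.
    p_merge_aux(A, p1, q1 - 1, p2, q2 - 1, B, p3)
    p_merge_aux(A, q1 + 1, r1, q2, r2, B, q3 + 1)
```

The two recursive calls write disjoint regions of B, which is what makes spawning them in parallel race-free.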

We're now at the crux of P-MERGE-AUX: implementing the parallel divide-and-conquer strategy. As we continue our pseudocode walk, you may find it helpful to refer again to Figure 26.6.

First, the divide step. Line 6 computes the midpoint q1 of A[p1 : r1], which indexes a median x = A[q1] of this subarray to be used as the pivot, and line 7 determines x itself. Next, line 8 uses the FIND-SPLIT-POINT procedure to find the index q2 in A[p2 : r2] such that all elements in A[p2 : q2 − 1] are at most x and all the elements in A[q2 : r2] are at least x. Line 9 computes the index q3 of the element that divides the output subarray B[p3 : r3] into B[p3 : q3 − 1] and B[q3 + 1 : r3], and then line 10 puts x directly into B[q3], which is where it belongs in the output.

Next is the conquer step, which is where the parallel recursion occurs. Lines 12 and 14 each spawn P-MERGE-AUX to recursively merge from A into B, the first to merge the smaller elements and the second to merge the larger elements. The sync statement in line 15 ensures that the subproblems finish before the procedure returns. There is no combine step, as B[p : r] already contains the correct sorted output.

Work/span analysis of parallel merging

Let's first analyze the worst-case span T∞(n) of P-MERGE-AUX on input subarrays that together contain a total of n elements. The call to FIND-SPLIT-POINT in line 8 contributes Θ(lg n) to the span in the worst case, and the procedure performs at most a constant amount of additional serial work outside of the two recursive spawns in lines 12 and 14.

Because the two recursive spawns operate logically in parallel, only one of them contributes to the overall worst-case span. We claimed earlier that neither recursive invocation ever operates on more than 3n/4 elements. Let's see why. Let n1 = r1 − p1 + 1 and n2 = r2 − p2 + 1, where n = n1 + n2, be the sizes of the two subarrays when line 6 starts executing, that is, after we have established that n2 ≤ n1 by swapping the roles of the two subarrays, if necessary. Since the pivot x is a median of A[p1 : r1], in the worst case a recursive merge involves at most n1/2 elements of A[p1 : r1], but it might involve all n2 of the elements of A[p2 : r2]. Thus we can bound the number of elements involved in a recursive invocation of P-MERGE-AUX by

n1/2 + n2 = (2n1 + 4n2)/4
          ≤ (3n1 + 3n2)/4    (since n2 ≤ n1)
          = 3n/4,

proving the claim.

The worst-case span of P-MERGE-AUX can therefore be described by the following recurrence:

T∞(n) = T∞(3n/4) + Θ(lg n).    (26.7)

Because this recurrence falls under case 2 of the master theorem with k = 1, its solution is T∞(n) = Θ(lg² n).

Now let's verify that the work T1(n) of P-MERGE-AUX on n elements is linear. A lower bound of Ω(n) is straightforward, since each of the n elements is copied from array A to array B. We'll show that T1(n) = O(n) by deriving a recurrence for the worst-case work. The binary search in line 8 costs Θ(lg n) in the worst case, which dominates the other work outside of the recursive spawns. For the recursive spawns, observe that although lines 12 and 14 might merge different numbers of elements, the two recursive spawns together merge at most n − 1 elements (since x = A[q1] is not merged). Moreover, as we saw when analyzing the span, a recursive spawn operates on at most 3n/4 elements. We therefore obtain the recurrence

T1(n) = T1(αn) + T1((1 − α)n) + Θ(lg n),    (26.8)

where α lies in the range 1/4 ≤ α ≤ 3/4. The value of α can vary from one recursive invocation to another.

We'll use the substitution method (see Section 4.3) to prove that recurrence (26.8) has solution T1(n) = O(n). (You could also use the Akra-Bazzi method from Section 4.7.) Assume that T1(n) ≤ c1n − c2 lg n for some positive constants c1 and c2. Using the properties of logarithms on pages 66–67, in particular to deduce that lg α + lg(1 − α) = −Θ(1), substitution yields

T1(n) ≤ (c1αn − c2 lg(αn)) + (c1(1 − α)n − c2 lg((1 − α)n)) + Θ(lg n)
      = c1(α + (1 − α))n − c2(lg(αn) + lg((1 − α)n)) + Θ(lg n)
      = c1n − c2(lg α + lg n + lg(1 − α) + lg n) + Θ(lg n)
      = c1n − c2 lg n − c2(lg n + lg α + lg(1 − α)) + Θ(lg n)
      = c1n − c2 lg n − c2(lg n − Θ(1)) + Θ(lg n)
      ≤ c1n − c2 lg n,

if we choose c2 large enough that the c2(lg n − Θ(1)) term dominates the Θ(lg n) term for sufficiently large n. Furthermore, we can choose c1 large enough to satisfy the implied Θ(1) base cases of the recurrence, completing the induction. The lower and upper bounds of Ω(n) and O(n) give T1(n) = Θ(n), asymptotically the same work as for serial merging.

The execution of the pseudocode in the P-MERGE procedure itself

does not add asymptotically to the work and span of P-MERGE-AUX.

The parallel for loop in lines 3–4 has Θ(lg n) span due to the loop control, and each iteration runs in constant time. Thus the Θ(lg² n) span of P-MERGE-AUX dominates, yielding Θ(lg² n) span overall for P-MERGE. The parallel for loop contributes Θ(n) work, matching the asymptotic work of P-MERGE-AUX and yielding Θ(n) work overall for P-MERGE.
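The recursive structure just analyzed can be seen in the following Python sketch, which renders the pseudocode of P-MERGE-AUX serially: the two recursive calls, which the pseudocode spawns in parallel, simply run one after the other, so the sketch exhibits the Θ(n) work but not the Θ(lg² n) span. The function names and 0-based, inclusive index conventions are ours, not the book's.

```python
import bisect

def p_merge_aux(A, p1, r1, p2, r2, B, p3):
    """Merge sorted A[p1..r1] and A[p2..r2] (inclusive, 0-based)
    into B starting at index p3, by divide and conquer."""
    n1 = r1 - p1 + 1
    n2 = r2 - p2 + 1
    if n1 < n2:                      # ensure the first subarray is the longer one
        p1, r1, p2, r2 = p2, r2, p1, r1
        n1, n2 = n2, n1
    if n1 == 0:                      # both subarrays are empty
        return
    q1 = (p1 + r1) // 2              # median of the longer subarray
    x = A[q1]
    # binary search for the split point in the shorter subarray (line 8)
    q2 = bisect.bisect_left(A, x, p2, r2 + 1)
    q3 = p3 + (q1 - p1) + (q2 - p2)  # where x lands in the output
    B[q3] = x
    # the parallel pseudocode spawns the first of these two calls:
    p_merge_aux(A, p1, q1 - 1, p2, q2 - 1, B, p3)
    p_merge_aux(A, q1 + 1, r1, q2, r2, B, q3 + 1)

def p_merge(A, p1, r1, p2, r2):
    B = [None] * ((r1 - p1 + 1) + (r2 - p2 + 1))
    p_merge_aux(A, p1, r1, p2, r2, B, 0)
    return B
```

For example, `p_merge([1, 3, 5, 2, 4, 6], 0, 2, 3, 5)` merges the two sorted halves into `[1, 2, 3, 4, 5, 6]`.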

Analysis of parallel merge sort

The “heavy lifting” is done. Now that we have determined the work and

span of P-MERGE, we can analyze P-MERGE-SORT. Let T 1( n) and

T∞( n) be the work and span, respectively, of P-MERGE-SORT on an

array of n elements. The call to P-MERGE in line 10 of P-MERGE-

SORT dominates the costs of lines 1–3, for both work and span. Thus

we obtain the recurrence

T 1( n) = 2 T 1( n/2) + Θ( n)

for the work of P-MERGE-SORT, and we obtain the recurrence

T∞( n) = T∞( n/2) + Θ(lg² n)

for its span. The work recurrence has solution T 1( n) = Θ( n lg n) by case 2 of the master theorem with k = 0. The span recurrence has solution

T∞( n) = Θ(lg³ n), also by case 2 of the master theorem, but with k = 2.

Parallel merging gives P-MERGE-SORT a parallelism advantage

over P-NAIVE-MERGE-SORT. The parallelism of P-NAIVE-

MERGE-SORT, which calls the serial MERGE procedure, is only Θ(lg

n). For P-MERGE-SORT, the parallelism is

T1(n)/T∞(n) = Θ(n lg n)/Θ(lg³ n)

= Θ(n/lg² n),

which is much better, both in theory and in practice. A good

implementation in practice would sacrifice some parallelism by

coarsening the base case in order to reduce the constants hidden by the

asymptotic notation. For example, you could switch to an efficient serial

sort, perhaps quicksort, when the number of elements to be sorted is

sufficiently small.
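The coarsening idea can be sketched in Python using threads. Below the threshold COARSEN (a value we chose arbitrarily; in practice it would be tuned), the procedure switches to Python's built-in serial sort. The merge step here is serial, so this sketch illustrates only the coarsened divide-and-conquer structure, not the Θ(lg³ n) span that P-MERGE would provide.

```python
import concurrent.futures

COARSEN = 1024   # base-case size, chosen arbitrarily for illustration

def p_merge_sort(A):
    """Coarsened merge sort: small inputs fall back to a serial sort,
    and the two halves of a large input are sorted concurrently."""
    if len(A) <= COARSEN:
        return sorted(A)             # efficient serial sort at the base case
    mid = len(A) // 2
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as ex:
        left_future = ex.submit(p_merge_sort, A[:mid])   # "spawn"
        right = p_merge_sort(A[mid:])
        left = left_future.result()                      # "sync"
    # serial merge; a real implementation would use P-MERGE here
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]
```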

Exercises

26.3-1

Explain how to coarsen the base case of P-MERGE.

26.3-2

Instead of finding a median element in the larger subarray, as P-

MERGE does, suppose that the merge procedure finds a median of all

the elements in the two sorted subarrays using the result of Exercise 9.3-

10. Give pseudocode for an efficient parallel merging procedure that

uses this median-finding procedure. Analyze your algorithm.

26.3-3

Give an efficient parallel algorithm for partitioning an array around a

pivot, as is done by the PARTITION procedure on page 184. You need

not partition the array in place. Make your algorithm as parallel as

possible. Analyze your algorithm. ( Hint: You might need an auxiliary

array and might need to make more than one pass over the input

elements.)

26.3-4

Give a parallel version of FFT on page 890. Make your implementation

as parallel as possible. Analyze your algorithm.

26.3-5

Show how to parallelize SELECT from Section 9.3. Make your implementation as parallel as possible. Analyze your algorithm.

Problems

26-1 Implementing parallel loops using recursive spawning

Consider the parallel procedure SUM-ARRAYS for performing

pairwise addition on n-element arrays A[1 : n] and B[1 : n], storing the sums in C [1 : n].

SUM-ARRAYS(A, B, C, n)
1  parallel for i = 1 to n
2      C[i] = A[i] + B[i]

a. Rewrite the parallel loop in SUM-ARRAYS using recursive spawning

in the manner of P-MAT-VEC-RECURSIVE. Analyze the

parallelism.

Consider another implementation of the parallel loop in SUM-

ARRAYS given by the procedure SUM-ARRAYS′, where the value

grain- size must be specified.

SUM-ARRAYS′(A, B, C, n)
1  grain-size = ?    // to be determined
2  r = ⌈n/grain-size⌉
3  for k = 0 to r − 1
4      spawn ADD-SUBARRAY(A, B, C, k · grain-size + 1, min{(k + 1) · grain-size, n})
5  sync

ADD-SUBARRAY(A, B, C, i, j)
1  for k = i to j
2      C[k] = A[k] + B[k]
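A direct Python rendering of SUM-ARRAYS′ and ADD-SUBARRAY, using one thread per chunk to stand in for spawn and joining the threads to stand in for sync. The function names and the 0-based storage of the 1-based pseudocode indices are our adaptation.

```python
from threading import Thread

def add_subarray(A, B, C, i, j):
    # ADD-SUBARRAY: serial loop over one chunk, 1-based indices i..j inclusive
    for k in range(i, j + 1):
        C[k - 1] = A[k - 1] + B[k - 1]

def sum_arrays_grained(A, B, C, n, grain_size):
    """SUM-ARRAYS': spawn one task per chunk of grain_size
    iterations, then sync by joining all of them."""
    r = -(-n // grain_size)          # ceil(n / grain_size)
    threads = []
    for k in range(r):
        t = Thread(target=add_subarray,
                   args=(A, B, C, k * grain_size + 1,
                         min((k + 1) * grain_size, n)))
        t.start()                    # spawn
        threads.append(t)
    for t in threads:                # sync
        t.join()
```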

b. Suppose that you set grain- size = 1. What is the resulting parallelism?

c. Give a formula for the span of SUM-ARRAYS′ in terms of n and

grain- size. Derive the best value for grain- size to maximize parallelism.

26-2 Avoiding a temporary matrix in recursive matrix multiplication

The P-MATRIX-MULTIPLY-RECURSIVE procedure on page 772

must allocate a temporary matrix D of size n × n, which can adversely affect the constants hidden by the Θ-notation. The procedure has high

parallelism, however: Θ(n³/lg² n). For example, ignoring the constants in the Θ-notation, the parallelism for multiplying 1000 × 1000 matrices

comes to approximately 1000³/10² = 10⁷, since lg 1000 ≈ 10. Most

parallel computers have far fewer than 10 million processors.

a. Parallelize MATRIX-MULTIPLY-RECURSIVE without using

temporary matrices so that it retains its Θ(n³) work. ( Hint: Spawn the recursive calls, but insert a sync in a judicious location to avoid races.)

b. Give and solve recurrences for the work and span of your

implementation.

c. Analyze the parallelism of your implementation. Ignoring the

constants in the Θ-notation, estimate the parallelism on 1000 × 1000

matrices. Compare with the parallelism of P-MATRIX-MULTIPLY-

RECURSIVE, and discuss whether the trade-off would be

worthwhile.

26-3 Parallel matrix algorithms

Before attempting this problem, it may be helpful to read Chapter 28.

a. Parallelize the LU-DECOMPOSITION procedure on page 827 by

giving pseudocode for a parallel version of this algorithm. Make your

implementation as parallel as possible, and analyze its work, span,

and parallelism.

b. Do the same for LUP-DECOMPOSITION on page 830.

c. Do the same for LUP-SOLVE on page 824.

d. Using equation (28.14) on page 835, write pseudocode for a parallel algorithm to invert a symmetric positive-definite matrix. Make your

implementation as parallel as possible, and analyze its work, span,

and parallelism.

26-4 Parallel reductions and scan (prefix) computations

A ⊗-reduction of an array x[1 : n], where ⊗ is an associative operator, is the value y = x[1] ⊗ x[2] ⊗ ⋯ ⊗ x[n]. The REDUCE procedure computes the ⊗-reduction of a subarray x[i : j] serially.

REDUCE(x, i, j)
1  y = x[i]
2  for k = i + 1 to j
3      y = y ⊗ x[k]
4  return y

a. Design and analyze a parallel algorithm P-REDUCE that uses

recursive spawning to perform the same function with Θ( n) work and

Θ(lg n) span.
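One way to structure such a P-REDUCE is recursive halving, shown here as a serial Python sketch: the first recursive call would be spawned, and the combine in the last line happens after the implicit sync. This is our illustrative rendering (0-based, inclusive indices, with ⊗ passed as the function op), not the book's solution.

```python
def p_reduce(op, x, i, j):
    """Compute x[i] (op) x[i+1] (op) ... (op) x[j] by recursive halving.
    With the halves computed by spawned tasks, this gives Theta(n) work
    and Theta(lg n) span; here the halves simply run in sequence."""
    if i == j:
        return x[i]
    k = (i + j) // 2
    left = p_reduce(op, x, i, k)       # spawned in the parallel version
    right = p_reduce(op, x, k + 1, j)
    return op(left, right)             # combine after the sync
```

Because ⊗ is only assumed associative, the halves must be combined in left-to-right order, which a test with string concatenation confirms.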

A related problem is that of computing a ⊗-scan, sometimes called a ⊗-prefix computation, on an array x[1 : n], where ⊗ is once again an associative operator. The ⊗-scan, implemented by the serial procedure SCAN, produces the array y[1 : n] given by

y[1] = x[1],

y[2] = x[1] ⊗ x[2],

y[3] = x[1] ⊗ x[2] ⊗ x[3],

⋮

y[n] = x[1] ⊗ x[2] ⊗ x[3] ⊗ ⋯ ⊗ x[n],

that is, all prefixes of the array x “summed” using the ⊗ operator.

SCAN(x, n)
1  let y[1 : n] be a new array
2  y[1] = x[1]
3  for i = 2 to n
4      y[i] = y[i − 1] ⊗ x[i]
5  return y

Parallelizing SCAN is not straightforward. For example, simply

changing the for loop to a parallel for loop would create races, since

each iteration of the loop body depends on the previous iteration. The

procedures P-SCAN-1 and P-SCAN-1-AUX perform the ⊗-scan in

parallel, albeit inefficiently.

P-SCAN-1(x, n)
1  let y[1 : n] be a new array
2  P-SCAN-1-AUX(x, y, 1, n)
3  return y

P-SCAN-1-AUX(x, y, i, j)
1  parallel for l = i to j
2      y[l] = P-REDUCE(x, 1, l)

b. Analyze the work, span, and parallelism of P-SCAN-1.

The procedures P-SCAN-2 and P-SCAN-2-AUX use recursive

spawning to perform a more efficient ⊗-scan.

P-SCAN-2(x, n)
1  let y[1 : n] be a new array
2  P-SCAN-2-AUX(x, y, 1, n)
3  return y

P-SCAN-2-AUX(x, y, i, j)
1  if i == j
2      y[i] = x[i]
3  else k = ⌊(i + j)/2⌋
4      spawn P-SCAN-2-AUX(x, y, i, k)
5      P-SCAN-2-AUX(x, y, k + 1, j)
6      sync
7      parallel for l = k + 1 to j
8          y[l] = y[k] ⊗ y[l]
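A serial Python rendering of P-SCAN-2 (0-based inclusive indices and function names of our choosing) makes the fix-up in lines 7–8 concrete: after the two halves are scanned independently, the left half's total y[k] is prepended to every entry of the right half.

```python
def p_scan_2_aux(op, x, y, i, j):
    """Serial rendering of P-SCAN-2-AUX.  The first recursive call and
    each iteration of the final loop would run in parallel; here
    everything runs in sequence, so only the logic is illustrated."""
    if i == j:
        y[i] = x[i]
    else:
        k = (i + j) // 2
        p_scan_2_aux(op, x, y, i, k)        # spawn
        p_scan_2_aux(op, x, y, k + 1, j)
        # sync, then fix up the right half with the left half's total
        for l in range(k + 1, j + 1):       # parallel for
            y[l] = op(y[k], y[l])

def p_scan_2(op, x):
    y = [None] * len(x)
    if x:
        p_scan_2_aux(op, x, y, 0, len(x) - 1)
    return y
```

The string-concatenation test below checks that the scan preserves left-to-right order, as required for an operator that is associative but not commutative.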

c. Argue that P-SCAN-2 is correct, and analyze its work, span, and

parallelism.

To improve on both P-SCAN-1 and P-SCAN-2, perform the ⊗-scan

in two distinct passes over the data. The first pass gathers the terms for

various contiguous subarrays of x into a temporary array t, and the second pass uses the terms in t to compute the final result y. The pseudocode in the procedures P-SCAN-3, P-SCAN-UP, and P-SCAN-DOWN on the facing page implements this strategy, but certain

expressions have been omitted.

d. Fill in the three missing expressions in line 8 of P-SCAN-UP and

lines 5 and 6 of P-SCAN-DOWN. Argue that with the expressions

you supplied, P-SCAN-3 is correct. ( Hint: Prove that the value v

passed to P-SCAN-DOWN ( v, x, t, y, i, j) satisfies v = x[1] ⊗ x[2] ⊗

⋯ ⊗ x[ i − 1].)

e. Analyze the work, span, and parallelism of P-SCAN-3.

f. Describe how to rewrite P-SCAN-3 so that it doesn’t require the use

of the temporary array t.

g. Give an algorithm P-SCAN-4( x, n) for a scan that operates in place. It should place its output in x and require only constant

auxiliary storage.

h. Describe an efficient parallel algorithm that uses a +-scan to

determine whether a string of parentheses is well formed. For

example, the string ( ( ) ( ) ) ( ) is well formed, but the string ( ( ) ) ) ( ( )

is not. ( Hint: Interpret ( as a 1 and ) as a −1, and then perform a +-

scan.)
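As a sanity check on the hint in part (h), here is a Python sketch that uses a +-scan, with itertools.accumulate standing in serially for a parallel scan procedure. A string is well formed iff no prefix sum of the ±1 values is negative and the total is 0. The function name is ours.

```python
from itertools import accumulate

def well_formed(s):
    """Check parenthesis nesting with a +-scan: '(' contributes +1
    and ')' contributes -1.  The string is well formed iff every
    prefix sum is nonnegative and the final sum is 0."""
    vals = [1 if c == '(' else -1 for c in s]
    prefix = list(accumulate(vals))          # the +-scan
    return all(p >= 0 for p in prefix) and (not prefix or prefix[-1] == 0)
```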

P-SCAN-3(x, n)
1  let y[1 : n] and t[1 : n] be new arrays
2  y[1] = x[1]
3  if n > 1
4      P-SCAN-UP(x, t, 2, n)
5      P-SCAN-DOWN(x[1], x, t, y, 2, n)
6  return y

P-SCAN-UP(x, t, i, j)
1  if i == j
2      return x[i]
3  else
4      k = ⌊(i + j)/2⌋
5      t[k] = spawn P-SCAN-UP(x, t, i, k)
6      right = P-SCAN-UP(x, t, k + 1, j)
7      sync
8      return ____    // fill in the blank

P-SCAN-DOWN(v, x, t, y, i, j)
1  if i == j
2      y[i] = v ⊗ x[i]
3  else
4      k = ⌊(i + j)/2⌋
5      spawn P-SCAN-DOWN(____, x, t, y, i, k)    // fill in the blank
6      P-SCAN-DOWN(____, x, t, y, k + 1, j)      // fill in the blank
7      sync

26-5 Parallelizing a simple stencil calculation

Computational science is replete with algorithms that require the entries

of an array to be filled in with values that depend on the values of

certain already computed neighboring entries, along with other

information that does not change over the course of the computation.

The pattern of neighboring entries does not change during the

computation and is called a stencil. For example, Section 14.4 presents a stencil algorithm to compute a longest common subsequence, where the


value in entry c[ i, j] depends only on the values in c[ i − 1, j], c[ i, j − 1], and c[ i − 1, j − 1], as well as the elements xi and yj within the two sequences given as inputs. The input sequences are fixed, but the

algorithm fills in the two-dimensional array c so that it computes entry

c[ i, j] after computing all three entries c[ i − 1, j], c[ i, j − 1], and c[ i − 1, j − 1].

This problem examines how to use recursive spawning to parallelize

a simple stencil calculation on an n × n array A in which the value placed into entry A[ i, j] depends only on values in A[ i′, j′], where i′ ≤ i and j′ ≤ j (and of course, i′ ≠ i or j′ ≠ j). In other words, the value in an entry depends only on values in entries that are above it and/or to its

left, along with static information outside of the array. Furthermore, we

assume throughout this problem that once the entries upon which A[ i, j]

depends have been filled in, the entry A[ i, j] can be computed in Θ(1) time (as in the LCS-LENGTH procedure of Section 14.4).

Partition the n × n array A into four n/2 × n/2 subarrays as follows:

    A = ( A11  A12 )
        ( A21  A22 ).    (26.9)

You can immediately fill in subarray A11 recursively, since it does not

depend on the entries in the other three subarrays. Once the

computation of A 11 finishes, you can fill in A 12 and A 21 recursively in parallel, because although they both depend on A 11, they do not

depend on each other. Finally, you can fill in A 22 recursively.

a. Give parallel pseudocode that performs this simple stencil calculation

using a divide-and-conquer algorithm SIMPLE-STENCIL based on

the decomposition (26.9) and the discussion above. (Don’t worry

about the details of the base case, which depends on the specific

stencil.) Give and solve recurrences for the work and span of this

algorithm in terms of n. What is the parallelism?

b. Modify your solution to part (a) to divide an n × n array into nine n/3

× n/3 subarrays, again recursing with as much parallelism as possible.

Analyze this algorithm. How much more or less parallelism does this

algorithm have compared with the algorithm from part (a)?

c. Generalize your solutions to parts (a) and (b) as follows. Choose an

integer b ≥ 2. Divide an n × n array into b 2 subarrays, each of size n/ b

× n/ b, recursing with as much parallelism as possible. In terms of n and b, what are the work, span, and parallelism of your algorithm?

Argue that, using this approach, the parallelism must be o( n) for any

choice of b ≥ 2. ( Hint: For this argument, show that the exponent of n in the parallelism is strictly less than 1 for any choice of b ≥ 2.)

d. Give pseudocode for a parallel algorithm for this simple stencil

calculation that achieves Θ( n/lg n) parallelism. Argue using notions of work and span that the problem has Θ( n) inherent parallelism.

Unfortunately, simple fork-join parallelism does not let you achieve

this maximal parallelism.

26-6 Randomized parallel algorithms

Like serial algorithms, parallel algorithms can employ random-number

generators. This problem explores how to adapt the measures of work,

span, and parallelism to handle the expected behavior of randomized

task-parallel algorithms. It also asks you to design and analyze a

parallel algorithm for randomized quicksort.

a. Explain how to modify the work law (26.2), span law (26.3), and

greedy scheduler bound (26.4) to work with expectations when TP,

T1, and T∞ are all random variables.

b. Consider a randomized parallel algorithm for which 1% of the time,

T1 = 10⁴ and T10,000 = 1, but for the remaining 99% of the time, T1

= T10,000 = 10⁹. Argue that the speedup of a randomized parallel

algorithm should be defined as E[T1]/E[TP], rather than E[T1/TP].

c. Argue that the parallelism of a randomized task-parallel algorithm

should be defined as the ratio E[ T 1]/ E[ T∞].

d. Parallelize the RANDOMIZED-QUICKSORT algorithm on page

192 by using recursive spawning to produce P-RANDOMIZED-

QUICKSORT. (Do not parallelize RANDOMIZED-PARTITION.)

e. Analyze your parallel algorithm for randomized quicksort. ( Hint:

Review the analysis of RANDOMIZED-SELECT on page 230.)

f. Parallelize RANDOMIZED-SELECT on page 230. Make your

implementation as parallel as possible. Analyze your algorithm. ( Hint:

Use the partitioning algorithm from Exercise 26.3-3.)

Chapter notes

Parallel computers and algorithmic models for parallel programming

have been around in various forms for years. Prior editions of this book

included material on sorting networks and the PRAM (Parallel

Random-Access Machine) model. The data-parallel model [58, 217] is another popular algorithmic programming model, which features

operations on vectors and matrices as primitives. The notion of

sequential consistency is due to Lamport [275].

Graham [197] and Brent [71] showed that there exist schedulers achieving the bound of Theorem 26.1. Eager, Zahorjan, and Lazowska

[129] showed that any greedy scheduler achieves this bound and proposed the methodology of using work and span (although not by

those names) to analyze parallel algorithms. Blelloch [57] developed an algorithmic programming model based on work and span (which he

called “depth”) for data-parallel programming. Blumofe and Leiserson

[63] gave a distributed scheduling algorithm for task-parallel computations based on randomized “work-stealing” and showed that it

achieves the bound E[ TP] ≤ T 1/ P + O( T∞). Arora, Blumofe, and Plaxton [20] and Blelloch, Gibbons, and Matias [61] also provided provably good algorithms for scheduling task-parallel computations.

The recent literature contains many algorithms and strategies for

scheduling parallel programs.

The parallel pseudocode and programming model were influenced by

Cilk [290, 291, 383, 396]. The open-source project OpenCilk

(www.opencilk.org) provides Cilk programming as an extension to the C and C++ programming languages. All of the parallel algorithms in

this chapter can be coded straightforwardly in Cilk.

Concerns about nondeterministic parallel programs were expressed

by Lee [281] and Bocchino, Adve, Adve, and Snir [64]. The algorithms literature contains many algorithmic strategies (see, for example, [60, 85,

118, 140, 160, 282, 283, 412, 461]) for detecting races and extending the fork-join model to avoid or safely embrace various kinds of

nondeterminism. Blelloch, Fineman, Gibbons, and Shun [59] showed that deterministic parallel algorithms can often be as fast as, or even

faster than, their nondeterministic counterparts.

Several of the parallel algorithms in this chapter appeared in

unpublished lecture notes by C. E. Leiserson and H. Prokop and were

originally implemented in Cilk. The parallel merge-sorting algorithm

was inspired by an algorithm due to Akl [12].

1 In mathematics, a projection is an idempotent function, that is, a function f such that f ∘ f = f.

In this case, the function f maps the set P of fork-join programs to the set PS ⊂ P of serial programs, which are themselves fork-join programs with no parallelism. For a fork-join program x ∈ P, since we have f(f(x)) = f(x), the serial projection, as we have defined it, is indeed a mathematical projection.

2 Also called a computation dag in the literature.

27 Online Algorithms

Most problems described in this book have assumed that the entire

input was available before the algorithm executes. In many situations,

however, the input becomes available not in advance, but only as the

algorithm executes. This idea was implicit in much of the discussion of

data structures in Part III. The reason that you want to design, for example, a data structure that can handle n INSERT, DELETE, and

SEARCH operations in O(lg n) time per operation is most likely because you are going to receive n such operation requests without

knowing in advance what operations will be coming. This idea was also

implicit in amortized analysis in Chapter 16, where we saw how to maintain a table that can grow or shrink in response to a sequence of

insertion and deletion operations, yet with a constant amortized cost

per operation.

An online algorithm receives its input progressively over time, rather

than having the entire input available at the start, as in an offline

algorithm. Online algorithms pertain to many situations in which

information arrives gradually. A stock trader must make decisions

today, without knowing what the prices will be tomorrow, yet wants to

achieve good returns. A computer system must schedule arriving jobs

without knowing what work will need to be done in the future. A store

must decide when to order more inventory without knowing what the

future demand will be. A driver for a ride-hailing service must decide

whether to pick up a fare without knowing who will request rides in the

future. In each of these situations, and many more, algorithmic

decisions must be made without knowledge of the future.

There are several approaches for dealing with unknown future

inputs. One approach is to form a probabilistic model of future inputs

and design an algorithm that assumes future inputs conform to the

model. This technique is common, for example, in the field of queuing

theory, and it is also related to machine learning. Of course, you might

not be able to develop a workable probabilistic model, or even if you

can, some inputs might not conform to it. This chapter takes a different

approach. Instead of assuming anything about the future input, we

employ a conservative strategy of limiting how poor a solution any

input can entail.

This chapter, therefore, adopts a worst-case approach, designing

online algorithms that guarantee the quality of the solution for all

possible future inputs. We’ll analyze online algorithms by comparing

the solution produced by the online algorithm with a solution produced

by an optimal algorithm that knows the future inputs, and taking a

worst-case ratio over all possible instances. We call this methodology

competitive analysis. We’ll use a similar approach when we study

approximation algorithms in Chapter 35, where we’ll compare the solution returned by an algorithm that might be suboptimal with the

value of the optimal solution, and determine a worst-case ratio over all

possible instances.

We start with a “toy” problem: deciding between whether to take the

elevator or the stairs. This problem will introduce the basic

methodology of thinking about online algorithms and how to analyze

them via competitive analysis. We will then look at two problems that

use competitive analysis. The first is how to maintain a search list so

that the access time is not too large, and the second is about strategies

for deciding which cache blocks to evict from a cache or other kind of

fast computer memory.

27.1 Waiting for an elevator

Our first example of an online algorithm models a problem that you

likely have encountered yourself: whether you should wait for an

elevator to arrive or just take the stairs. Suppose that you enter a

Image 884

building and wish to visit an office that is k floors up. You have two choices: walk up the stairs or take the elevator. Let’s assume, for

convenience, that you can climb the stairs at the rate of one floor per

minute. The elevator travels much faster than you can climb the stairs: it

can ascend all k floors in just one minute. Your dilemma is that you do

not know how long it will take for the elevator to arrive at the ground

floor and pick you up. Should you take the elevator or the stairs? How

do you decide?

Let’s analyze the problem. Taking the stairs takes k minutes, no

matter what. Suppose you know that the elevator takes at most B − 1

minutes to arrive for some value of B that is considerably higher than k.

(The elevator could be going up when you call for it and then stop at

several floors on its way down.) To keep things simple, let’s also assume

that the number of minutes for the elevator to arrive is an integer.

Therefore, waiting for the elevator and taking it k floors up takes

anywhere from one minute (if the elevator is already at the ground floor)

to ( B − 1) + 1 = B minutes (the worst case). Although you know B and k, you don’t know how long the elevator will take to arrive this time.

You can use competitive analysis to inform your decision regarding

whether to take the stairs or elevator. In the spirit of competitive

analysis, you want to be sure that, no matter what the future brings (i.e.,

how long the elevator takes to arrive), you will not wait much longer

than a seer who knows when the elevator will arrive.

Let us first consider what the seer would do. If the seer knows that

the elevator is going to arrive in at most k − 1 minutes, the seer waits for

the elevator, and otherwise, the seer takes the stairs. Letting m denote

the number of minutes it takes for the elevator to arrive at the ground

floor, we can express the time that the seer spends as the function

t(m) = m + 1   if m ≤ k − 1,
       k       if m ≥ k.          (27.1)

We typically evaluate online algorithms by their competitive ratio.

Let U denote the set (universe) of all possible inputs, and consider some

input I ∈ U. For a minimization problem, such as the stairs-versus-elevator problem, if an online algorithm A produces a solution with


value A( I) on input I and the solution from an algorithm F that knows the future has value F( I) on the same input, then the competitive ratio of algorithm A is

max {A(I)/F(I) : I ∈ U}.

If an online algorithm has a competitive ratio of c, we say that it is c-

competitive. The competitive ratio is always at least 1, so that we want

an online algorithm with a competitive ratio as close to 1 as possible.

In the stairs-versus-elevator problem, the only input is the time for

the elevator to arrive. Algorithm F knows this information, but an

online algorithm has to make a decision without knowing when the

elevator will arrive. Consider the algorithm “always take the stairs,”

which always takes exactly k minutes. Using equation (27.1), the competitive ratio is

max {k/t(m) : 0 ≤ m ≤ B − 1}.    (27.2)

Enumerating the terms in equation (27.2) gives the competitive ratio as

max {k/1, k/2, … , k/k, k/k, … , k/k} = k,

so that the competitive ratio is k. The maximum is achieved when the

elevator arrives immediately. In this case, taking the stairs requires k

minutes, but the optimal solution takes just 1 minute.

Now let’s consider the opposite approach: “always take the elevator.”

If it takes m minutes for the elevator to arrive at the ground floor, then

this algorithm will always take m + 1 minutes. Thus the competitive

ratio becomes

max {(m + 1)/t(m) : 0 ≤ m ≤ B − 1},

which we can again enumerate as

max {1/1, 2/2, … , k/k, (k + 1)/k, … , B/k} = B/k.

Now the maximum is achieved when the elevator takes B − 1 minutes to

arrive, compared with the optimal approach of taking the stairs, which

requires k minutes.


Hence, the algorithm “always take the stairs” has competitive ratio

k, and the algorithm “always take the elevator” has competitive ratio B/ k. Because we prefer the algorithm with smaller competitive ratio, if k

= 10 and B = 300, we prefer “always take the stairs,” with competitive

ratio 10, over “always take the elevator,” with competitive ratio 30.

Taking the stairs is not always better, or necessarily more often better.

It’s just that taking the stairs guards better against the worst-case future.

These two approaches of always taking the stairs and always taking

the elevator are extreme solutions, however. Instead, you can “hedge

your bets” and guard even better against a worst-case future. In

particular, you can wait for the elevator for a while, and then if it doesn’t

arrive, take the stairs. How long is “a while”? Let’s say that “a while” is

k minutes. Then the time h(m) required by this hedging strategy, as a function of the number m of minutes before the elevator arrives, is

h(m) = m + 1   if m ≤ k − 1,
       2k      if m ≥ k.

In the second case, h(m) = 2k because you wait for k minutes and then climb the stairs for k minutes. The competitive ratio is now

max {h(m)/t(m) : 0 ≤ m ≤ B − 1}.

Enumerating this ratio yields

max {1/1, 2/2, … , k/k, 2k/k, … , 2k/k} = 2.

The competitive ratio is now 2, independent of k and B.

This example illustrates a common philosophy in online algorithms:

we want an algorithm that guards against any possible worst case.

Initially, waiting for the elevator guards against the case when the

elevator arrives quickly, but eventually switching to the stairs guards

against the case when the elevator takes a long time to arrive.
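The three strategies can also be checked numerically. The Python sketch below (function names ours) evaluates each strategy's worst-case ratio against the seer's time t(m) for the values k = 10 and B = 300 used earlier, confirming ratios of k, B/k, and 2.

```python
def t_seer(m, k):
    # seer's time: wait for the elevator iff it arrives within k - 1 minutes
    return m + 1 if m <= k - 1 else k

def competitive_ratio(strategy_time, k, B):
    # worst case of strategy_time(m) / t_seer(m, k) over 0 <= m <= B - 1
    return max(strategy_time(m) / t_seer(m, k) for m in range(B))

k, B = 10, 300
stairs = competitive_ratio(lambda m: k, k, B)           # always take the stairs
elevator = competitive_ratio(lambda m: m + 1, k, B)     # always take the elevator
hedge = competitive_ratio(
    lambda m: m + 1 if m <= k - 1 else 2 * k, k, B)     # wait k minutes, then climb
```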

Exercises

27.1-1

Suppose that when hedging your bets, you wait for p minutes, instead of for k minutes, before taking the stairs. What is the competitive ratio as a

function of p and k? How should you choose p to minimize the competitive ratio?

27.1-2

Imagine that you decide to take up downhill skiing. Suppose that a pair

of skis costs r dollars to rent for a day and b dollars to buy, where b > r.

If you knew in advance how many days you would ever ski, your

decision whether to rent or buy would be easy. If you’ll ski for at least

⌈b/r⌉ days, then you should buy skis, and otherwise you should rent.

This strategy minimizes the total that you ever spend. In reality, you

don’t know in advance how many days you’ll eventually ski. Even after

you have skied several times, you still don’t know how many more times

you’ll ever ski. Yet you don’t want to waste your money. Give and

analyze an algorithm that has a competitive ratio of 2, that is, an

algorithm guaranteeing that, no matter how many times you ski, you

never spend more than twice what you would have spent if you knew

from the outset how many times you’ll ski.

27.1-3

In “concentration solitaire,” a game for one person, you have n pairs of

matching cards. The backs of the cards are all the same, but the fronts

contain pictures of animals. One pair has pictures of aardvarks, one pair

has pictures of bears, one pair has pictures of camels, and so on. At the

start of the game, the cards are all placed face down. In each round, you

can turn two cards face up to reveal their pictures. If the pictures match,

then you remove that pair from the game. If they don’t match, then you

turn both of them over, hiding their pictures once again. The game ends

when you have removed all n pairs, and your score is how many rounds

you needed to do so. Suppose that you can remember the picture on

every card that you have seen. Give an algorithm to play concentration

solitaire that has a competitive ratio of 2.

27.2 Maintaining a search list

The next example of an online algorithm pertains to maintaining the order of elements in a linked list, as in Section 10.2. This problem often arises in practice for hash tables when collisions are resolved by

chaining (see Section 11.2), since each slot contains a linked list.

Reordering the linked list of elements in each slot of the hash table can

boost the performance of searches measurably.

The list-maintenance problem can be set up as follows. You are given

a list L of n elements { x 1, x 2, … , xn}. We’ll assume that the list is doubly linked, although the algorithms and analysis work just as well

for singly linked lists. Denote the position of element xi in the list L by rL( xi), where 1 ≤ rL( xi) ≤ n. Calling LIST-SEARCH( L, xi) on page 260

thus takes Θ( rL( xi)) time.

If you know in advance something about the distribution of search

requests, then it makes sense to arrange the list ahead of time to put the

more frequently searched elements closer to the front, which minimizes

the total cost (see Exercise 27.2-1). If instead you don’t know anything

about the search sequence, then no matter how you arrange the list, it is

possible that every search is for whatever element appears at the tail of

the list. The total searching time would then be Θ( nm), where m is the

number of searches.

If you notice patterns in the access sequence or you observe

differences in the frequencies in which elements are accessed, then you

might want to rearrange the list as you perform searches. For example,

if you discover that every search is for a particular element, you could

move that element to the front of the list. In general, you could

rearrange the list after each call to LIST-SEARCH. But how would you

do so without knowing the future? After all, no matter how you move

elements around, every search could be for the last element.

But it turns out that some search sequences are “easier” than others.

Rather than just evaluate performance on the worst-case sequence, let’s

compare a reorganization scheme with whatever an optimal offline

algorithm would do if it knew the search sequence in advance. That way,

if the sequence is fundamentally hard, the optimal offline algorithm will

also find it hard, but if the sequence is easy, we can hope to do

reasonably well.


To ease analysis, we’ll drop the asymptotic notation and say that the

cost is just i to search for the ith element in the list. Let’s also assume that the only way to reorder the elements in the list is by swapping two

adjacent elements in the list. Because the list is doubly linked, each swap

incurs a cost of 1. Thus, for example, a search for the sixth element

followed by moving it forward two places (entailing two swaps) incurs a

total cost 8. The goal is to minimize the total cost of calls to LIST-

SEARCH plus the total number of swaps performed.

The online algorithm that we’ll explore is MOVE-TO-FRONT( L, x).

This procedure first searches for x in the doubly linked list L, and then it moves x to the front of the list.1 If x is located at position r = rL( x) before the call, MOVE-TO-FRONT swaps x with the element in

position r − 1, then with the element in position r − 2, and so on, until it finally swaps x with the element in position 1. Thus if the call MOVE-TO-FRONT( L, 8) executes on the list L = 〈5, 3, 12, 4, 8, 9, 22〉, the list becomes 〈8, 5, 3, 12, 4, 9, 22〉. The call MOVE-TO-FRONT( L, k) costs

2 rL( k) − 1: it costs rL( k) to search for k, and it costs 1 for each of the rL( k) − 1 swaps that move k to the front of the list.
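In Python, with an ordinary list standing in for the doubly linked list, the procedure and its cost accounting can be sketched as follows (a sketch; `move_to_front` is our name for it, and the r − 1 adjacent swaps are collapsed into one pop-and-insert):

```python
def move_to_front(L, x):
    """Search for x in L, move it to the front, and return the cost.

    Finding x at position r (1-indexed) costs r, and the r - 1
    adjacent swaps that bring it to the front cost 1 each, for a
    total of 2r - 1.
    """
    r = L.index(x) + 1          # search cost: rank of x in L
    L.insert(0, L.pop(r - 1))   # the r - 1 adjacent swaps, collapsed
    return 2 * r - 1

L = [5, 3, 12, 4, 8, 9, 22]
cost = move_to_front(L, 8)
print(L, cost)   # [8, 5, 3, 12, 4, 9, 22] 9
```

Here 8 sits at rank 5, so the call costs 2 · 5 − 1 = 9, matching the formula above.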

Figure 27.1 The costs incurred by the procedures FORESEE and MOVE-TO-FRONT when searching for the elements 5, 3, 4, and 4, starting with the list L = 〈1, 2, 3, 4, 5〉. If FORESEE

instead moved 3 to the front after the search for 5, the cumulative cost would not change, nor would the cumulative cost change if 4 moved to the second position after the search for 5.

We’ll see that MOVE-TO-FRONT has a competitive ratio of 4. Let’s

think about what this means. MOVE-TO-FRONT performs a series of

operations on a doubly linked list, accumulating cost. For comparison,

suppose that there is an algorithm FORESEE that knows the future.

Like MOVE-TO-FRONT, it also searches the list and moves elements

around, but after each call it optimally rearranges the list for the future.


(There may be more than one optimal order.) Thus FORESEE and

MOVE-TO-FRONT maintain different lists of the same elements.

Consider the example shown in Figure 27.1. Starting with the list 〈1, 2, 3, 4, 5〉, four searches occur, for the elements 5, 3, 4, and 4. The

hypothetical procedure FORESEE, after searching for 3, moves 4 to the

front of the list, knowing that a search for 4 is imminent. It thus incurs a

swap cost of 3 upon its second call, after which no further swap costs

accrue. MOVE-TO-FRONT incurs swap costs in each step, moving the

found element to the front. In this example, MOVE-TO-FRONT has a

higher cost in each step, but that is not necessarily always the case.
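The two totals for this instance can be tallied mechanically (a Python sketch; FORESEE's one prescient move is hard-coded from the figure's description, and the helper name is ours):

```python
def mtf_cost(L, x):
    """Move-to-front: search cost r plus r - 1 swaps, i.e., 2r - 1."""
    r = L.index(x) + 1
    L.insert(0, L.pop(r - 1))
    return 2 * r - 1

requests = [5, 3, 4, 4]

LM = [1, 2, 3, 4, 5]
total_mtf = sum(mtf_cost(LM, x) for x in requests)

# FORESEE: search costs only, plus 3 swaps to move 4 to the front
# right after the second search (as described for Figure 27.1).
LF = [1, 2, 3, 4, 5]
total_foresee = 0
for i, x in enumerate(requests):
    total_foresee += LF.index(x) + 1
    if i == 1:                       # after searching for 3
        total_foresee += 3           # swap cost of moving 4 forward
        LF.insert(0, LF.pop(LF.index(4)))

print(total_mtf, total_foresee)      # 26 13
```

MOVE-TO-FRONT pays 26 against FORESEE's 13, comfortably within the factor of 4 proved below.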

The key to proving the competitive bound is to show that at any

point, the total cost of MOVE-TO-FRONT is not much higher than

that of FORESEE. Surprisingly, we can determine a bound on the costs

incurred by MOVE-TO-FRONT relative to FORESEE even though

MOVE-TO-FRONT cannot see the future.

If we compare any particular step, MOVE-TO-FRONT and

FORESEE may be operating on very different lists and do very

different things. If we focus on the search for 4 above, we observe that

FORESEE actually moves it to the front of the list early, paying to

move the element to the front before it is accessed. To capture this

concept, we use the idea of an inversion: a pair of elements, say a and b, in which a appears before b in one list, but b appears before a in another list. For two lists L and L′, let I( L, L′), called the inversion count, denote the number of inversions between the two lists, that is, the number of

pairs of elements whose order differs in the two lists. For example, with

lists L = 〈5,3,1,4,2〉 and L′ = 〈3,1,2,4,5〉, then out of the 10 pairs of elements, exactly five of them—(1, 5), (2, 4), (2, 5), (3, 5), (4, 5)—are inversions,

since these pairs, and only these pairs, appear in different orders in the

two lists. Thus the inversion count is I( L, L′) = 5.
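A direct, quadratic-time way to compute the inversion count (the name `inversion_count` is ours):

```python
from itertools import combinations

def inversion_count(L1, L2):
    """Number of pairs whose relative order differs in L1 and L2."""
    pos2 = {x: i for i, x in enumerate(L2)}
    # combinations(L1, 2) yields each pair in L1's order, so
    # pos2[a] > pos2[b] means the pair is ordered oppositely in L2.
    return sum(1 for a, b in combinations(L1, 2) if pos2[a] > pos2[b])

print(inversion_count([5, 3, 1, 4, 2], [3, 1, 2, 4, 5]))   # 5
```

This reproduces the count of 5 for the example lists above.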

In order to analyze the algorithm, we define the following notation.

Let LMi be the list maintained by MOVE-TO-FRONT immediately after the i th search, and similarly, let LFi be FORESEE's list immediately after the i th search. Let cMi and cFi be the costs incurred by MOVE-TO-FRONT and FORESEE on their i th calls, respectively. We don't know how many swaps FORESEE performs in its i th call, but we'll denote


that number by ti. Therefore, if the i th operation is a search for element x, then

cMi = 2 rLMi−1( x) − 1,     (27.3)
cFi = rLFi−1( x) + ti,

where rLMi−1( x) and rLFi−1( x) denote the rank of x in each list just before the i th search.

In order to compare these costs more carefully, let’s break down the

elements into subsets, depending on their positions in the two lists

before the i th search, relative to the element x being searched for in the i th search. We define three sets:

BB = {elements before x in both LMi−1 and LFi−1},
BA = {elements before x in LMi−1 but after x in LFi−1},
AB = {elements after x in LMi−1 but before x in LFi−1}.

We can now relate the position of element x in LMi−1 and LFi−1 to the sizes of these sets:

rLMi−1( x) = | BB| + | BA| + 1,
rLFi−1( x) = | BB| + | AB| + 1.
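These set definitions are easy to check on small lists (a sketch; `split_sets` is an assumed helper name). The assertions reflect that the elements preceding x in the two lists are exactly BB ∪ BA and BB ∪ AB, respectively:

```python
def rank(L, x):
    """1-indexed position of x in list L."""
    return L.index(x) + 1

def split_sets(LM, LF, x):
    """The sets BB, BA, AB relative to the searched-for element x."""
    before_M = set(LM[:LM.index(x)])   # elements before x in LM
    before_F = set(LF[:LF.index(x)])   # elements before x in LF
    BB = before_M & before_F
    BA = before_M - before_F
    AB = before_F - before_M
    return BB, BA, AB

LM = [3, 5, 1, 2, 4]
LF = [4, 1, 2, 3, 5]
x = 2
BB, BA, AB = split_sets(LM, LF, x)
assert rank(LM, x) == len(BB) + len(BA) + 1   # 4 = 1 + 2 + 1
assert rank(LF, x) == len(BB) + len(AB) + 1   # 3 = 1 + 1 + 1
```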

When a swap occurs in one of the lists, it changes the relative

positions of the two elements involved, which in turn changes the

inversion count. Suppose that elements x and y are swapped in some list. Then the only possible difference in the inversion count between

this list and any other list depends on whether ( x, y) is an inversion. In fact, the inversion status of the pair ( x, y) with respect to any other list must flip: if ( x, y) is an inversion before the swap, it no longer is afterward, and vice versa. Therefore, if two consecutive elements x and

y swap positions in a list L, then for any other list L′, the value of the inversion count I( L, L′) either increases by 1 or decreases by 1.
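This ±1 behavior can be checked exhaustively on the earlier example (a sketch; the helper name is ours):

```python
from itertools import combinations

def inversions(L1, L2):
    """Number of pairs whose relative order differs between L1 and L2."""
    pos2 = {v: i for i, v in enumerate(L2)}
    return sum(1 for a, b in combinations(L1, 2) if pos2[a] > pos2[b])

L_other = [3, 1, 2, 4, 5]
L = [5, 3, 1, 4, 2]
diffs = []
for i in range(len(L) - 1):
    before = inversions(L, L_other)
    L[i], L[i + 1] = L[i + 1], L[i]      # swap adjacent elements
    diffs.append(inversions(L, L_other) - before)

# Every adjacent swap changes the inversion count by exactly 1,
# since only the swapped pair's status can flip.
assert all(abs(d) == 1 for d in diffs)
```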

As we compare MOVE-TO-FRONT and FORESEE searching and

modifying their lists, we’ll think about MOVE-TO-FRONT executing

on its list for the i th time and then FORESEE executing on its list for

the i th time. After MOVE-TO-FRONT has executed for the i th time and before FORESEE has executed for the i th time, we’ll compare

I( LMi−1, LFi−1) (the inversion count immediately before the i th call of MOVE-TO-FRONT) with I( LMi, LFi−1) (the inversion count after the i th call of


MOVE-TO-FRONT but before the i th call of FORESEE). We’ll

concern ourselves later with what FORESEE does.

Let us analyze what happens to the inversion count after executing

the i th call of MOVE-TO-FRONT, and suppose that it searches for

element x. More precisely, we'll compute I( LMi, LFi−1) − I( LMi−1, LFi−1), the

change in the inversion count, which gives a rough idea of how much

MOVE-TO-FRONT’s list becomes more or less like FORESEE’s list.

After searching, MOVE-TO-FRONT performs a series of swaps with

each of the elements on the list LMi−1 that precede x. Using the notation above, the number of such swaps is | BB| + | BA|. Bearing in mind that the list LFi−1 has yet to be changed by the i th call of FORESEE, let's see how

the inversion count changes.

Consider a swap with an element y ∈ BB. Before the swap, y precedes x in both LMi−1 and LFi−1. After the swap, x precedes y in MOVE-TO-FRONT's list, and LFi−1 does not change. Therefore, the inversion count increases by 1 for each element in BB. Now consider a swap with an element z ∈ BA. Before the swap, z precedes x in LMi−1 but x precedes z in LFi−1. After the swap, x precedes z in both lists. Therefore, the inversion count decreases by 1 for each element in BA. Thus altogether, the inversion count

increases by

I( LMi, LFi−1) − I( LMi−1, LFi−1) = | BB| − | BA|.     (27.7)

We have laid the groundwork needed to analyze MOVE-TO-

FRONT.

Theorem 27.1

Algorithm MOVE-TO-FRONT has a competitive ratio of 4.

Proof The proof uses a potential function, as described in Chapter 16

on amortized analysis. The value Φ i of the potential function after the

i th calls of MOVE-TO-FRONT and FORESEE depends on the

inversion count:

Φ i = 2 · I( LMi, LFi).

(Intuitively, the factor of 2 embodies the notion that each inversion

represents a cost of 2 for MOVE-TO-FRONT relative to FORESEE: 1


for searching and 1 for swapping.) By equation (27.7), after the i th call

of MOVE-TO-FRONT, but before the i th call of FORESEE, the

potential increases by 2(| BB| − | BA|). Since the inversion count of the two lists is nonnegative, we have Φ i ≥ 0 for all i ≥ 0. Assuming that MOVE-TO-FRONT and FORESEE start with the same list, the initial

potential Φ0 is 0, so that Φ i ≥ Φ0 for all i.

Drawing from equation (16.2) on page 456, the amortized cost of

the i th MOVE-TO-FRONT operation is

ĉi = cMi + Φ i − Φ i−1,

where cMi, the actual cost of the i th MOVE-TO-FRONT operation, is given by equation (27.3):

cMi = 2 rLMi−1( x) − 1.

Now, let’s consider the potential change Φ i − Φ i−1. Since both LM and LF change, let’s consider the changes to one list at a time. Recall that

when MOVE-TO-FRONT moves element x to the front, it increases the

potential by exactly 2(| BB| − | BA|). We now consider how the optimal

algorithm FORESEE changes its list LF: it performs ti swaps. Each swap performed by FORESEE either increases or decreases the

potential by 2, and thus the increase in potential by FORESEE in the

i th call can be at most 2 ti. We therefore have

ĉi = cMi + Φ i − Φ i−1
   ≤ (2 rLMi−1( x) − 1) + 2(| BB| − | BA|) + 2 ti
   = 2(| BB| + | BA| + 1) − 1 + 2(| BB| − | BA|) + 2 ti
   = 4| BB| + 1 + 2 ti
   ≤ 4(| BB| + 1) + 4 ti
   ≤ 4( rLFi−1( x) + ti)
   = 4 cFi,

since rLFi−1( x) = | BB| + | AB| + 1 ≥ | BB| + 1.

We now finish the proof as in Chapter 16 by showing that the total

amortized cost provides an upper bound on the total actual cost,

because the initial potential function is 0 and the potential function is


always nonnegative. By equation (16.3) on page 456, for any sequence of

m MOVE-TO-FRONT operations, we have

cM1 + cM2 + ⋯ + cMm ≤ ĉ1 + ĉ2 + ⋯ + ĉm.

Therefore, we have

cM1 + cM2 + ⋯ + cMm ≤ 4( cF1 + cF2 + ⋯ + cFm).

Thus the total cost of the m MOVE-TO-FRONT operations is at most 4

times the total cost of the m FORESEE operations, so MOVE-TO-

FRONT is 4-competitive.
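As a sanity check, the amortized inequality ĉi ≤ 4 cFi can be verified step by step on the Figure 27.1 instance (a Python sketch; FORESEE's single prescient move is hard-coded from the figure's description, and the potential is twice the inversion count, as in the proof):

```python
from itertools import combinations

def inversions(L1, L2):
    """Number of pairs whose relative order differs between L1 and L2."""
    pos2 = {v: i for i, v in enumerate(L2)}
    return sum(1 for a, b in combinations(L1, 2) if pos2[a] > pos2[b])

LM = [1, 2, 3, 4, 5]              # MOVE-TO-FRONT's list
LF = [1, 2, 3, 4, 5]              # FORESEE's list
phi = 2 * inversions(LM, LF)      # potential: Phi_0 = 0
cm_total = cf_total = 0

for i, x in enumerate([5, 3, 4, 4]):
    # MOVE-TO-FRONT: search for x, then move it to the front.
    r = LM.index(x) + 1
    LM.insert(0, LM.pop(r - 1))
    cM = 2 * r - 1
    # FORESEE: search cost only, except its one prescient move of 4
    # to the front after the second search (per Figure 27.1).
    cF = LF.index(x) + 1
    if i == 1:
        cF += 3                                # three swap costs
        LF.insert(0, LF.pop(LF.index(4)))
    new_phi = 2 * inversions(LM, LF)
    assert cM + new_phi - phi <= 4 * cF        # amortized cost <= 4*cF
    phi = new_phi
    cm_total += cM
    cf_total += cF

print(cm_total, cf_total)                      # 26 13
```

The per-step assertion holds at every request, and the totals confirm that MOVE-TO-FRONT's cost of 26 is within 4 times FORESEE's cost of 13.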

Isn’t it amazing that we can compare MOVE-TO-FRONT with the

optimal algorithm FORESEE when we have no idea of the swaps that

FORESEE makes? We were able to relate the performance of MOVE-

TO-FRONT to the optimal algorithm by capturing how particular

properties (swaps in this case) must evolve relative to the optimal

algorithm, without actually knowing the optimal algorithm.

The online algorithm MOVE-TO-FRONT has a competitive ratio of

4: on any input sequence, it incurs a cost at most 4 times that of any

other algorithm. On a particular input sequence, it could cost much less

than 4 times the optimal algorithm, perhaps even matching the optimal

algorithm.

Exercises

27.2-1

You are given a set S = { x 1, x 2, … , xn} of n elements, and you wish to make a static list L (no rearranging once the list is created) containing


the elements of S that is good for searching. Suppose that you have a

probability distribution, where p( xi) is the probability that a given search searches for element xi. Argue that the expected cost for m searches is

m ( p( x 1) rL( x 1) + p( x 2) rL( x 2) + ⋯ + p( xn) rL( xn)).

Prove that this sum is minimized when the elements of L are sorted in

decreasing order with respect to p( xi).

27.2-2

Professor Carnac claims that since FORESEE is an optimal algorithm

that knows the future, then at each step it must incur no more cost than

MOVE-TO-FRONT. Either prove that Professor Carnac is correct or

provide a counterexample.

27.2-3

Another way to maintain a linked list for efficient searching is for each

element to maintain a frequency count: the number of times that the

element has been searched for. The idea is to rearrange list elements

after searches so that the list is always sorted by decreasing frequency

count, from largest to smallest. Either show that this algorithm is O(1)-

competitive, or prove that it is not.

27.2-4

The model in this section charged a cost of 1 for each swap. We can

consider an alternative cost model in which, after accessing x, you can

move x anywhere earlier in the list, and there is no cost for doing so.

The only cost is the cost of the actual accesses. Show that MOVE-TO-

FRONT is 2-competitive in this cost model, assuming that the number of requests is sufficiently large. ( Hint: Use the potential function Φ i = I( LMi, LFi).)

27.3 Online caching

In Section 15.4, we studied the caching problem, in which blocks of data from the main memory of a computer are stored in the cache: a small

but faster memory. In that section, we studied the offline version of the

problem, in which we assumed that we knew the sequence of memory

requests in advance, and we designed an algorithm to minimize the

number of cache misses. In almost all computer systems, caching is, in

fact, an online problem. We do not generally know the series of cache

requests in advance; they are presented to the algorithm only as the

requests for blocks are actually made. To gain a better understanding of

this more realistic scenario, we analyze online algorithms for caching.

We will first see that all deterministic online algorithms for caching have

a lower bound of Ω( k) for the competitive ratio, where k is the size of the cache. We will then present an algorithm with a competitive ratio of

Θ( n), where the input size is n, and one with a competitive ratio of O( k), which matches the lower bound. We will end by showing how to use

randomization to design an algorithm with a much better competitive

ratio of Θ(lg k). We will also discuss the assumptions that underlie randomized online algorithms, via the notion of an adversary, such as

we saw in Chapter 11 and will see in Chapter 31.

You can find the terminology used to describe the caching problem

in Section 15.4, which you might wish to review before proceeding.

27.3.1 Deterministic caching algorithms

In the caching problem, the input comprises a sequence of n memory

requests, for data in blocks b 1, b 2, … , bn, in that order. The blocks requested are not necessarily distinct: each block may appear multiple

times within the request sequence. After block bi is requested, it resides

in a cache that can hold up to k blocks, where k is a fixed cache size. We assume that n > k, since otherwise we are assured that the cache can hold all the requested blocks at once. When a block bi is requested, if it

is already in the cache, then a cache hit occurs and the cache remains

unchanged. If bi is not in the cache, then a cache miss occurs. If the cache contains fewer than k blocks upon a cache miss, block bi is placed into the cache, which now contains one block more than before. If a

cache miss occurs with an already full cache, however, some block must be evicted from the cache before bi can enter. Thus, a caching algorithm

must decide which block to evict from the cache upon a cache miss

when the cache is full. The goal is to minimize the number of cache

misses over the entire request sequence. The caching algorithms

considered in this chapter differ only in which block they decide to evict

upon a cache miss. We do not consider abilities such as prefetching, in

which a block is brought into the cache before an upcoming request in

order to avert a future cache miss.
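Under these rules, counting the misses incurred by any eviction policy is a short simulation (a hypothetical sketch; `count_misses` and the pluggable `choose_victim` parameter are our names, not from the text):

```python
def count_misses(requests, k, choose_victim):
    """Simulate a cache holding up to k blocks; return the miss count.

    choose_victim(cache, b) picks which cached block to evict when
    block b misses on a full cache.
    """
    cache = set()
    misses = 0
    for b in requests:
        if b in cache:
            continue                  # cache hit: cache unchanged
        misses += 1                   # cache miss
        if len(cache) == k:           # full cache: evict before inserting b
            cache.remove(choose_victim(cache, b))
        cache.add(b)
    return misses

# A deliberately naive policy: evict an arbitrary cached block.
misses = count_misses([1, 2, 3, 1, 4, 2], k=3,
                      choose_victim=lambda cache, b: next(iter(cache)))
```

The policies discussed next differ only in what they pass for `choose_victim`.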

There are many online caching policies to determine which block to

evict, including the following:

First-in, first-out (FIFO): evict the block that has been in the

cache the longest time.