races, the P-MAT-VEC-WRONG procedure on the next page is a faulty
parallel implementation of matrix-vector multiplication that achieves a
span of Θ(lg n) by parallelizing the inner for loop. This procedure is incorrect, unfortunately, due to determinacy races when updating yi in
line 3, which executes in parallel for all n values of j.
Index variables of parallel for loops, such as i in line 1 and j in line 2, do not cause races between iterations. Conceptually, each iteration of
the loop creates an independent variable to hold the index of that
iteration during that iteration’s execution of the loop body. Even if two
parallel iterations both access the same index variable, they really are
accessing different variable instances—hence different memory
locations—and no race occurs.
P-MAT-VEC-WRONG(A, x, y, n)
1  parallel for i = 1 to n
2      parallel for j = 1 to n
3          yi = yi + aij · xj    // determinacy race
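One race-free fix keeps the outer loop parallel but leaves the inner loop serial, so that each task writes only its own y[i]. Here is a minimal Python sketch of that idea (the function name p_mat_vec and the thread-pool decomposition are choices of this sketch, not from the text):

```python
from concurrent.futures import ThreadPoolExecutor

def p_mat_vec(A, x):
    """Race-free parallel matrix-vector product y = A x.

    Only the outer loop is parallelized: each task computes one y[i],
    so no two tasks ever write the same memory location."""
    n = len(A)
    def row(i):
        # The inner loop stays serial; summing into a task-local
        # accumulator avoids the shared-update race of P-MAT-VEC-WRONG.
        return sum(A[i][j] * x[j] for j in range(n))
    with ThreadPoolExecutor() as pool:
        return list(pool.map(row, range(n)))
```

This sacrifices the Θ(lg n) span of the (incorrect) doubly parallel version, but matches the correct pseudocode's strategy of parallelizing only over rows.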
A parallel algorithm with races can sometimes be deterministic. As
an example, two parallel threads might store the same value into a
shared variable, and it wouldn’t matter which stored the value first. For
simplicity, however, we generally prefer code without determinacy races,
even if the races are benign. And good parallel programmers frown on
code with determinacy races that cause nondeterministic behavior, if
deterministic code that performs comparably is an option.
But nondeterministic code does have its place. For example, you
can’t implement a parallel hash table, a highly practical data structure,
without writing code containing determinacy races. Much research has
centered around how to extend the fork-join model to incorporate
limited “structured” nondeterminism while avoiding the full measure of
complications that arise when nondeterminism is completely
unrestricted.
A chess lesson
To illustrate the power of work/span analysis, this section closes with a
true story that occurred during the development of one of the first
world-class parallel chess-playing programs [106] many years ago. The timings below have been simplified for exposition.
The chess program was developed and tested on a 32-processor
computer, but it was designed to run on a supercomputer with 512
processors. Since the supercomputer availability was limited and
expensive, the developers ran benchmarks on the small computer and
extrapolated performance to the large computer.
At one point, the developers incorporated an optimization into the
program that reduced its running time on an important benchmark on the small machine from T32 = 65 seconds to T′32 = 40 seconds. Yet, the
developers used the work and span performance measures to conclude
that the optimized version, which was faster on 32 processors, would
actually be slower than the original version on the 512 processors of the
large machine. As a result, they abandoned the “optimization.”
Here is their work/span analysis. The original version of the program
had work T 1 = 2048 seconds and span T∞= 1 second. Let’s treat inequality (26.4) on page 760 as the equation TP = T 1/ P + T∞, which we can use as an approximation to the running time on P processors.
Then indeed we have T 32 = 2048/32 + 1 = 65. With the optimization,
the work becomes T′1 = 1024 seconds, and the span becomes T′∞ = 8
seconds. Our approximation gives T′32 = 1024/32 + 8 = 40.
The relative speeds of the two versions switch when we estimate their
running times on 512 processors, however. The first version has a
running time of T512 = 2048/512 + 1 = 5 seconds, and the second version runs in T′512 = 1024/512 + 8 = 10 seconds. The optimization that speeds up
the program on 32 processors makes the program run for twice as long
on 512 processors! The optimized version’s span of 8, which is not the
dominant term in the running time on 32 processors, becomes the
dominant term on 512 processors, nullifying the advantage from using
more processors. The optimization does not scale up.
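The model behind this analysis is easy to reproduce. The following Python sketch (the helper name predicted_time is hypothetical) tabulates the approximation TP = T1/P + T∞ for both versions:

```python
def predicted_time(work, span, p):
    """Running-time model T_P = T_1/P + T_inf used in the chess story
    (inequality (26.4) treated as an equation; an approximation, not a
    measurement)."""
    return work / p + span

# Original version:    work = 2048 s, span = 1 s.
# "Optimized" version: work = 1024 s, span = 8 s.
# On 32 processors the optimization looks good (40 s vs. 65 s), but on
# 512 processors it is twice as slow (10 s vs. 5 s): once T_1/P shrinks
# below T_inf, the span term dominates.
```

Evaluating the model at P = 32 and P = 512 reproduces the crossover described above.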
The moral of the story is that work/span analysis, and measurements
of work and span, can be superior to measured running times alone in
extrapolating an algorithm’s scalability.
Exercises
26.1-1
What does a trace for the execution of a serial algorithm look like?
26.1-2
Suppose that line 4 of P-FIB spawns P-FIB ( n − 2), rather than calling
it as is done in the pseudocode. How would the trace of P-FIB(4) in
Figure 26.2 change? What is the impact on the asymptotic work, span, and parallelism?
26.1-3
Draw the trace that results from executing P-FIB(5). Assuming that
each strand in the computation takes unit time, what are the work,
span, and parallelism of the computation? Show how to schedule the
trace on 3 processors using greedy scheduling by labeling each strand
with the time step in which it is executed.
26.1-4
Prove that a greedy scheduler achieves the following time bound, which is slightly stronger than the bound proved in Theorem 26.1:

TP ≤ (T1 − T∞)/P + T∞.    (26.5)
26.1-5
Construct a trace for which one execution by a greedy scheduler can
take nearly twice the time of another execution by a greedy scheduler on
the same number of processors. Describe how the two executions would
proceed.
26.1-6
Professor Karan measures her deterministic task-parallel algorithm on
4, 10, and 64 processors of an ideal parallel computer using a greedy
scheduler. She claims that the three runs yielded T 4 = 80 seconds, T 10 =
42 seconds, and T 64 = 10 seconds. Argue that the professor is either lying or incompetent. ( Hint: Use the work law (26.2), the span law
(26.3), and inequality (26.5) from Exercise 26.1-4.)
26.1-7
Give a parallel algorithm to multiply an n × n matrix by an n-vector that achieves Θ( n 2/lg n) parallelism while maintaining Θ( n 2) work.
26.1-8
Analyze the work, span, and parallelism of the procedure P-TRANSPOSE, which transposes an n × n matrix A in place.
P-TRANSPOSE(A, n)
1  parallel for j = 2 to n
2      parallel for i = 1 to j − 1
3          exchange aij with aji
26.1-9
Suppose that instead of a parallel for loop in line 2, the P-TRANSPOSE
procedure in Exercise 26.1-8 had an ordinary for loop. Analyze the
work, span, and parallelism of the resulting algorithm.
26.1-10
For what number of processors do the two versions of the chess
program run equally fast, assuming that TP = T 1/ P + T∞?
26.2 Parallel matrix multiplication
In this section, we’ll explore how to parallelize the three matrix-
multiplication algorithms from Sections 4.1 and 4.2. We’ll see that each algorithm can be parallelized in a straightforward fashion using either
parallel loops or recursive spawning. We’ll analyze them using
work/span analysis, and we’ll see that each parallel algorithm attains the
same performance on one processor as its corresponding serial
algorithm, while scaling up to large numbers of processors.
A parallel algorithm for matrix multiplication using parallel loops
The first algorithm we’ll study is P-MATRIX-MULTIPLY, which
simply parallelizes the two outer loops in the procedure MATRIX-
MULTIPLY on page 81.
P-MATRIX-MULTIPLY(A, B, C, n)
1  parallel for i = 1 to n        // compute entries in each of n rows
2      parallel for j = 1 to n    // compute n entries in row i
3          for k = 1 to n
4              cij = cij + aik · bkj    // add in another term of equation (4.1)
Let’s analyze P-MATRIX-MULTIPLY. Since the serial projection of
the algorithm is just MATRIX-MULTIPLY, the work is the same as the
running time of MATRIX-MULTIPLY: T 1( n) = Θ( n 3). The span is T∞( n) = Θ( n), because it follows a path down the tree of recursion for the parallel for loop starting in line 1, then down the tree of recursion
for the parallel for loop starting in line 2, and then executes all n
iterations of the ordinary for loop starting in line 3, resulting in a total
span of Θ(lg n) + Θ(lg n) + Θ( n) = Θ( n). Thus the parallelism is Θ( n 3)/
Θ( n) = Θ( n 2). (Exercise 26.2-3 asks you to parallelize the inner loop to obtain a parallelism of Θ( n 3/lg n), which you cannot do
straightforwardly using parallel for, because you would create races.)
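To make the loop structure concrete, here is a hedged Python sketch of P-MATRIX-MULTIPLY in which the two outer loops become a pool of tasks, one per entry cij, while the k loop stays serial. The helper name and the task decomposition are choices of this sketch, not of the pseudocode:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def p_matrix_multiply(A, B, C, n):
    """Sketch of P-MATRIX-MULTIPLY: the two outer loops run "in
    parallel" as a pool of tasks over all (i, j) pairs, while the inner
    k loop stays serial. Each task updates only its own C[i][j], so the
    parallel iterations never write the same location."""
    def entry(ij):
        i, j = ij
        for k in range(n):                  # serial inner loop (line 3)
            C[i][j] += A[i][k] * B[k][j]
    with ThreadPoolExecutor() as pool:
        # Consume the iterator so all tasks finish before returning.
        list(pool.map(entry, product(range(n), range(n))))
```

Parallelizing the k loop the same way would reintroduce exactly the shared-update race that P-MAT-VEC-WRONG exhibits.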
A parallel divide-and-conquer algorithm for matrix multiplication
Section 4.1 shows how to multiply n × n matrices serially in Θ( n 3) time using a divide-and-conquer strategy. Let’s see how to parallelize that
algorithm using recursive spawning instead of calls.
The serial MATRIX-MULTIPLY-RECURSIVE procedure on page
83 takes as input three n × n matrices A, B, and C and performs the matrix calculation C = C + A · B by recursively performing eight multiplications of n/2 × n/2 submatrices of A and B. The P-MATRIX-MULTIPLY-RECURSIVE procedure on the following page
implements the same divide-and-conquer strategy, but it uses spawning
to perform the eight multiplications in parallel. To avoid determinacy
races in updating the elements of C, it creates a temporary matrix D to store four of the submatrix products. At the end, it adds C and D
together to produce the final result. (Problem 26-2 asks you to eliminate
the temporary matrix D at the expense of some parallelism.)
Lines 2–3 of P-MATRIX-MULTIPLY-RECURSIVE handle the
base case of multiplying 1 × 1 matrices. The remainder of the procedure
deals with the recursive case. Line 4 allocates a temporary matrix D, and lines 5–7 zero it. Line 8 partitions each of the four matrices A, B, C, and D into n/2 × n/2 submatrices. (As with MATRIX-MULTIPLY-RECURSIVE on page 83, we’re glossing over the subtle issue of how to
use index calculations to represent submatrix sections of a matrix.) The spawned recursive call in line 9 sets C 11 = C 11 + A 11 · B 11, so that C 11
accumulates the first of the two terms in equation (4.5) on page 82.
Similarly, lines 10–12 cause each of C 12, C 21, and C 22 in parallel to accumulate the first of the two terms in equations (4.6)–(4.8),
respectively. Line 13 sets the submatrix D 11 to the submatrix product
A 12 · B 21, so that D 11 equals the second of the two terms in equation (4.5). Lines 14–16 set each of D 12, D 21, and D 22 in parallel to the second of the two terms in equations (4.6)–(4.8), respectively. The sync
statement in line 17 ensures that all the spawned submatrix products in
lines 9–16 have been computed, after which the doubly nested parallel
for loops in lines 18–20 add the elements of D to the corresponding elements of C.
P-MATRIX-MULTIPLY-RECURSIVE(A, B, C, n)
 1  if n == 1                     // just one element in each matrix?
 2      c11 = c11 + a11 · b11
 3      return
 4  let D be a new n × n matrix   // temporary matrix
 5  parallel for i = 1 to n       // set D = 0
 6      parallel for j = 1 to n
 7          dij = 0
 8  partition A, B, C, and D into n/2 × n/2 submatrices A11, A12, A21, A22; B11, B12, B21, B22; C11, C12, C21, C22; and D11, D12, D21, D22, respectively
 9  spawn P-MATRIX-MULTIPLY-RECURSIVE(A11, B11, C11, n/2)
10  spawn P-MATRIX-MULTIPLY-RECURSIVE(A11, B12, C12, n/2)
11  spawn P-MATRIX-MULTIPLY-RECURSIVE(A21, B11, C21, n/2)
12  spawn P-MATRIX-MULTIPLY-RECURSIVE(A21, B12, C22, n/2)
13  spawn P-MATRIX-MULTIPLY-RECURSIVE(A12, B21, D11, n/2)
14  spawn P-MATRIX-MULTIPLY-RECURSIVE(A12, B22, D12, n/2)
15  spawn P-MATRIX-MULTIPLY-RECURSIVE(A22, B21, D21, n/2)
16  spawn P-MATRIX-MULTIPLY-RECURSIVE(A22, B22, D22, n/2)
17  sync                          // wait for spawned submatrix products
18  parallel for i = 1 to n       // update C = C + D
19      parallel for j = 1 to n
20          cij = cij + dij
Let’s analyze the P-MATRIX-MULTIPLY-RECURSIVE procedure. We start by analyzing the work M1(n), echoing the serial running-time analysis of its progenitor MATRIX-MULTIPLY-RECURSIVE. The recursive case allocates and zeros the temporary
matrix D in Θ( n 2) time, partitions in Θ(1) time, performs eight recursive multiplications of n/2 × n/2 matrices, and finishes up with the Θ( n 2) work from adding two n× n matrices. Thus the work outside the
spawned recursive calls is Θ( n 2), and the recurrence for the work M 1( n) becomes
M 1( n) = 8 M 1( n/2) + Θ( n 2)
= Θ( n 3)
by case 1 of the master theorem (Theorem 4.1). Not surprisingly, the
work of this parallel algorithm is asymptotically the same as the
running time of the procedure MATRIX-MULTIPLY on page 81, with
its triply nested loops.
Let’s determine the span M∞( n) of P-MATRIX-MULTIPLY-
RECURSIVE. Because the eight parallel recursive spawns all execute
on matrices of the same size, the maximum span for any recursive spawn
is just the span of a single one of them, or M∞( n/2). The span for the
doubly nested parallel for loops in lines 5–7 is Θ(lg n) because each loop
control adds Θ(lg n) to the constant span of line 7. Similarly, the doubly
nested parallel for loops in lines 18–20 add another Θ(lg n). Matrix partitioning by index calculation has Θ(1) span, which is dominated by
the Θ(lg n) span of the nested loops. We obtain the recurrence

M∞(n) = M∞(n/2) + Θ(lg n).    (26.6)

Since this recurrence falls under case 2 of the master theorem with k = 1, the solution is M∞(n) = Θ(lg2 n).
The parallelism of P-MATRIX-MULTIPLY-RECURSIVE is
M 1( n)/ M∞( n) = Θ( n 3/lg2 n), which is huge. (Problem 26-2 asks you to simplify this parallel algorithm at the expense of just a little less
parallelism.)
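To make the control flow concrete, the following is a serial projection of P-MATRIX-MULTIPLY-RECURSIVE in Python: the eight spawns become ordinary recursive calls, and submatrices are represented by index offsets (one way to handle the index-calculation issue glossed over above). The offset-based signature is an assumption of this sketch:

```python
def p_matrix_multiply_recursive(A, B, C, n, ai=0, aj=0, bi=0, bj=0, ci=0, cj=0):
    """Serial projection of P-MATRIX-MULTIPLY-RECURSIVE: performs
    C = C + A * B on the n x n submatrices whose top-left corners are
    given by the row/column offsets (n a power of 2). The eight spawns
    of the pseudocode become ordinary recursive calls here."""
    if n == 1:                        # one element in each submatrix
        C[ci][cj] += A[ai][aj] * B[bi][bj]
        return
    h = n // 2
    # Temporary matrix D receives four of the eight products, so the
    # "parallel" updates never touch the same entry of C.
    D = [[0] * n for _ in range(n)]
    # First terms: C11 += A11*B11, C12 += A11*B12, C21 += A21*B11, C22 += A21*B12.
    p_matrix_multiply_recursive(A, B, C, h, ai, aj, bi, bj, ci, cj)
    p_matrix_multiply_recursive(A, B, C, h, ai, aj, bi, bj + h, ci, cj + h)
    p_matrix_multiply_recursive(A, B, C, h, ai + h, aj, bi, bj, ci + h, cj)
    p_matrix_multiply_recursive(A, B, C, h, ai + h, aj, bi, bj + h, ci + h, cj + h)
    # Second terms: D11 = A12*B21, D12 = A12*B22, D21 = A22*B21, D22 = A22*B22.
    p_matrix_multiply_recursive(A, B, D, h, ai, aj + h, bi + h, bj, 0, 0)
    p_matrix_multiply_recursive(A, B, D, h, ai, aj + h, bi + h, bj + h, 0, h)
    p_matrix_multiply_recursive(A, B, D, h, ai + h, aj + h, bi + h, bj, h, 0)
    p_matrix_multiply_recursive(A, B, D, h, ai + h, aj + h, bi + h, bj + h, h, h)
    # Lines 18-20: C = C + D.
    for i in range(n):
        for j in range(n):
            C[ci + i][cj + j] += D[i][j]
```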
Parallelizing Strassen’s method
To parallelize Strassen’s algorithm, we can follow the same general
outline as on pages 86–87, but use spawning. You may find it helpful to
compare each step below with the corresponding step there. We’ll
analyze costs as we go along to develop recurrences T 1( n) and T∞( n) for the overall work and span, respectively.
1. If n = 1, the matrices each contain a single element. Perform a
single scalar multiplication and a single scalar addition, and
return. Otherwise, partition the input matrices A and B and
output matrix C into n/2 × n/2 submatrices, as in equation (4.2) on page 82. This step takes Θ(1) work and Θ(1) span by index
calculation.
2. Create n/2 × n/2 matrices S 1, S 2, … , S 10, each of which is the sum or difference of two submatrices from step 1. Create and
zero the entries of seven n/2× n/2 matrices P 1, P 2, … , P 7 to hold seven n/2× n/2 matrix products. All 17 matrices can be created,
and the Pi initialized, with doubly nested parallel for loops using Θ( n 2) work and Θ(lg n) span.
3. Using the submatrices from step 1 and the matrices S 1, S 2, … ,
S 10 created in step 2, recursively spawn computations of each of
the seven n/2 × n/2 matrix products P 1, P 2, … , P 7, taking 7 T 1( n/2) work and T∞( n/2) span.
4. Update the four submatrices C 11, C 12, C 21, C 22 of the result matrix C by adding or subtracting various Pi matrices. Using
doubly nested parallel for loops, computing all four submatrices
takes Θ( n 2) work and Θ(lg n) span.
Let’s analyze this algorithm. Since the serial projection is the same as
the original serial algorithm, the work is just the running time of the
serial projection, namely, Θ(n^lg 7). As we did with P-MATRIX-
MULTIPLY-RECURSIVE, we can devise a recurrence for the span. In
this case, seven recursive calls execute in parallel, but since they all
operate on matrices of the same size, we obtain the same recurrence
(26.6) as we did for P-MATRIX-MULTIPLY-RECURSIVE, with
solution Θ(lg2 n). Thus the parallel version of Strassen’s method has parallelism Θ(n^lg 7/lg2 n), which is large. Although the parallelism is slightly less than that of P-MATRIX-MULTIPLY-RECURSIVE, that’s
just because the work is also less.
Exercises
26.2-1
Draw the trace for computing P-MATRIX-MULTIPLY on 2 × 2
matrices, labeling how the vertices in your diagram correspond to
strands in the execution of the algorithm. Assuming that each strand
executes in unit time, analyze the work, span, and parallelism of this
computation.
26.2-2
Repeat Exercise 26.2-1 for P-MATRIX-MULTIPLY-RECURSIVE.
26.2-3
Give pseudocode for a parallel algorithm that multiplies two n × n matrices with work Θ( n 3) but span only Θ(lg n). Analyze your algorithm.
26.2-4
Give pseudocode for an efficient parallel algorithm that multiplies a p ×
q matrix by a q × r matrix. Your algorithm should be highly parallel even if any of p, q, and r equal 1. Analyze your algorithm.
26.2-5
Give pseudocode for an efficient parallel version of the Floyd-Warshall
algorithm (see Section 23.2), which computes shortest paths between all pairs of vertices in an edge-weighted graph. Analyze your algorithm.
26.3 Parallel merge sort
We first saw serial merge sort in Section 2.3.1, and in Section 2.3.2 we analyzed its running time and showed it to be Θ(n lg n). Because merge
sort already uses the divide-and-conquer method, it seems like a terrific
candidate for implementing using fork-join parallelism.
The procedure P-MERGE-SORT modifies merge sort to spawn the
first recursive call. Like its serial counterpart MERGE-SORT on page
39, the P-MERGE-SORT procedure sorts the subarray A[ p : r]. After the sync statement in line 8 ensures that the two recursive spawns in
lines 5 and 7 have finished, P-MERGE-SORT calls the P-MERGE
procedure, a parallel merging algorithm, which is on page 779, but you
don’t need to bother looking at it right now.
P-MERGE-SORT(A, p, r)
 1  if p ≥ r                    // zero or one element?
 2      return
 3  q = ⌊(p + r)/2⌋             // midpoint of A[p : r]
 4  // Recursively sort A[p : q] in parallel.
 5  spawn P-MERGE-SORT(A, p, q)
 6  // Recursively sort A[q + 1 : r] in parallel.
 7  spawn P-MERGE-SORT(A, q + 1, r)
 8  sync                        // wait for spawns
 9  // Merge A[p : q] and A[q + 1 : r] into A[p : r].
10  P-MERGE(A, p, q, r)
First, let’s use work/span analysis to get some intuition for why we
need a parallel merge procedure. After all, it may seem as though there
should be plenty of parallelism just by parallelizing MERGE-SORT
without worrying about parallelizing the merge. But what would happen
if the call to P-MERGE in line 10 of P-MERGE-SORT were replaced
by a call to the serial MERGE procedure on page 36? Let’s call the
pseudocode so modified P-NAIVE-MERGE-SORT.
Let T 1( n) be the (worst-case) work of P-NAIVE-MERGE-SORT on
an n-element subarray, where n = r − p + 1 is the number of elements in A[ p : r], and let T∞( n) be the span. Because MERGE is serial with running time Θ( n), both its work and span are Θ( n). Since the serial projection of P-NAIVE-MERGE-SORT is exactly MERGE-SORT, its
work is T 1( n) = Θ( n lg n). The two recursive calls in lines 5 and 7 run in parallel, and so its span is given by the recurrence
T∞(n) = T∞(n/2) + Θ(n)
= Θ(n),
by case 3 of the master theorem. Thus the parallelism of P-NAIVE-MERGE-SORT is T1(n)/T∞(n) = Θ(lg n), which is an unimpressive amount of parallelism. To sort a million elements, for example, since lg 10^6 ≈ 20, it might achieve linear speedup on a few processors, but it would not scale up to dozens of processors.
The parallelism bottleneck in P-NAIVE-MERGE-SORT is plainly
the MERGE procedure. If we asymptotically reduce the span of
merging, the master theorem dictates that the span of parallel merge
sort will also get smaller. When you look at the pseudocode for
MERGE, it may seem that merging is inherently serial, but it’s not. We can fashion a parallel merging algorithm. The goal is to reduce the span
of parallel merging asymptotically, but if we want an efficient parallel
algorithm, we must ensure that the Θ( n) bound on work doesn’t
increase.
Figure 26.6 depicts the divide-and-conquer strategy that we’ll use in P-MERGE. The heart of the algorithm is a recursive auxiliary
procedure P-MERGE-AUX that merges two sorted subarrays of an
array A into a subarray of another array B in parallel. Specifically, P-MERGE-AUX merges A[ p 1 : r 1] and A[ p 2 : r 2] into subarray B[ p 3 : r 3], where r 3 = p 3 + ( r 1 − p 1 + 1) + ( r 2 − p 2 + 1) − 1 = p 3 + ( r 1 − p 1) + ( r 2
− p 2) + 1.
The key idea of the recursive merging algorithm in P-MERGE-AUX
is to split each of the two sorted subarrays of A around a pivot x, such that all the elements in the lower part of each subarray are at most x
and all the elements in the upper part of each subarray are at least x.
The procedure can then recurse in parallel on two subtasks: merging the
two lower parts, and merging the two upper parts. The trick is to find a
pivot x so that the recursion is not too lopsided. We don’t want a situation such as that in QUICKSORT on page 183, where bad
partitioning elements lead to a dramatic loss of asymptotic efficiency.
We could opt to partition around a random element, as
RANDOMIZED-QUICKSORT on page 192 does, but because the
input subarrays are sorted, P-MERGE-AUX can quickly determine a
pivot that always works well.
Specifically, the recursive merging algorithm picks the pivot x as the
middle element of the larger of the two input subarrays, which we can
assume without loss of generality is A[p1 : r1], since otherwise, the two subarrays can just switch roles. That is, x = A[q1], where q1 = ⌊(p1 + r1)/2⌋. Because A[p1 : r1] is sorted, x is a median of the subarray elements: every element in A[p1 : q1 − 1] is no more than x, and every element in A[q1 + 1 : r1] is no less than x. Then the algorithm finds the
“split point” q 2 in the smaller subarray A[ p 2 : r 2] such that all the
elements in A[ p 2 : q 2−1] (if any) are at most x and all the elements in A[ q 2 : r 2] (if any) are at least x. Intuitively, the subarray A[ p 2 : r 2] would still be sorted if x were inserted between A[ q 2−1] and A[ q 2] (although the algorithm doesn’t do that). Since A[ p 2 : r 2] is sorted, a minor variant of binary search (see Exercise 2.3-6) with x as the search key can find
the split point q 2 in Θ(lg n) time in the worst case. As we’ll see when we get to the analysis, even if x splits A[ p 2 : r 2] badly— x is either smaller than all the subarray elements or larger—we’ll still have at least 1/4 of
the elements in each of the two recursive merges. Thus the larger of the recursive merges operates on at most 3n/4 elements, and the recursion is guaranteed to bottom out after Θ(lg n) recursive calls.
Figure 26.6 The idea behind P-MERGE-AUX, which merges two sorted subarrays A[ p 1 : r 1]
and A[ p 2 : r 2] into the subarray B[ p 3 : r 3] in parallel. Letting x = A[ q 1] (shown in yellow) be a median of A[ p 1 : r 1] and q 2 be a place in A[ p 2 : r 2] such that x would fall between A[ q 2 − 1] and A[ q 2], every element in the subarrays A[ p 1 : q 1 − 1] and A[ p 2 : q 2 − 1] (shown in orange) is at most x, and every element in the subarrays A[ q 1 + 1 : r 1] and A[ q 2 + 1 : r 2] (shown in blue) is at least x. To merge, compute the index q 3 where x belongs in B[ p 3 : r 3], copy x into B[ q 3], and then recursively merge A[ p 1 : q 1 − 1] with A[ p 2 : q 2 − 1] into B[ p 3 : q 3 − 1] and A[ q 1 + 1 : r 1]
with A[ q 2 : r 2] into B[ q 3 + 1 : r 3].
Now let’s put these ideas into pseudocode. We start with the serial
procedure FIND-SPLIT-POINT ( A, p, r, x) on the next page, which takes as input a sorted subarray A[ p : r] and a key x. The procedure returns a split point of A[ p : r]: an index q in the range p ≤ q ≤ r + 1 such
that all the elements in A[ p : q − 1] (if any) are at most x and all the elements in A[ q : r] (if any) are at least x.
The FIND-SPLIT-POINT procedure uses binary search to find the
split point. Lines 1 and 2 establish the range of indices for the search.
Each time through the while loop, line 5 compares the middle element of
the range with the search key x, and lines 6 and 7 narrow the search range to either the lower half or the upper half of the subarray,
depending on the result of the test. In the end, after the range has been
narrowed to a single index, line 8 returns that index as the split point.
FIND-SPLIT-POINT(A, p, r, x)
1  low = p                     // low end of search range
2  high = r + 1                // high end of search range
3  while low < high            // more than one element?
4      mid = ⌊(low + high)/2⌋  // midpoint of range
5      if x ≤ A[mid]           // is answer q ≤ mid?
6          high = mid          // narrow search to A[low : mid]
7      else low = mid + 1      // narrow search to A[mid + 1 : high]
8  return low
Because FIND-SPLIT-POINT contains no parallelism, its span is
just its serial running time, which is also its work. On a subarray A[ p : r]
of size n = r − p + 1, each iteration of the while loop halves the search range, which means that the loop terminates after Θ(lg n) iterations.
Since each iteration takes constant time, the algorithm runs in Θ(lg n)
(worst-case) time. Thus the procedure has work and span Θ(lg n).
Let’s now look at the pseudocode for the parallel merging procedure
P-MERGE on the next page. Most of the pseudocode is devoted to the
recursive procedure P-MERGE-AUX. The procedure P-MERGE itself
is just a “wrapper” that sets up for P-MERGE-AUX. It allocates a new
array B[ p : r] to hold the output of P-MERGE-AUX in line 1. It then calls P-MERGE-AUX in line 2, passing the indices of the two subarrays
to be merged and providing B as the output destination of the merged
result, starting at index p. After P-MERGE-AUX returns, lines 3–4
perform a parallel copy of the output B[ p : r] into the subarray A[ p : r], which is where P-MERGE-SORT expects it.
The P-MERGE-AUX procedure is the interesting part of the
algorithm. Let’s start by understanding the parameters of this recursive
parallel procedure. The input array A and the four indices p 1, r 1, p 2, r 2
specify the subarrays A[ p 1 : r 1] and A[ p 2 : r 2] to be merged. The array B
and the index p 3 indicate that the merged result should be stored into
B[ p 3 : r 3], where r 3 = p 3 + ( r 1 − p 1)+ ( r 2 − p 2)+ 1, as we saw earlier.
The end index r 3 of the output subarray is not needed by the
pseudocode, but it helps conceptually to name the end index, as in the
comment in line 13.
The procedure begins by checking the base case of the recursion and
doing some bookkeeping to simplify the rest of the pseudocode. Lines 1
and 2 test whether the two subarrays are both empty, in which case the
procedure returns. Line 3 checks whether the first subarray contains
fewer elements than the second subarray. Since the number of elements
in the first subarray is r 1 − p 1 + 1 and the number in the second subarray is r 2 − p 2 + 1, the test omits the two “+1’s.” If the first subarray is the smaller of the two, lines 4 and 5 switch the roles of the
subarrays so that A[ p 1, r 1] refers to the larger subarray for the balance of the procedure.
P-MERGE(A, p, q, r)
1  let B[p : r] be a new array             // allocate scratch array
2  P-MERGE-AUX(A, p, q, q + 1, r, B, p)    // merge from A into B
3  parallel for i = p to r                 // copy B back to A in parallel
4      A[i] = B[i]

P-MERGE-AUX(A, p1, r1, p2, r2, B, p3)
 1  if p1 > r1 and p2 > r2                 // are both subarrays empty?
 2      return
 3  if r1 − p1 < r2 − p2                   // second subarray bigger?
 4      exchange p1 with p2                // swap subarray roles
 5      exchange r1 with r2
 6  q1 = ⌊(p1 + r1)/2⌋                     // midpoint of A[p1 : r1]
 7  x = A[q1]                              // median of A[p1 : r1] is pivot x
 8  q2 = FIND-SPLIT-POINT(A, p2, r2, x)    // split A[p2 : r2] around x
 9  q3 = p3 + (q1 − p1) + (q2 − p2)        // where x belongs in B …
10  B[q3] = x                              // … put it there
11  // Recursively merge A[p1 : q1 − 1] and A[p2 : q2 − 1] into B[p3 : q3 − 1].
12  spawn P-MERGE-AUX(A, p1, q1 − 1, p2, q2 − 1, B, p3)
13  // Recursively merge A[q1 + 1 : r1] and A[q2 : r2] into B[q3 + 1 : r3].
14  spawn P-MERGE-AUX(A, q1 + 1, r1, q2, r2, B, q3 + 1)
15  sync                                   // wait for spawns
We’re now at the crux of P-MERGE-AUX: implementing the
parallel divide-and-conquer strategy. As we continue our pseudocode
walk, you may find it helpful to refer again to Figure 26.6.
First the divide step. Line 6 computes the midpoint q 1 of A[ p 1 : r 1], which indexes a median x = A[ q 1] of this subarray to be used as the pivot, and line 7 determines x itself. Next, line 8 uses the FIND-SPLIT-POINT procedure to find the index q 2 in A[ p 2 : r 2] such that all elements in A[ p 2 : q 2 − 1] are at most x and all the elements in A[ q 2 : r 2]
are at least x. Line 9 computes the index q 3 of the element that divides the output subarray B[ p 3 : r 3] into B[ p 3 : q 3 − 1] and B[ q 3 + 1 : r 3], and then line 10 puts x directly into B[ q 3], which is where it belongs in the output.
Next is the conquer step, which is where the parallel recursion occurs. Lines 12 and 14 each spawn P-MERGE-AUX to recursively
merge from A into B, the first to merge the smaller elements and the second to merge the larger elements. The sync statement in line 15
ensures that the subproblems finish before the procedure returns.
There is no combine step, as B[ p : r] already contains the correct sorted output.
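A serial projection of the merge in Python (0-indexed, inclusive bounds; the library routine bisect_left stands in for FIND-SPLIT-POINT, and the spawns become plain recursive calls) might look like this:

```python
from bisect import bisect_left   # plays the role of FIND-SPLIT-POINT

def p_merge_aux(A, p1, r1, p2, r2, B, p3):
    """Serial projection of P-MERGE-AUX, translated to 0-indexed,
    inclusive bounds: merge sorted A[p1..r1] and A[p2..r2] into B
    starting at index p3. The two recursive calls are the spawns of
    the fork-join version."""
    if p1 > r1 and p2 > r2:              # both subarrays empty?
        return
    if r1 - p1 < r2 - p2:                # first subarray smaller?
        p1, p2 = p2, p1                  # swap roles so A[p1..r1]
        r1, r2 = r2, r1                  # is the larger one
    q1 = (p1 + r1) // 2                  # median of larger subarray
    x = A[q1]                            # the pivot
    q2 = bisect_left(A, x, p2, r2 + 1)   # split point in A[p2..r2]
    q3 = p3 + (q1 - p1) + (q2 - p2)      # where x belongs in B
    B[q3] = x
    p_merge_aux(A, p1, q1 - 1, p2, q2 - 1, B, p3)   # lower halves
    p_merge_aux(A, q1 + 1, r1, q2, r2, B, q3 + 1)   # upper halves

def p_merge(A, p, q, r):
    """Wrapper mirroring P-MERGE: merge sorted A[p..q] and A[q+1..r]
    in place via a scratch array B."""
    B = [None] * len(A)
    p_merge_aux(A, p, q, q + 1, r, B, p)
    A[p:r + 1] = B[p:r + 1]
```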
Work/span analysis of parallel merging
Let’s first analyze the worst-case span T∞( n) of P-MERGE-AUX on input subarrays that together contain a total of n elements. The call to
FIND-SPLIT-POINT in line 8 contributes Θ(lg n) to the span in the
worst case, and the procedure performs at most a constant amount of
additional serial work outside of the two recursive spawns in lines 12
and 14.
Because the two recursive spawns operate logically in parallel, only
one of them contributes to the overall worst-case span. We claimed
earlier that neither recursive invocation ever operates on more than 3 n/4
elements. Let’s see why. Let n 1 = r 1 − p 1 + 1 and n 2 = r 2 − p 2 + 1, where n = n 1 + n 2, be the sizes of the two subarrays when line 6 starts executing, that is, after we have established that n 2 ≤ n 1 by swapping the roles of the two subarrays, if necessary. Since the pivot x is a median of
A[p1 : r1], in the worst case, a recursive merge involves at most n1/2
elements of A[ p 1 : r 1], but it might involve all n 2 of the elements of A[ p 2
: r 2]. Thus we can bound the number of elements involved in a recursive
invocation of P-MERGE-AUX by
n 1/2 + n 2 = (2 n 1 + 4 n 2)/4
≤ (3 n 1 + 3 n 2)/4 (since n 2 ≤ n 1)
= 3 n/4,
proving the claim.
The worst-case span of P-MERGE-AUX can therefore be described by the following recurrence:

T∞(n) = T∞(3n/4) + Θ(lg n).    (26.7)

Because this recurrence falls under case 2 of the master theorem with k = 1, its solution is T∞(n) = Θ(lg2 n).
Now let’s verify that the work T 1( n) of P-MERGE-AUX on n elements is linear. A lower bound of Ω( n) is straightforward, since each
of the n elements is copied from array A to array B. We’ll show that T 1( n) = O( n) by deriving a recurrence for the worst-case work. The binary search in line 8 costs Θ(lg n) in the worst case, which dominates
the other work outside of the recursive spawns. For the recursive
spawns, observe that although lines 12 and 14 might merge different numbers of elements, the two recursive spawns together merge at most n − 1 elements (since the pivot x = A[q1] is not merged). Moreover, as we saw when analyzing the span, a recursive spawn operates on at most 3n/4 elements.
We therefore obtain the recurrence

T1(n) = T1(αn) + T1((1 − α)n) + Θ(lg n),    (26.8)

where α lies in the range 1/4 ≤ α ≤ 3/4. The value of α can vary from one recursive invocation to another.
We’ll use the substitution method (see Section 4.3) to prove that the above recurrence (26.8) has solution T 1( n) = O( n). (You could also use the Akra-Bazzi method from Section 4.7.) Assume that T 1( n) ≤ c 1 n − c 2
lg n for some positive constants c 1 and c 2. Using the properties of logarithms on pages 66–67—in particular, to deduce that lg α + lg(1 −
α) = −Θ(1)—substitution yields
T 1( n) ≤ ( c 1 αn − c 2 lg( αn)) + ( c 1(1 − α) n − c 2 lg((1 − α) n)) + Θ(lg n)
= c 1( α + (1 − α)) n − c 2(lg( αn) + lg((1 − α) n)) + Θ(lg n)
= c 1 n − c 2(lg α + lg n + lg(1 − α) + lg n) + Θ(lg n)
= c 1 n − c 2 lg n − c 2(lg n + lg α + lg(1 − α)) + Θ(lg n)
= c 1 n − c 2 lg n − c 2(lg n − Θ(1)) + Θ(lg n)
≤ c 1 n − c 2 lg n,
if we choose c 2 large enough that the c 2(lg n − Θ(1)) term dominates the Θ(lg n) term for sufficiently large n. Furthermore, we can choose c 1
large enough to satisfy the implied Θ(1) base cases of the recurrence, completing the induction. The lower and upper bounds of Ω(n) and O(n) give T1(n) = Θ(n), asymptotically the same work as for serial merging.
The execution of the pseudocode in the P-MERGE procedure itself does not add asymptotically to the work and span of P-MERGE-AUX. The parallel for loop in lines 3–4 has Θ(lg n) span due to the loop control, and each iteration runs in constant time. Thus the Θ(lg² n) span of P-MERGE-AUX dominates, yielding Θ(lg² n) span overall for P-MERGE. The parallel for loop contains Θ(n) work, matching the asymptotic work of P-MERGE-AUX and yielding Θ(n) work overall for P-MERGE.
Analysis of parallel merge sort
The "heavy lifting" is done. Now that we have determined the work and span of P-MERGE, we can analyze P-MERGE-SORT. Let T1(n) and T∞(n) be the work and span, respectively, of P-MERGE-SORT on an array of n elements. The call to P-MERGE in line 10 of P-MERGE-SORT dominates the costs of lines 1–3, for both work and span. Thus we obtain the recurrence

T1(n) = 2T1(n/2) + Θ(n)

for the work of P-MERGE-SORT, and we obtain the recurrence

T∞(n) = T∞(n/2) + Θ(lg² n)

for its span. The work recurrence has solution T1(n) = Θ(n lg n) by case 2 of the master theorem with k = 0. The span recurrence has solution T∞(n) = Θ(lg³ n), also by case 2 of the master theorem, but with k = 2.
Parallel merging gives P-MERGE-SORT a parallelism advantage over P-NAIVE-MERGE-SORT. The parallelism of P-NAIVE-MERGE-SORT, which calls the serial MERGE procedure, is only Θ(lg n). For P-MERGE-SORT, the parallelism is

T1(n)/T∞(n) = Θ(n lg n)/Θ(lg³ n)
            = Θ(n/lg² n),
which is much better, both in theory and in practice. A good
implementation in practice would sacrifice some parallelism by
coarsening the base case in order to reduce the constants hidden by the
asymptotic notation. For example, you could switch to an efficient serial
sort, perhaps quicksort, when the number of elements to be sorted is
sufficiently small.
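To make the coarsening idea concrete, here is a minimal Python sketch of P-MERGE-SORT with a coarsened base case. The names and the GRAIN threshold are our own choices, not from the text, and the calls marked "would be: spawn" run serially here; by the serial-projection property, the serial execution computes the same result the parallel one would.

```python
import bisect

GRAIN = 8  # coarsening threshold (an arbitrary illustrative choice)

def p_merge(A, B):
    """Divide-and-conquer merge in the style of P-MERGE. The two
    recursive calls could be spawned in parallel; running them
    serially computes the identical result."""
    if len(A) < len(B):
        A, B = B, A                      # ensure A is the larger subarray
    if not A:
        return []                        # both subarrays are empty
    q = len(A) // 2                      # median position of larger subarray
    r = bisect.bisect_left(B, A[q])      # binary search, Theta(lg n) worst case
    left = p_merge(A[:q], B[:r])         # would be: spawn
    right = p_merge(A[q + 1:], B[r:])
    return left + [A[q]] + right         # implicit sync before combining

def p_merge_sort(A):
    if len(A) <= GRAIN:
        return sorted(A)                 # coarsened base case: serial sort
    mid = len(A) // 2
    L = p_merge_sort(A[:mid])            # would be: spawn
    R = p_merge_sort(A[mid:])
    return p_merge(L, R)                 # implicit sync before merging
```

In a real implementation the threshold would be tuned empirically; the slicing here also copies subarrays, which a production version would replace with index arithmetic into shared arrays, as the pseudocode does.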
Exercises
26.3-1
Explain how to coarsen the base case of P-MERGE.
26.3-2
Instead of finding a median element in the larger subarray, as P-
MERGE does, suppose that the merge procedure finds a median of all
the elements in the two sorted subarrays using the result of Exercise 9.3-
10. Give pseudocode for an efficient parallel merging procedure that
uses this median-finding procedure. Analyze your algorithm.
26.3-3
Give an efficient parallel algorithm for partitioning an array around a
pivot, as is done by the PARTITION procedure on page 184. You need
not partition the array in place. Make your algorithm as parallel as
possible. Analyze your algorithm. ( Hint: You might need an auxiliary
array and might need to make more than one pass over the input
elements.)
26.3-4
Give a parallel version of FFT on page 890. Make your implementation
as parallel as possible. Analyze your algorithm.
26.3-5
Show how to parallelize SELECT from Section 9.3. Make your implementation as parallel as possible. Analyze your algorithm.
Problems
26-1 Implementing parallel loops using recursive spawning
Consider the parallel procedure SUM-ARRAYS for performing pairwise addition on n-element arrays A[1 : n] and B[1 : n], storing the sums in C[1 : n].

SUM-ARRAYS(A, B, C, n)
1  parallel for i = 1 to n
2      C[i] = A[i] + B[i]
a. Rewrite the parallel loop in SUM-ARRAYS using recursive spawning
in the manner of P-MAT-VEC-RECURSIVE. Analyze the
parallelism.
Consider another implementation of the parallel loop in SUM-ARRAYS given by the procedure SUM-ARRAYS′, where the value grain-size must be specified.

SUM-ARRAYS′(A, B, C, n)
1  grain-size = ?    // to be determined
2  r = ⌈n/grain-size⌉
3  for k = 0 to r − 1
4      spawn ADD-SUBARRAY(A, B, C, k · grain-size + 1, min{(k + 1) · grain-size, n})
5  sync

ADD-SUBARRAY(A, B, C, i, j)
1  for k = i to j
2      C[k] = A[k] + B[k]
b. Suppose that you set grain-size = 1. What is the resulting parallelism?
c. Give a formula for the span of SUM-ARRAYS′ in terms of n and grain-size. Derive the best value for grain-size to maximize parallelism.
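As a working hypothesis for part (c) (our reasoning, not the book's stated answer): the serial spawning loop contributes about r = ⌈n/grain-size⌉ to the span, and the last subarray to finish contributes about grain-size, so the span behaves like n/grain-size + grain-size, which is minimized when grain-size is near √n. A small numeric check:

```python
import math

def span_units(n, grain_size):
    """Abstract span of SUM-ARRAYS': one unit per spawned task in the
    serial for loop (r = ceil(n/grain_size) of them), plus grain_size
    units of serial work in the longest-running ADD-SUBARRAY call."""
    r = math.ceil(n / grain_size)
    return r + grain_size

n = 10_000
best = min(range(1, n + 1), key=lambda g: span_units(n, g))
# the minimizing grain-size sits at sqrt(n) = 100 for this n
```

Constant factors are ignored here, so this only illustrates the shape of the tradeoff, not an exact tuning rule.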
26-2 Avoiding a temporary matrix in recursive matrix multiplication
The P-MATRIX-MULTIPLY-RECURSIVE procedure on page 772 must allocate a temporary matrix D of size n × n, which can adversely affect the constants hidden by the Θ-notation. The procedure has high parallelism, however: Θ(n³/lg² n). For example, ignoring the constants in the Θ-notation, the parallelism for multiplying 1000 × 1000 matrices comes to approximately 1000³/10² = 10⁷, since lg 1000 ≈ 10. Most parallel computers have far fewer than 10 million processors.
a. Parallelize MATRIX-MULTIPLY-RECURSIVE without using temporary matrices so that it retains its Θ(n³) work. (Hint: Spawn the recursive calls, but insert a sync in a judicious location to avoid races.)
b. Give and solve recurrences for the work and span of your
implementation.
c. Analyze the parallelism of your implementation. Ignoring the
constants in the Θ-notation, estimate the parallelism on 1000 × 1000
matrices. Compare with the parallelism of P-MATRIX-MULTIPLY-
RECURSIVE, and discuss whether the trade-off would be
worthwhile.
26-3 Parallel matrix algorithms
Before attempting this problem, it may be helpful to read Chapter 28.
a. Parallelize the LU-DECOMPOSITION procedure on page 827 by
giving pseudocode for a parallel version of this algorithm. Make your
implementation as parallel as possible, and analyze its work, span,
and parallelism.
b. Do the same for LUP-DECOMPOSITION on page 830.
c. Do the same for LUP-SOLVE on page 824.
d. Using equation (28.14) on page 835, write pseudocode for a parallel algorithm to invert a symmetric positive-definite matrix. Make your
implementation as parallel as possible, and analyze its work, span,
and parallelism.
26-4 Parallel reductions and scan (prefix) computations
A ⊗-reduction of an array x[1 : n], where ⊗ is an associative operator, is the value y = x[1] ⊗ x[2] ⊗ ⋯ ⊗ x[n]. The REDUCE procedure computes the ⊗-reduction of a subarray x[i : j] serially.

REDUCE(x, i, j)
1  y = x[i]
2  for k = i + 1 to j
3      y = y ⊗ x[k]
4  return y
a. Design and analyze a parallel algorithm P-REDUCE that uses
recursive spawning to perform the same function with Θ( n) work and
Θ(lg n) span.
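One possible design for part (a), sketched in Python with our own names (the spawned call runs serially here, which computes the same result): split at the midpoint, reduce each half recursively, and combine after the sync.

```python
import operator

def p_reduce(x, i, j, op=operator.add):
    """Recursive-spawning reduction over x[i : j] (inclusive, 0-indexed).
    Splitting at the midpoint gives Theta(n) work and, with the first
    recursive call spawned in parallel, Theta(lg n) span. op must be
    associative for the regrouping to be valid."""
    if i == j:
        return x[i]
    k = (i + j) // 2
    left = p_reduce(x, i, k, op)       # would be: spawn
    right = p_reduce(x, k + 1, j, op)
    return op(left, right)             # implicit sync before combining
```

For example, p_reduce over 1..10 with addition yields the same value as REDUCE, but the recursion tree has depth Θ(lg n) instead of a length-n chain.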
A related problem is that of computing a ⊗-scan, sometimes called a ⊗-prefix computation, on an array x[1 : n], where ⊗ is once again an associative operator. The ⊗-scan, implemented by the serial procedure SCAN, produces the array y[1 : n] given by

y[1] = x[1],
y[2] = x[1] ⊗ x[2],
y[3] = x[1] ⊗ x[2] ⊗ x[3],
  ⋮
y[n] = x[1] ⊗ x[2] ⊗ x[3] ⊗ ⋯ ⊗ x[n],

that is, all prefixes of the array x "summed" using the ⊗ operator.
SCAN(x, n)
1  let y[1 : n] be a new array
2  y[1] = x[1]
3  for i = 2 to n
4      y[i] = y[i − 1] ⊗ x[i]
5  return y
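The serial SCAN corresponds directly to Python's itertools.accumulate, which applies a binary operator across all prefixes; a brief check of the correspondence:

```python
from itertools import accumulate
import operator

def scan(x, op=operator.add):
    """Serial scan: returns y with y[i] = x[0] op x[1] op ... op x[i]
    for every prefix of x, exactly as the SCAN pseudocode computes."""
    return list(accumulate(x, op))
```

As with REDUCE, the operator must be associative for the parallel variants discussed below to produce the same output as this serial version.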
Parallelizing SCAN is not straightforward. For example, simply
changing the for loop to a parallel for loop would create races, since
each iteration of the loop body depends on the previous iteration. The
procedures P-SCAN-1 and P-SCAN-1-AUX perform the ⊗-scan in
parallel, albeit inefficiently.
P-SCAN-1(x, n)
1  let y[1 : n] be a new array
2  P-SCAN-1-AUX(x, y, 1, n)
3  return y

P-SCAN-1-AUX(x, y, i, j)
1  parallel for l = i to j
2      y[l] = P-REDUCE(x, 1, l)
b. Analyze the work, span, and parallelism of P-SCAN-1.
The procedures P-SCAN-2 and P-SCAN-2-AUX use recursive
spawning to perform a more efficient ⊗-scan.
P-SCAN-2(x, n)
1  let y[1 : n] be a new array
2  P-SCAN-2-AUX(x, y, 1, n)
3  return y

P-SCAN-2-AUX(x, y, i, j)
1  if i == j
2      y[i] = x[i]
3  else k = ⌊(i + j)/2⌋
4      spawn P-SCAN-2-AUX(x, y, i, k)
5      P-SCAN-2-AUX(x, y, k + 1, j)
6      sync
7      parallel for l = k + 1 to j
8          y[l] = y[k] ⊗ y[l]
c. Argue that P-SCAN-2 is correct, and analyze its work, span, and
parallelism.
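A serial Python rendering of P-SCAN-2 can help when working through part (c); the comments mark where the parallel version would spawn and sync. Names and 0-based indexing are ours.

```python
import operator

def p_scan_2(x, op=operator.add):
    """Serial rendering of P-SCAN-2: scan each half of x[i : j]
    recursively, then combine y[k] into every entry of the right half
    (the parallel for loop of lines 7-8 of P-SCAN-2-AUX)."""
    y = [None] * len(x)

    def aux(i, j):
        if i == j:
            y[i] = x[i]
            return
        k = (i + j) // 2
        aux(i, k)                      # would be: spawn
        aux(k + 1, j)
        # sync, then the fix-up pass over the right half:
        for l in range(k + 1, j + 1):
            y[l] = op(y[k], y[l])

    if x:
        aux(0, len(x) - 1)
    return y
```

After the two recursive calls, y[i..k] already holds correct prefix values and y[k+1..j] holds prefixes of the right half alone, so prepending y[k] by associativity makes every entry correct; this is the heart of the correctness argument asked for in part (c).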
To improve on both P-SCAN-1 and P-SCAN-2, perform the ⊗-scan
in two distinct passes over the data. The first pass gathers the terms for
various contiguous subarrays of x into a temporary array t, and the second pass uses the terms in t to compute the final result y. The pseudocode in the procedures P-SCAN-3, P-SCAN-UP, and P-SCAN-DOWN on the facing page implements this strategy, but certain
expressions have been omitted.
d. Fill in the three missing expressions in line 8 of P-SCAN-UP and
lines 5 and 6 of P-SCAN-DOWN. Argue that with the expressions
you supplied, P-SCAN-3 is correct. ( Hint: Prove that the value v
passed to P-SCAN-DOWN ( v, x, t, y, i, j) satisfies v = x[1] ⊗ x[2] ⊗
⋯ ⊗ x[ i − 1].)
e. Analyze the work, span, and parallelism of P-SCAN-3.
f. Describe how to rewrite P-SCAN-3 so that it doesn’t require the use
of the temporary array t.
★ g. Give an algorithm P-SCAN-4( x, n) for a scan that operates in place. It should place its output in x and require only constant
auxiliary storage.
h. Describe an efficient parallel algorithm that uses a +-scan to
determine whether a string of parentheses is well formed. For
example, the string ( ( ) ( ) ) ( ) is well formed, but the string ( ( ) ) ) ( ( )
is not. ( Hint: Interpret ( as a 1 and ) as a −1, and then perform a +-
scan.)
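Following the hint in part (h), a +-scan solves the parentheses problem: a string is well formed exactly when every prefix sum of the ±1 values is nonnegative and the total is zero. A minimal sketch (the scan is written serially here; it, and the min/last reductions over its output, are what would run in parallel):

```python
from itertools import accumulate

def well_formed(s):
    """Map '(' to +1 and ')' to -1, take a +-scan of the values, and
    check that no prefix sum goes negative and the final sum is zero."""
    vals = [1 if c == '(' else -1 for c in s]
    prefix = list(accumulate(vals))
    if not prefix:
        return True            # the empty string is well formed
    return min(prefix) >= 0 and prefix[-1] == 0
```

On the text's examples, "(()())()" passes and "(()))(()" fails, because the latter's prefix sums dip below zero after the fifth character.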
P-SCAN-3(x, n)
1  let y[1 : n] and t[1 : n] be new arrays
2  y[1] = x[1]
3  if n > 1
4      P-SCAN-UP(x, t, 2, n)
5      P-SCAN-DOWN(x[1], x, t, y, 2, n)
6  return y
P-SCAN-UP(x, t, i, j)
1  if i == j
2      return x[i]
3  else
4      k = ⌊(i + j)/2⌋
5      t[k] = spawn P-SCAN-UP(x, t, i, k)
6      right = P-SCAN-UP(x, t, k + 1, j)
7      sync
8      return ____    // fill in the blank
P-SCAN-DOWN(v, x, t, y, i, j)
1  if i == j
2      y[i] = v ⊗ x[i]
3  else
4      k = ⌊(i + j)/2⌋
5      spawn P-SCAN-DOWN(____, x, t, y, i, k)    // fill in the blank
6      P-SCAN-DOWN(____, x, t, y, k + 1, j)      // fill in the blank
7      sync
26-5 Parallelizing a simple stencil calculation
Computational science is replete with algorithms that require the entries
of an array to be filled in with values that depend on the values of
certain already computed neighboring entries, along with other
information that does not change over the course of the computation.
The pattern of neighboring entries does not change during the
computation and is called a stencil. For example, Section 14.4 presents a stencil algorithm to compute a longest common subsequence, where the value in entry c[i, j] depends only on the values in c[i − 1, j], c[i, j − 1], and c[i − 1, j − 1], as well as the elements xi and yj within the two sequences given as inputs. The input sequences are fixed, but the algorithm fills in the two-dimensional array c so that it computes entry c[i, j] after computing all three entries c[i − 1, j], c[i, j − 1], and c[i − 1, j − 1].
This problem examines how to use recursive spawning to parallelize a simple stencil calculation on an n × n array A in which the value placed into entry A[i, j] depends only on values in A[i′, j′], where i′ ≤ i and j′ ≤ j (and of course, i′ ≠ i or j′ ≠ j). In other words, the value in an entry depends only on values in entries that are above it and/or to its left, along with static information outside of the array. Furthermore, we assume throughout this problem that once the entries upon which A[i, j] depends have been filled in, the entry A[i, j] can be computed in Θ(1) time (as in the LCS-LENGTH procedure of Section 14.4).
Partition the n × n array A into four n/2 × n/2 subarrays as follows:

A = ( A11  A12 )
    ( A21  A22 ).    (26.9)

You can immediately fill in subarray A11 recursively, since it does not
depend on the entries in the other three subarrays. Once the
computation of A 11 finishes, you can fill in A 12 and A 21 recursively in parallel, because although they both depend on A 11, they do not
depend on each other. Finally, you can fill in A 22 recursively.
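The quadrant order described above can be sketched in Python (names are ours; n is assumed to be a power of 2, and the two middle calls, which could run in parallel, run serially here):

```python
def simple_stencil(n, f):
    """Fill an n-by-n array (n a power of 2) in the quadrant order of
    the decomposition: A11 first, then A12 and A21 (parallelizable),
    then A22. f(A, i, j) computes one entry from already filled
    entries above and/or to its left."""
    A = [[None] * n for _ in range(n)]

    def fill(r, c, size):
        if size == 1:
            A[r][c] = f(A, r, c)
            return
        h = size // 2
        fill(r, c, h)                    # A11
        fill(r, c + h, h)                # A12: would be spawned ...
        fill(r + h, c, h)                # A21: ... in parallel with A12
        fill(r + h, c + h, h)            # A22, after a sync

    fill(0, 0, n)
    return A

# example stencil: number of monotone lattice paths reaching (i, j)
paths = simple_stencil(4, lambda A, i, j:
                       1 if i == 0 or j == 0 else A[i - 1][j] + A[i][j - 1])
```

Every entry a quadrant reads lies either inside that quadrant (handled by induction) or in a quadrant filled earlier, which is why this order respects the stencil's dependencies.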
a. Give parallel pseudocode that performs this simple stencil calculation
using a divide-and-conquer algorithm SIMPLE-STENCIL based on
the decomposition (26.9) and the discussion above. (Don’t worry
about the details of the base case, which depends on the specific
stencil.) Give and solve recurrences for the work and span of this
algorithm in terms of n. What is the parallelism?
b. Modify your solution to part (a) to divide an n × n array into nine n/3
× n/3 subarrays, again recursing with as much parallelism as possible.
Analyze this algorithm. How much more or less parallelism does this
algorithm have compared with the algorithm from part (a)?
c. Generalize your solutions to parts (a) and (b) as follows. Choose an integer b ≥ 2. Divide an n × n array into b² subarrays, each of size n/b × n/b, recursing with as much parallelism as possible. In terms of n and b, what are the work, span, and parallelism of your algorithm? Argue that, using this approach, the parallelism must be o(n) for any choice of b ≥ 2. (Hint: For this argument, show that the exponent of n in the parallelism is strictly less than 1 for any choice of b ≥ 2.)
d. Give pseudocode for a parallel algorithm for this simple stencil
calculation that achieves Θ( n/lg n) parallelism. Argue using notions of work and span that the problem has Θ( n) inherent parallelism.
Unfortunately, simple fork-join parallelism does not let you achieve
this maximal parallelism.
26-6 Randomized parallel algorithms
Like serial algorithms, parallel algorithms can employ random-number
generators. This problem explores how to adapt the measures of work,
span, and parallelism to handle the expected behavior of randomized
task-parallel algorithms. It also asks you to design and analyze a
parallel algorithm for randomized quicksort.
a. Explain how to modify the work law (26.2), span law (26.3), and greedy scheduler bound (26.4) to work with expectations when TP, T1, and T∞ are all random variables.
b. Consider a randomized parallel algorithm for which 1% of the time, T1 = 10⁴ and T10,000 = 1, but for the remaining 99% of the time, T1 = T10,000 = 10⁹. Argue that the speedup of a randomized parallel algorithm should be defined as E[T1]/E[TP], rather than E[T1/TP].
c. Argue that the parallelism of a randomized task-parallel algorithm
should be defined as the ratio E[ T 1]/ E[ T∞].
d. Parallelize the RANDOMIZED-QUICKSORT algorithm on page
192 by using recursive spawning to produce P-RANDOMIZED-
QUICKSORT. (Do not parallelize RANDOMIZED-PARTITION.)
e. Analyze your parallel algorithm for randomized quicksort. ( Hint:
Review the analysis of RANDOMIZED-SELECT on page 230.)
f. Parallelize RANDOMIZED-SELECT on page 230. Make your
implementation as parallel as possible. Analyze your algorithm. ( Hint:
Use the partitioning algorithm from Exercise 26.3-3.)
Chapter notes
Parallel computers and algorithmic models for parallel programming
have been around in various forms for years. Prior editions of this book
included material on sorting networks and the PRAM (Parallel
Random-Access Machine) model. The data-parallel model [58, 217] is another popular algorithmic programming model, which features
operations on vectors and matrices as primitives. The notion of
sequential consistency is due to Lamport [275].
Graham [197] and Brent [71] showed that there exist schedulers achieving the bound of Theorem 26.1. Eager, Zahorjan, and Lazowska
[129] showed that any greedy scheduler achieves this bound and proposed the methodology of using work and span (although not by
those names) to analyze parallel algorithms. Blelloch [57] developed an algorithmic programming model based on work and span (which he
called “depth”) for data-parallel programming. Blumofe and Leiserson
[63] gave a distributed scheduling algorithm for task-parallel computations based on randomized “work-stealing” and showed that it
achieves the bound E[TP] ≤ T1/P + O(T∞). Arora, Blumofe, and Plaxton [20] and Blelloch, Gibbons, and Matias [61] also provided provably good algorithms for scheduling task-parallel computations.
The recent literature contains many algorithms and strategies for
scheduling parallel programs.
The parallel pseudocode and programming model were influenced by
Cilk [290, 291, 383, 396]. The open-source project OpenCilk
(www.opencilk.org) provides Cilk programming as an extension to the C and C++ programming languages. All of the parallel algorithms in
this chapter can be coded straightforwardly in Cilk.
Concerns about nondeterministic parallel programs were expressed
by Lee [281] and Bocchino, Adve, Adve, and Snir [64]. The algorithms literature contains many algorithmic strategies (see, for example, [60, 85,
118, 140, 160, 282, 283, 412, 461]) for detecting races and extending the fork-join model to avoid or safely embrace various kinds of
nondeterminism. Blelloch, Fineman, Gibbons, and Shun [59] showed that deterministic parallel algorithms can often be as fast as, or even
faster than, their nondeterministic counterparts.
Several of the parallel algorithms in this chapter appeared in
unpublished lecture notes by C. E. Leiserson and H. Prokop and were
originally implemented in Cilk. The parallel merge-sorting algorithm
was inspired by an algorithm due to Akl [12].
1 In mathematics, a projection is an idempotent function, that is, a function f such that f ∘ f = f. In this case, the function f maps the set P of fork-join programs to the set PS ⊂ P of serial programs, which are themselves fork-join programs with no parallelism. For a fork-join program x ∈ P, since we have f(f(x)) = f(x), the serial projection, as we have defined it, is indeed a mathematical projection.
2 Also called a computation dag in the literature.
Most problems described in this book have assumed that the entire
input was available before the algorithm executes. In many situations,
however, the input becomes available not in advance, but only as the
algorithm executes. This idea was implicit in much of the discussion of
data structures in Part III. The reason that you want to design, for example, a data structure that can handle n INSERT, DELETE, and
SEARCH operations in O(lg n) time per operation is most likely because you are going to receive n such operation requests without
knowing in advance what operations will be coming. This idea was also
implicit in amortized analysis in Chapter 16, where we saw how to maintain a table that can grow or shrink in response to a sequence of
insertion and deletion operations, yet with a constant amortized cost
per operation.
An online algorithm receives its input progressively over time, rather
than having the entire input available at the start, as in an offline
algorithm. Online algorithms pertain to many situations in which
information arrives gradually. A stock trader must make decisions
today, without knowing what the prices will be tomorrow, yet wants to
achieve good returns. A computer system must schedule arriving jobs
without knowing what work will need to be done in the future. A store
must decide when to order more inventory without knowing what the
future demand will be. A driver for a ride-hailing service must decide
whether to pick up a fare without knowing who will request rides in the
future. In each of these situations, and many more, algorithmic
decisions must be made without knowledge of the future.
There are several approaches for dealing with unknown future
inputs. One approach is to form a probabilistic model of future inputs
and design an algorithm that assumes future inputs conform to the
model. This technique is common, for example, in the field of queuing
theory, and it is also related to machine learning. Of course, you might
not be able to develop a workable probabilistic model, or even if you
can, some inputs might not conform to it. This chapter takes a different
approach. Instead of assuming anything about the future input, we
employ a conservative strategy of limiting how poor a solution any
input can entail.
This chapter, therefore, adopts a worst-case approach, designing
online algorithms that guarantee the quality of the solution for all
possible future inputs. We’ll analyze online algorithms by comparing
the solution produced by the online algorithm with a solution produced
by an optimal algorithm that knows the future inputs, and taking a
worst-case ratio over all possible instances. We call this methodology
competitive analysis. We’ll use a similar approach when we study
approximation algorithms in Chapter 35, where we’ll compare the solution returned by an algorithm that might be suboptimal with the
value of the optimal solution, and determine a worst-case ratio over all
possible instances.
We start with a “toy” problem: deciding between whether to take the
elevator or the stairs. This problem will introduce the basic
methodology of thinking about online algorithms and how to analyze
them via competitive analysis. We will then look at two problems that
use competitive analysis. The first is how to maintain a search list so
that the access time is not too large, and the second is about strategies
for deciding which cache blocks to evict from a cache or other kind of
fast computer memory.
Our first example of an online algorithm models a problem that you
likely have encountered yourself: whether you should wait for an
elevator to arrive or just take the stairs. Suppose that you enter a
building and wish to visit an office that is k floors up. You have two choices: walk up the stairs or take the elevator. Let’s assume, for
convenience, that you can climb the stairs at the rate of one floor per
minute. The elevator travels much faster than you can climb the stairs: it
can ascend all k floors in just one minute. Your dilemma is that you do
not know how long it will take for the elevator to arrive at the ground
floor and pick you up. Should you take the elevator or the stairs? How
do you decide?
Let’s analyze the problem. Taking the stairs takes k minutes, no
matter what. Suppose you know that the elevator takes at most B − 1
minutes to arrive for some value of B that is considerably higher than k.
(The elevator could be going up when you call for it and then stop at
several floors on its way down.) To keep things simple, let’s also assume
that the number of minutes for the elevator to arrive is an integer.
Therefore, waiting for the elevator and taking it k floors up takes
anywhere from one minute (if the elevator is already at the ground floor)
to ( B − 1) + 1 = B minutes (the worst case). Although you know B and k, you don’t know how long the elevator will take to arrive this time.
You can use competitive analysis to inform your decision regarding
whether to take the stairs or elevator. In the spirit of competitive
analysis, you want to be sure that, no matter what the future brings (i.e.,
how long the elevator takes to arrive), you will not wait much longer
than a seer who knows when the elevator will arrive.
Let us first consider what the seer would do. If the seer knows that
the elevator is going to arrive in at most k − 1 minutes, the seer waits for
the elevator, and otherwise, the seer takes the stairs. Letting m denote
the number of minutes it takes for the elevator to arrive at the ground
floor, we can express the time that the seer spends as the function

t(m) = m + 1  if m ≤ k − 1 (wait for the elevator),
t(m) = k      if m ≥ k (take the stairs).    (27.1)
We typically evaluate online algorithms by their competitive ratio.
Let U denote the set (universe) of all possible inputs, and consider some
input I ∈ U. For a minimization problem, such as the stairs-versus-elevator problem, if an online algorithm A produces a solution with value A(I) on input I and the solution from an algorithm F that knows the future has value F(I) on the same input, then the competitive ratio of algorithm A is

max {A(I)/F(I) : I ∈ U}.
If an online algorithm has a competitive ratio of c, we say that it is c-
competitive. The competitive ratio is always at least 1, so that we want
an online algorithm with a competitive ratio as close to 1 as possible.
In the stairs-versus-elevator problem, the only input is the time for
the elevator to arrive. Algorithm F knows this information, but an
online algorithm has to make a decision without knowing when the
elevator will arrive. Consider the algorithm "always take the stairs," which always takes exactly k minutes. Using equation (27.1), the competitive ratio is

max {k/t(m) : 0 ≤ m ≤ B − 1}.    (27.2)

Enumerating the terms in equation (27.2) gives the competitive ratio as

max {k/1, k/2, … , k/k, k/k, … , k/k} = k,

so that the competitive ratio is k. The maximum is achieved when the elevator arrives immediately. In this case, taking the stairs requires k minutes, but the optimal solution takes just 1 minute.
Now let’s consider the opposite approach: “always take the elevator.”
If it takes m minutes for the elevator to arrive at the ground floor, then
this algorithm will always take m + 1 minutes. Thus the competitive
ratio becomes
max {(m + 1)/t(m) : 0 ≤ m ≤ B − 1},

which we can again enumerate as

max {1/1, 2/2, … , k/k, (k + 1)/k, (k + 2)/k, … , B/k} = B/k.

Now the maximum is achieved when the elevator takes B − 1 minutes to arrive, compared with the optimal approach of taking the stairs, which requires k minutes.

Hence, the algorithm “always take the stairs” has competitive ratio
k, and the algorithm “always take the elevator” has competitive ratio B/ k. Because we prefer the algorithm with smaller competitive ratio, if k
= 10 and B = 300, we prefer “always take the stairs,” with competitive
ratio 10, over “always take the elevator,” with competitive ratio 30.
Taking the stairs is not always better, or necessarily more often better.
It’s just that taking the stairs guards better against the worst-case future.
These two approaches of always taking the stairs and always taking
the elevator are extreme solutions, however. Instead, you can “hedge
your bets” and guard even better against a worst-case future. In
particular, you can wait for the elevator for a while, and then if it doesn’t
arrive, take the stairs. How long is “a while”? Let’s say that “a while” is
k minutes. Then the time h(m) required by this hedging strategy, as a function of the number m of minutes before the elevator arrives, is

h(m) = m + 1  if m ≤ k − 1,
h(m) = 2k     if m ≥ k.

In the second case, h(m) = 2k because you wait for k minutes and then climb the stairs for k minutes. The competitive ratio is now

max {h(m)/t(m) : 0 ≤ m ≤ B − 1}.

Enumerating this ratio yields

max {1/1, 2/2, … , k/k, 2k/k, 2k/k, … , 2k/k} = 2.

The competitive ratio is now 2, independent of k and B.
This example illustrates a common philosophy in online algorithms:
we want an algorithm that guards against any possible worst case.
Initially, waiting for the elevator guards against the case when the
elevator arrives quickly, but eventually switching to the stairs guards
against the case when the elevator takes a long time to arrive.
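The three strategies can be compared by brute-force enumeration over all arrival times, mirroring the analysis above; this short Python check (our own sketch, using the worked values k = 10 and B = 300 from the text) recovers the ratios k, B/k, and 2:

```python
def t(m, k):
    """Time spent by the seer: wait m + 1 minutes if the elevator
    beats the stairs, otherwise take the k-minute stairs."""
    return m + 1 if m <= k - 1 else k

def ratio(strategy, k, B):
    """Worst-case ratio of a strategy's time to the seer's time,
    taken over all arrival times 0 <= m <= B - 1."""
    return max(strategy(m, k) / t(m, k) for m in range(B))

k, B = 10, 300
stairs   = ratio(lambda m, k: k, k, B)             # always take the stairs
elevator = ratio(lambda m, k: m + 1, k, B)         # always take the elevator
hedge    = ratio(lambda m, k: m + 1 if m <= k - 1 else 2 * k, k, B)
```

Enumeration like this only verifies particular values of k and B, of course; the algebraic argument above is what establishes the ratios in general.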
Exercises
27.1-1
Suppose that when hedging your bets, you wait for p minutes, instead of for k minutes, before taking the stairs. What is the competitive ratio as a
function of p and k? How should you choose p to minimize the competitive ratio?
27.1-2
Imagine that you decide to take up downhill skiing. Suppose that a pair
of skis costs r dollars to rent for a day and b dollars to buy, where b > r.
If you knew in advance how many days you would ever ski, your
decision whether to rent or buy would be easy. If you’ll ski for at least
⌈ b/ r⌉ days, then you should buy skis, and otherwise you should rent.
This strategy minimizes the total that you ever spend. In reality, you
don’t know in advance how many days you’ll eventually ski. Even after
you have skied several times, you still don’t know how many more times
you’ll ever ski. Yet you don’t want to waste your money. Give and
analyze an algorithm that has a competitive ratio of 2, that is, an
algorithm guaranteeing that, no matter how many times you ski, you
never spend more than twice what you would have spent if you knew
from the outset how many times you’ll ski.
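The classic solution to this exercise is a rent-then-buy threshold strategy; whether it is the intended one we leave to the reader, but the following sketch (names and the sample prices are ours) shows the 2-competitive behavior numerically:

```python
import math

def online_ski_cost(r, b, d):
    """Rent for the first ceil(b/r) - 1 days, then buy on the next
    ski day. d is how many days you actually end up skiing."""
    q = math.ceil(b / r)
    if d < q:
        return d * r               # never reached the buying day
    return (q - 1) * r + b         # rented q - 1 days, then bought

def opt_ski_cost(r, b, d):
    """Cost for a clairvoyant skier who knows d in advance."""
    return min(d * r, b)

# the strategy never pays more than twice the clairvoyant cost
worst = max(online_ski_cost(50, 400, d) / opt_ski_cost(50, 400, d)
            for d in range(1, 100))
```

The intuition: by the day you buy, you have spent less than b on rentals, so your total is under 2b, while the clairvoyant skier pays at least min(d·r, b) ≥ b once d reaches the threshold.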
27.1-3
In “concentration solitaire,” a game for one person, you have n pairs of
matching cards. The backs of the cards are all the same, but the fronts
contain pictures of animals. One pair has pictures of aardvarks, one pair
has pictures of bears, one pair has pictures of camels, and so on. At the
start of the game, the cards are all placed face down. In each round, you
can turn two cards face up to reveal their pictures. If the pictures match,
then you remove that pair from the game. If they don’t match, then you
turn both of them over, hiding their pictures once again. The game ends
when you have removed all n pairs, and your score is how many rounds
you needed to do so. Suppose that you can remember the picture on
every card that you have seen. Give an algorithm to play concentration
solitaire that has a competitive ratio of 2.
27.2 Maintaining a search list
The next example of an online algorithm pertains to maintaining the order of elements in a linked list, as in Section 10.2. This problem often arises in practice for hash tables when collisions are resolved by
chaining (see Section 11.2), since each slot contains a linked list.
Reordering the linked list of elements in each slot of the hash table can
boost the performance of searches measurably.
The list-maintenance problem can be set up as follows. You are given a list L of n elements {x1, x2, … , xn}. We'll assume that the list is doubly linked, although the algorithms and analysis work just as well for singly linked lists. Denote the position of element xi in the list L by rL(xi), where 1 ≤ rL(xi) ≤ n. Calling LIST-SEARCH(L, xi) on page 260 thus takes Θ(rL(xi)) time.
If you know in advance something about the distribution of search
requests, then it makes sense to arrange the list ahead of time to put the
more frequently searched elements closer to the front, which minimizes
the total cost (see Exercise 27.2-1). If instead you don’t know anything
about the search sequence, then no matter how you arrange the list, it is
possible that every search is for whatever element appears at the tail of
the list. The total searching time would then be Θ( nm), where m is the
number of searches.
If you notice patterns in the access sequence or you observe
differences in the frequencies in which elements are accessed, then you
might want to rearrange the list as you perform searches. For example,
if you discover that every search is for a particular element, you could
move that element to the front of the list. In general, you could
rearrange the list after each call to LIST-SEARCH. But how would you
do so without knowing the future? After all, no matter how you move
elements around, every search could be for the last element.
But it turns out that some search sequences are “easier” than others.
Rather than just evaluate performance on the worst-case sequence, let’s
compare a reorganization scheme with whatever an optimal offline
algorithm would do if it knew the search sequence in advance. That way,
if the sequence is fundamentally hard, the optimal offline algorithm will
also find it hard, but if the sequence is easy, we can hope to do
reasonably well.
To ease analysis, we'll drop the asymptotic notation and say that the cost is just i to search for the ith element in the list. Let's also assume that the only way to reorder the elements in the list is by swapping two
adjacent elements in the list. Because the list is doubly linked, each swap
incurs a cost of 1. Thus, for example, a search for the sixth element
followed by moving it forward two places (entailing two swaps) incurs a
total cost 8. The goal is to minimize the total cost of calls to LIST-
SEARCH plus the total number of swaps performed.
The online algorithm that we'll explore is MOVE-TO-FRONT(L, x). This procedure first searches for x in the doubly linked list L, and then it moves x to the front of the list.1 If x is located at position r = rL(x) before the call, MOVE-TO-FRONT swaps x with the element in position r − 1, then with the element in position r − 2, and so on, until it finally swaps x with the element in position 1. Thus if the call MOVE-TO-FRONT(L, 8) executes on the list L = 〈5, 3, 12, 4, 8, 9, 22〉, the list becomes 〈8, 5, 3, 12, 4, 9, 22〉. The call MOVE-TO-FRONT(L, k) costs 2rL(k) − 1: it costs rL(k) to search for k, and it costs 1 for each of the rL(k) − 1 swaps that move k to the front of the list.
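A Python list stands in for the doubly linked list in this quick sketch of MOVE-TO-FRONT with its cost accounting (a Python list is not a linked list, so only the cost model, not the running time, matches the text):

```python
def move_to_front(L, x):
    """Search for x in L, move it to the front, and return the cost
    2r - 1, where r is x's 1-indexed position before the call:
    r for the search plus r - 1 adjacent swaps."""
    r = L.index(x) + 1            # search cost: r
    L.insert(0, L.pop(r - 1))     # stands in for the r - 1 adjacent swaps
    return 2 * r - 1

L = [5, 3, 12, 4, 8, 9, 22]
cost = move_to_front(L, 8)        # 8 was at position r = 5, so cost is 9
```

Running this reproduces the text's example: L becomes 〈8, 5, 3, 12, 4, 9, 22〉.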
Figure 27.1 The costs incurred by the procedures FORESEE and MOVE-TO-FRONT when searching for the elements 5, 3, 4, and 4, starting with the list L = 〈1, 2, 3, 4, 5〉. If FORESEE
instead moved 3 to the front after the search for 5, the cumulative cost would not change, nor would the cumulative cost change if 4 moved to the second position after the search for 5.
We’ll see that MOVE-TO-FRONT has a competitive ratio of 4. Let’s
think about what this means. MOVE-TO-FRONT performs a series of
operations on a doubly linked list, accumulating cost. For comparison,
suppose that there is an algorithm FORESEE that knows the future.
Like MOVE-TO-FRONT, it also searches the list and moves elements
around, but after each call it optimally rearranges the list for the future.
(There may be more than one optimal order.) Thus FORESEE and
MOVE-TO-FRONT maintain different lists of the same elements.
Consider the example shown in Figure 27.1. Starting with the list 〈1, 2, 3, 4, 5〉, four searches occur, for the elements 5, 3, 4, and 4. The
hypothetical procedure FORESEE, after searching for 3, moves 4 to the
front of the list, knowing that a search for 4 is imminent. It thus incurs a
swap cost of 3 upon its second call, after which no further swap costs
accrue. MOVE-TO-FRONT incurs swap costs in each step, moving the
found element to the front. In this example, MOVE-TO-FRONT has a
higher cost in each step, but that is not necessarily always the case.
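MOVE-TO-FRONT's cumulative cost on the Figure 27.1 request sequence can be checked with a small simulation (an illustration written for this discussion, not code from the text):

```python
def mtf_total_cost(L, requests):
    """Total cost of MOVE-TO-FRONT on a request sequence: each search for
    the element at rank r costs 2r - 1 (r for the search, r - 1 swaps)."""
    L = list(L)
    total = 0
    for x in requests:
        r = L.index(x) + 1
        total += 2 * r - 1
        L.insert(0, L.pop(r - 1))
    return total

# Figure 27.1 example: MOVE-TO-FRONT's per-call costs are 9 + 7 + 9 + 1 = 26.
mtf_total_cost([1, 2, 3, 4, 5], [5, 3, 4, 4])
```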
The key to proving the competitive bound is to show that at any
point, the total cost of MOVE-TO-FRONT is not much higher than
that of FORESEE. Surprisingly, we can determine a bound on the costs
incurred by MOVE-TO-FRONT relative to FORESEE even though
MOVE-TO-FRONT cannot see the future.
If we compare any particular step, MOVE-TO-FRONT and
FORESEE may be operating on very different lists and do very
different things. If we focus on the search for 4 above, we observe that
FORESEE actually moves it to the front of the list early, paying to
move the element to the front before it is accessed. To capture this
concept, we use the idea of an inversion: a pair of elements, say a and b, in which a appears before b in one list, but b appears before a in another list. For two lists L and L′, let I( L, L′), called the inversion count, denote the number of inversions between the two lists, that is, the number of
pairs of elements whose order differs in the two lists. For example, with lists L = 〈5, 3, 1, 4, 2〉 and L′ = 〈3, 1, 2, 4, 5〉, exactly five of the 10 possible pairs—(1, 5), (2, 4), (2, 5), (3, 5), (4, 5)—are inversions,
since these pairs, and only these pairs, appear in different orders in the
two lists. Thus the inversion count is I( L, L′) = 5.
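A direct, if quadratic, way to compute the inversion count is to check every pair of elements. The Python sketch below (an illustration under the definitions above, not code from the text) reproduces I(L, L′) = 5 for the example lists:

```python
from itertools import combinations

def inversion_count(L, Lp):
    """I(L, Lp): the number of pairs of elements whose relative order
    differs between lists L and Lp (assumed to hold the same elements)."""
    pos = {x: i for i, x in enumerate(Lp)}
    count = 0
    for a, b in combinations(L, 2):   # a precedes b in L
        if pos[a] > pos[b]:           # ... but b precedes a in Lp
            count += 1
    return count

# The example from the text: five inversions.
inversion_count([5, 3, 1, 4, 2], [3, 1, 2, 4, 5])
```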
In order to analyze the algorithm, we define the following notation.
Let L^M_i be the list maintained by MOVE-TO-FRONT immediately after the i th search, and similarly, let L^F_i be FORESEE’s list immediately after the i th search. Let c^M_i and c^F_i be the costs incurred by MOVE-TO-FRONT and FORESEE on their i th calls, respectively. We don’t know how many swaps FORESEE performs in its i th call, but we’ll denote
that number by t_i. Therefore, if the i th operation is a search for element x, then

    c^M_i = 2 r_{L^M_{i−1}}(x) − 1 ,    (27.3)
    c^F_i = r_{L^F_{i−1}}(x) + t_i .
In order to compare these costs more carefully, let’s break down the
elements into subsets, depending on their positions in the two lists
before the i th search, relative to the element x being searched for in the i th search. We define three sets:
BB = {elements before x in both L^M_{i−1} and L^F_{i−1}},
BA = {elements before x in L^M_{i−1} but after x in L^F_{i−1}},
AB = {elements after x in L^M_{i−1} but before x in L^F_{i−1}}.
We can now relate the position of element x in L^M_{i−1} and L^F_{i−1} to the sizes of these sets:

    r_{L^M_{i−1}}(x) = |BB| + |BA| + 1 ,
    r_{L^F_{i−1}}(x) = |BB| + |AB| + 1 .
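The three sets and the rank relations can be checked on a small example. The lists below are hypothetical (chosen for this illustration, not taken from the text), with x = 2:

```python
def rank_sets(LM, LF, x):
    """Return (BB, BA, AB) for element x relative to lists LM and LF."""
    before_M = set(LM[:LM.index(x)])   # elements preceding x in LM
    before_F = set(LF[:LF.index(x)])   # elements preceding x in LF
    return (before_M & before_F,       # BB: before x in both lists
            before_M - before_F,       # BA: before x only in LM
            before_F - before_M)       # AB: before x only in LF

LM = [4, 1, 2, 3, 5]
LF = [3, 1, 2, 4, 5]
BB, BA, AB = rank_sets(LM, LF, 2)
# Rank of x in each list follows from the set sizes:
assert LM.index(2) + 1 == len(BB) + len(BA) + 1   # rank of 2 in LM is 3
assert LF.index(2) + 1 == len(BB) + len(AB) + 1   # rank of 2 in LF is 3
```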
When a swap occurs in one of the lists, it changes the relative positions of the two elements involved, which in turn changes the inversion count. Suppose that elements x and y are swapped in some list. The only pair whose relative order changes is (x, y), so the only possible difference in the inversion count between this list and any other list depends on whether (x, y) is an inversion. In fact, the inversion status of (x, y) with respect to any other list must change: if (x, y) is an inversion before the swap, it no longer is afterward, and vice versa. Therefore, if two consecutive elements x and y swap positions in a list L, then for any other list L′, the value of the inversion count I(L, L′) either increases by 1 or decreases by 1.
As we compare MOVE-TO-FRONT and FORESEE searching and
modifying their lists, we’ll think about MOVE-TO-FRONT executing
on its list for the i th time and then FORESEE executing on its list for
the i th time. After MOVE-TO-FRONT has executed for the i th time and before FORESEE has executed for the i th time, we’ll compare I(L^M_{i−1}, L^F_{i−1}) (the inversion count immediately before the i th call of MOVE-TO-FRONT) with I(L^M_i, L^F_{i−1}) (the inversion count after the i th call of MOVE-TO-FRONT but before the i th call of FORESEE). We’ll concern ourselves later with what FORESEE does.
Let us analyze what happens to the inversion count after executing
the i th call of MOVE-TO-FRONT, and suppose that it searches for
element x. More precisely, we’ll compute the change in the inversion count

    I(L^M_i, L^F_{i−1}) − I(L^M_{i−1}, L^F_{i−1}) ,

which gives a rough idea of how much MOVE-TO-FRONT’s list becomes more or less like FORESEE’s list. After searching, MOVE-TO-FRONT performs a series of swaps with each of the elements on the list L^M_{i−1} that precedes x. Using the notation above, the number of such swaps is |BB| + |BA|. Bearing in mind that the list L^F_{i−1} has yet to be changed by the i th call of FORESEE, let’s see how
the inversion count changes.
Consider a swap with an element y ∈ BB. Before the swap, y precedes x in both L^M_{i−1} and L^F_{i−1}. After the swap, x precedes y in MOVE-TO-FRONT’s list, and L^F_{i−1} does not change. Therefore, the inversion count increases by 1 for each element in BB. Now consider a swap with an element z ∈ BA. Before the swap, z precedes x in L^M_{i−1} but x precedes z in L^F_{i−1}. After the swap, x precedes z in both lists. Therefore, the inversion count decreases by 1 for each element in BA. Thus altogether, the inversion count increases by

    I(L^M_i, L^F_{i−1}) − I(L^M_{i−1}, L^F_{i−1}) = |BB| − |BA| .    (27.7)
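This |BB| − |BA| accounting can be verified empirically. The sketch below (an illustration written for this discussion; the example lists are hypothetical) moves x to the front of one list and checks the resulting change in the inversion count:

```python
from itertools import combinations

def inversions(L, Lp):
    """Inversion count I(L, Lp): pairs ordered differently in the two lists."""
    pos = {e: i for i, e in enumerate(Lp)}
    return sum(1 for a, b in combinations(L, 2) if pos[a] > pos[b])

def delta_after_move_to_front(LM, LF, x):
    """Move x to the front of LM and return the change in I(LM, LF),
    checking that it equals |BB| - |BA|."""
    before_M = set(LM[:LM.index(x)])
    before_F = set(LF[:LF.index(x)])
    bb = len(before_M & before_F)    # before x in both lists
    ba = len(before_M - before_F)    # before x in LM only
    LM_after = [x] + [e for e in LM if e != x]   # x moved to the front
    delta = inversions(LM_after, LF) - inversions(LM, LF)
    assert delta == bb - ba
    return delta

delta_after_move_to_front([4, 1, 2, 3, 5], [3, 1, 2, 4, 5], 2)  # → 0
```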
We have laid the groundwork needed to analyze MOVE-TO-
FRONT.
Theorem 27.1
Algorithm MOVE-TO-FRONT has a competitive ratio of 4.
Proof The proof uses a potential function, as described in Chapter 16
on amortized analysis. The value Φ_i of the potential function after the i th calls of MOVE-TO-FRONT and FORESEE depends on the inversion count:

    Φ_i = 2 · I(L^M_i, L^F_i) .
(Intuitively, the factor of 2 embodies the notion that each inversion
represents a cost of 2 for MOVE-TO-FRONT relative to FORESEE: 1
for searching and 1 for swapping.) By equation (27.7), after the i th call
of MOVE-TO-FRONT, but before the i th call of FORESEE, the
potential increases by 2(| BB| − | BA|). Since the inversion count of the two lists is nonnegative, we have Φ i ≥ 0 for all i ≥ 0. Assuming that MOVE-TO-FRONT and FORESEE start with the same list, the initial
potential Φ0 is 0, so that Φ i ≥ Φ0 for all i.
Drawing from equation (16.2) on page 456, the amortized cost of the i th MOVE-TO-FRONT operation is

    ĉ^M_i = c^M_i + Φ_i − Φ_{i−1} ,

where c^M_i, the actual cost of the i th MOVE-TO-FRONT operation, is given by equation (27.3):

    c^M_i = 2 r_{L^M_{i−1}}(x) − 1 = 2(|BB| + |BA| + 1) − 1 = 2 |BB| + 2 |BA| + 1 .
Now, let’s consider the potential change Φ_i − Φ_{i−1}. Since both L^M and L^F change, let’s consider the changes to one list at a time. Recall that
when MOVE-TO-FRONT moves element x to the front, it increases the
potential by exactly 2(| BB| − | BA|). We now consider how the optimal
algorithm FORESEE changes its list LF: it performs ti swaps. Each swap performed by FORESEE either increases or decreases the
potential by 2, and thus the increase in potential by FORESEE in the
i th call can be at most 2 t_i. We therefore have

    ĉ^M_i = c^M_i + Φ_i − Φ_{i−1}
          ≤ (2 |BB| + 2 |BA| + 1) + 2(|BB| − |BA|) + 2 t_i
          = 4 |BB| + 1 + 2 t_i
          ≤ 4 (|BB| + |AB| + 1 + t_i)
          = 4 (r_{L^F_{i−1}}(x) + t_i)
          = 4 c^F_i .
We now finish the proof as in Chapter 16 by showing that the total
amortized cost provides an upper bound on the total actual cost,
because the initial potential function is 0 and the potential function is
always nonnegative. By equation (16.3) on page 456, for any sequence of m MOVE-TO-FRONT operations, we have

    Σ_{i=1}^{m} c^M_i ≤ Σ_{i=1}^{m} ĉ^M_i ≤ Σ_{i=1}^{m} 4 c^F_i .

Therefore, we have

    Σ_{i=1}^{m} c^M_i ≤ 4 Σ_{i=1}^{m} c^F_i .
Thus the total cost of the m MOVE-TO-FRONT operations is at most 4
times the total cost of the m FORESEE operations, so MOVE-TO-
FRONT is 4-competitive.
▪
Isn’t it amazing that we can compare MOVE-TO-FRONT with the
optimal algorithm FORESEE when we have no idea of the swaps that
FORESEE makes? We were able to relate the performance of MOVE-
TO-FRONT to the optimal algorithm by capturing how particular
properties (swaps in this case) must evolve relative to the optimal
algorithm, without actually knowing the optimal algorithm.
The online algorithm MOVE-TO-FRONT has a competitive ratio of
4: on any input sequence, it incurs a cost at most 4 times that of any
other algorithm. On a particular input sequence, it could cost much less
than 4 times the optimal algorithm, perhaps even matching the optimal
algorithm.
Exercises
27.2-1
You are given a set S = { x 1, x 2, … , xn} of n elements, and you wish to make a static list L (no rearranging once the list is created) containing
the elements of S that is good for searching. Suppose that you have a
probability distribution, where p(x_i) is the probability that a given search searches for element x_i. Argue that the expected cost for m searches is

    m · Σ_{i=1}^{n} p(x_i) · r_L(x_i) .

Prove that this sum is minimized when the elements of L are sorted in decreasing order with respect to p(x_i).
27.2-2
Professor Carnac claims that since FORESEE is an optimal algorithm
that knows the future, then at each step it must incur no more cost than
MOVE-TO-FRONT. Either prove that Professor Carnac is correct or
provide a counterexample.
27.2-3
Another way to maintain a linked list for efficient searching is for each
element to maintain a frequency count: the number of times that the
element has been searched for. The idea is to rearrange list elements
after searches so that the list is always sorted by decreasing frequency
count, from largest to smallest. Either show that this algorithm is O(1)-
competitive, or prove that it is not.
27.2-4
The model in this section charged a cost of 1 for each swap. We can
consider an alternative cost model in which, after accessing x, you can
move x anywhere earlier in the list, and there is no cost for doing so.
The only cost is the cost of the actual accesses. Show that MOVE-TO-FRONT is 2-competitive in this cost model, assuming that the number of requests is sufficiently large. (Hint: Use the potential function Φ_i = I(L^M_i, L^F_i).)
In Section 15.4, we studied the caching problem, in which blocks of data from the main memory of a computer are stored in the cache: a small
but faster memory. In that section, we studied the offline version of the
problem, in which we assumed that we knew the sequence of memory
requests in advance, and we designed an algorithm to minimize the
number of cache misses. In almost all computer systems, caching is, in
fact, an online problem. We do not generally know the series of cache
requests in advance; they are presented to the algorithm only as the
requests for blocks are actually made. To gain a better understanding of
this more realistic scenario, we analyze online algorithms for caching.
We will first see that all deterministic online algorithms for caching have
a lower bound of Ω( k) for the competitive ratio, where k is the size of the cache. We will then present an algorithm with a competitive ratio of
Θ( n), where the input size is n, and one with a competitive ratio of O( k), which matches the lower bound. We will end by showing how to use
randomization to design an algorithm with a much better competitive
ratio of Θ(lg k). We will also discuss the assumptions that underlie randomized online algorithms, via the notion of an adversary, such as
we saw in Chapter 11 and will see in Chapter 31.
You can find the terminology used to describe the caching problem
in Section 15.4, which you might wish to review before proceeding.
27.3.1 Deterministic caching algorithms
In the caching problem, the input comprises a sequence of n memory
requests, for data in blocks b 1, b 2, … , bn, in that order. The blocks requested are not necessarily distinct: each block may appear multiple
times within the request sequence. After block bi is requested, it resides
in a cache that can hold up to k blocks, where k is a fixed cache size. We assume that n > k, since otherwise we are assured that the cache can hold all the requested blocks at once. When a block bi is requested, if it
is already in the cache, then a cache hit occurs and the cache remains
unchanged. If bi is not in the cache, then a cache miss occurs. If the cache contains fewer than k blocks upon a cache miss, block bi is placed into the cache, which now contains one block more than before. If a
cache miss occurs with an already full cache, however, some block must be evicted from the cache before bi can enter. Thus, a caching algorithm
must decide which block to evict from the cache upon a cache miss
when the cache is full. The goal is to minimize the number of cache
misses over the entire request sequence. The caching algorithms
considered in this chapter differ only in which block they decide to evict
upon a cache miss. We do not consider abilities such as prefetching, in
which a block is brought into the cache before an upcoming request in
order to avert a future cache miss.
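The model above can be sketched in a few lines of Python (an illustration written for this discussion; the `evict` parameter, a name of our choosing, stands in for whatever eviction policy a particular algorithm uses):

```python
def count_misses(requests, k, evict):
    """Simulate a size-k cache on a request sequence and count misses.

    `evict(cache)` returns the index of the block to remove when the
    cache is full; the cache list is kept in order of insertion.
    """
    cache = []
    misses = 0
    for b in requests:
        if b in cache:
            continue                 # cache hit: cache unchanged
        misses += 1                  # cache miss
        if len(cache) == k:
            cache.pop(evict(cache))  # full cache: evict some block
        cache.append(b)
    return misses

# Evicting index 0 removes the oldest block (first-in, first-out):
count_misses(['a', 'b', 'a', 'c', 'b'], 2, lambda cache: 0)  # → 3
```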
There are many online caching policies to determine which block to
evict, including the following:
First-in, first-out (FIFO): evict the block that has been in the
cache the longest time.