The goal is to find a perfect matching M* (see Exercises 25.1-5 and
25.1-6) whose edges have the maximum total weight over all perfect
matchings. That is, letting w( M) = ∑( l, r)∈ M w( l, r) denote the total weight of the edges in matching M, we want to find a perfect matching
M* such that
w( M*) = max { w( M) : M is a perfect matching}.
We call finding such a maximum-weight perfect matching the
assignment problem. A solution to the assignment problem is a perfect
matching that maximizes the total utility. Like the stable-marriage
problem, the assignment problem finds a matching that is “good,” but
with a different definition of good: maximizing total value rather than
achieving stability.
Although you could enumerate all n! perfect matchings to solve the
assignment problem, an algorithm known as the Hungarian algorithm
solves it much faster. This section will prove an O(n⁴) time bound, and Problem 25-2 asks you to refine the algorithm to reduce the running
time to O(n³). Instead of working with the complete bipartite graph G, the Hungarian algorithm works with a subgraph of G called the
“equality subgraph.” The equality subgraph, which is defined below,
changes over time and has the beneficial property that any perfect
matching in the equality subgraph is also an optimal solution to the
assignment problem.
The equality subgraph depends on assigning an attribute h to each
vertex. We call h the label of a vertex, and we say that h is a feasible
vertex labeling of G if

l.h + r.h ≥ w(l, r) for all l ∈ L and r ∈ R.

A feasible vertex labeling always exists, such as the default vertex
labeling given by

l.h = max {w(l, r) : r ∈ R} for all l ∈ L,    (25.1)
r.h = 0 for all r ∈ R.    (25.2)
Given a feasible vertex labeling h, the equality subgraph Gh = (V, Eh) of G consists of the same vertices as G and the subset of edges

Eh = {(l, r) ∈ E : l.h + r.h = w(l, r)}.
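To make these definitions concrete, here is a small Python sketch (Python and the function names `is_feasible` and `equality_subgraph` are ours, not part of this chapter) that checks feasibility and extracts Eh from a weight matrix, with w[i][j] giving w(li, rj):

```python
def is_feasible(w, hl, hr):
    """Check that l.h + r.h >= w(l, r) for every edge of the complete
    bipartite graph, where hl[i] and hr[j] are the labels of l_i and r_j."""
    n = len(w)
    return all(hl[i] + hr[j] >= w[i][j] for i in range(n) for j in range(n))

def equality_subgraph(w, hl, hr):
    """Return the edge set E_h: pairs (i, j) with hl[i] + hr[j] == w[i][j]."""
    n = len(w)
    return {(i, j) for i in range(n) for j in range(n)
            if hl[i] + hr[j] == w[i][j]}
```

With the default labeling of equations (25.1) and (25.2), each row of the matrix attains its maximum in at least one column, so every vertex in L has at least one incident edge in Gh.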
The following theorem ties together a perfect matching in an
equality subgraph and an optimal solution to the assignment problem.
Theorem 25.14
Let G = ( V, E), where V = L ∪ R, be a complete bipartite graph where each edge ( l, r) ∈ E has weight w( l, r). Let h be a feasible vertex labeling of G and Gh be the equality subgraph of G. If Gh contains a perfect matching M*, then M* is an optimal solution to the assignment problem on G.
Proof If Gh contains a perfect matching M*, then because Gh and G
have the same sets of vertices, M* is also a perfect matching in G.
Because each edge of M* belongs to Gh and each vertex has exactly one
incident edge from any perfect matching, we have

w(M*) = ∑(l,r)∈M* w(l, r)
      = ∑(l,r)∈M* (l.h + r.h)
      = ∑v∈V v.h.

Letting M be any perfect matching in G, we have

w(M) = ∑(l,r)∈M w(l, r)
     ≤ ∑(l,r)∈M (l.h + r.h)
     = ∑v∈V v.h.    (25.3)

Thus, we have w(M) ≤ w(M*),
so that M* is a maximum-weight perfect matching in G.
▪
The goal now becomes finding a perfect matching in an equality
subgraph. Which equality subgraph? It does not matter! We have free
rein to not only choose an equality subgraph, but to change which
equality subgraph we choose as we go along. We just need to find some
perfect matching in some equality subgraph.
To understand the equality subgraph better, consider again the proof
of Theorem 25.14 and, in the second half, let M be any matching. The
proof is still valid, in particular, inequality (25.3): the weight of any
matching is always at most the sum of the vertex labels. If we choose any
set of vertex labels that define an equality subgraph, then a maximum-
cardinality matching in this equality subgraph has total value at most
the sum of the vertex labels. If the set of vertex labels is the “right” one,
then it will have total value equal to w( M*), and a maximum-cardinality matching in the equality subgraph is also a maximum-weight perfect
matching. The Hungarian algorithm repeatedly modifies the matching
and the vertex labels in order to achieve this goal.
The Hungarian algorithm starts with any feasible vertex labeling h
and any matching M in the equality subgraph Gh. It repeatedly finds an
M-augmenting path P in Gh and, using Lemma 25.1, updates the matching to be M ⊕ P, thereby incrementing the size of the matching.
As long as there is some equality subgraph that contains an M-
augmenting path, the size of the matching can increase, until a perfect
matching is achieved.
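The update M ⊕ P is just a symmetric difference of edge sets, which one line of Python expresses directly (a sketch with edges represented as pairs; the helper name `augment` is ours):

```python
def augment(M, P):
    """Return the matching M ⊕ P, where M is a matching and P is the
    edge set of an M-augmenting path, both given as Python sets.
    The symmetric difference drops the matched edges of P and keeps its
    unmatched edges, so the result has one more edge than M."""
    return M ^ P
```

For example, applying it with the path found in Figure 25.5(g) removes the two matched edges on the path and adds the three unmatched ones.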
Four questions arise:
1. What initial feasible vertex labeling should the algorithm start
with? Answer: the default vertex labeling given by equations
(25.1) and (25.2).
2. What initial matching in Gh should the algorithm start with?
Short answer: any matching, even an empty matching, but a
greedy maximal matching works well.
3. If an M-augmenting path exists in Gh, how to find it? Short answer: use a variant of breadth-first search similar to the
second phase of the procedure used in the Hopcroft-Karp
algorithm to find a maximal set of shortest M-augmenting paths.
4. What if the search for an M-augmenting path fails? Short
answer: update the feasible vertex labeling to bring in at least one
new edge.
We’ll elaborate on the short answers using the example that starts in
Figure 25.4. Here, L = {l1, l2, …, l7} and R = {r1, r2, …, r7}. The edge weights appear in the matrix shown in part (a), where the weight
w(li, rj) appears in row i and column j. The feasible vertex labels, given by the default vertex labeling, appear to the left of and above the
matrix. Matrix entries in red indicate edges (li, rj) for which li.h + rj.h =
w(li, rj), that is, edges in the equality subgraph Gh appearing in part (b) of the figure.
Greedy maximal bipartite matching
There are several ways to implement a greedy method to find a maximal
bipartite matching. The procedure GREEDY-BIPARTITE-MATCHING
shows one. Edges in Figure 25.4(b) highlighted in blue indicate the
initial greedy maximal matching in Gh. Exercise 25.3-2 asks
you to show that the GREEDY-BIPARTITE-MATCHING procedure
returns a matching that is at least half the size of a maximum matching.
GREEDY-BIPARTITE-MATCHING(G)
1  M = Ø
2  for each vertex l ∈ L
3      if l has an unmatched neighbor in R
4          choose any such unmatched neighbor r ∈ R
5          M = M ∪ {(l, r)}
6  return M
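One way to implement this pseudocode in Python (a sketch with names of our choosing: `adj` maps each vertex of L to a list of its neighbors in R):

```python
def greedy_bipartite_matching(adj, L):
    """Greedy maximal matching: scan the vertices of L in order and
    match each one to its first still-unmatched neighbor in R, if any."""
    matched_r = set()                 # vertices of R already matched
    M = set()
    for l in L:
        for r in adj[l]:
            if r not in matched_r:    # l has an unmatched neighbor in R
                M.add((l, r))         # choose any one such neighbor
                matched_r.add(r)
                break
    return M
```

Scanning order matters: a different order can produce a smaller maximal matching, but never less than half the size of a maximum matching (Exercise 25.3-2).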
Figure 25.4 The start of the Hungarian algorithm. (a) The matrix of edge weights for a bipartite graph with L = { l 1, l 2, … , l 7}. The value in row i and column j indicates w( li, rj). Feasible vertex labels appear above and next to the matrix. Red entries correspond to edges in the equality subgraph. (b) The equality subgraph Gh. Edges highlighted in blue belong to the initial greedy maximal matching M. Blue vertices are matched, and tan vertices are unmatched. (c) The directed equality subgraph GM,h created from Gh by directing edges in M from R to L and all other edges from L to R.
Finding an M-augmenting path in Gh
To find an M-augmenting path in the equality subgraph Gh with a matching M, the Hungarian algorithm first creates the directed equality
subgraph GM,h from Gh, just as the Hopcroft-Karp algorithm creates GM from G. As in the Hopcroft-Karp algorithm, you can think of an
M-augmenting path as starting from an unmatched vertex in L, ending
at an unmatched vertex in R, taking unmatched edges from L to R, and
taking matched edges from R to L. Thus, GM,h = (V, EM,h), where

EM,h = {(l, r) : l ∈ L, r ∈ R, and (l, r) ∈ Eh − M}   (edges from L to R)
     ∪ {(r, l) : r ∈ R, l ∈ L, and (l, r) ∈ M}        (edges from R to L).
Because an M-augmenting path in the directed equality subgraph GM,h
is also an M-augmenting path in the equality subgraph Gh, it suffices to find M-augmenting paths in GM,h. Figure 25.4(c) shows the directed equality subgraph GM,h corresponding to the equality subgraph Gh
and matching M from part (b) of the figure.
With the directed equality subgraph GM,h in hand, the Hungarian
algorithm searches for an M-augmenting path from any unmatched
vertex in L to any unmatched vertex in R. Any exhaustive graph-search
method suffices. Here, we’ll use breadth-first search, starting from all
the unmatched vertices in L (just as the Hopcroft-Karp algorithm does
when creating the dag H), but stopping upon first discovering some
unmatched vertex in R. Figure 25.5 shows the idea. To start from all the unmatched vertices in L, initialize the first-in, first-out queue with all the unmatched vertices in L, rather than just one source vertex. Unlike
the dag H in the Hopcroft-Karp algorithm, here each vertex needs just
one predecessor, so that the breadth-first search creates a breadth-first
forest F = ( VF, EF). Each unmatched vertex in L is a root in F.
In Figure 25.5(g), the breadth-first search has found the M-
augmenting path 〈( l 4, r 2), ( r 2, l 1), ( l 1, r 3), ( r 3, l 6), ( l 6, r 5)〉. Figure
25.6(a) shows the new matching created by taking the symmetric
difference of the matching M in Figure 25.5(a) with this M-augmenting path.
When the search for an M-augmenting path fails
Having updated the matching M from an M-augmenting path, the
Hungarian algorithm updates the directed equality subgraph GM,h
according to the new matching and then starts a new breadth-first
search from all the unmatched vertices in L. Figure 25.6 shows the start of this process, picking up from Figure 25.5.
In Figure 25.6(d), the queue contains vertices l 4 and l 3. Neither of these vertices has an edge that leaves it, however, so that once these
vertices are removed from the queue, the queue becomes empty. The
search terminates at this point, before discovering an unmatched vertex
in R to yield an M-augmenting path. Whenever this situation occurs, the most recently discovered vertices must belong to L. Why? Whenever
an unmatched vertex in R is discovered, the search has found an M-
augmenting path, and when a matched vertex in R is discovered, it has
an unvisited neighbor in L, which the search can then discover.
Recall that we have the freedom to work with any equality subgraph.
We can change the directed equality subgraph “on the fly,” as long as we
do not counteract the work already done. The Hungarian algorithm
updates the feasible vertex labeling h to fulfill the following criteria:
1. No edge in the breadth-first forest F leaves the directed equality
subgraph.
2. No edge in the matching M leaves the directed equality
subgraph.
3. At least one edge (l, r), where l ∈ L ∩ VF and r ∈ R − VF, goes into Eh, and hence into EM,h. Therefore, at least one vertex in R
will be newly discovered.
Thus, at least one new edge enters the directed equality subgraph, and
any edge that leaves the directed equality subgraph belongs to neither
the matching M nor the breadth-first forest F. Newly discovered vertices in R are enqueued, but their distances are not necessarily 1 greater than
the distances of the most recently discovered vertices in L.
Figure 25.5 Finding an M-augmenting path in GM,h by breadth-first search. (a) The directed equality subgraph GM,h from Figure 25.4(c). (b)–(g) Successive versions of the breadth-first forest F, shown as the vertices at each distance from the roots—the unmatched vertices in L—
are discovered. In parts (b)–(f), the layer of vertices closest to the bottom of the figure are those in the first-in, first-out queue. For example, in part (b), the queue contains the roots 〈 l 4, l 5, l 7〉, and in part (e), the queue contains 〈 r 3, r 4〉, at distance 3 from the roots. In part (g), the unmatched vertex r 5 is discovered, so the breadth-first search terminates. The path 〈( l 4, r 2), ( r 2, l 1), ( l 1, r 3), ( r 3, l 6), ( l 6, r 5)〉, highlighted in orange in parts (a) and (g), is an M-augmenting path. Taking its symmetric difference with the matching M yields a new matching with one more edge than M.

Figure 25.6 (a) The new matching M and the new directed equality subgraph GM,h after updating the matching in Figure 25.5(a) with the M-augmenting path in Figure 25.5(g). (b)–(d) Successive versions of the breadth-first forest F in a new breadth-first search with roots l 5 and l 7. After the vertices l 4 and l 3 in part (d) have been removed from the queue, the queue becomes empty before the search can discover an unmatched vertex in R.
To update the feasible vertex labeling, the Hungarian algorithm first
computes the value

δ = min {l.h + r.h − w(l, r) : l ∈ FL and r ∈ R − FR},    (25.4)

where FL = L ∩ VF and FR = R ∩ VF denote the vertices in the breadth-first forest F that belong to L and R, respectively. That is, δ is the smallest difference by which an edge incident on a vertex in FL
missed being in the current equality subgraph Gh. The Hungarian
algorithm then creates a new feasible vertex labeling, say h′, by
subtracting δ from l.h for all vertices l ∈ FL and adding δ to r.h for all vertices r ∈ FR:

v.h′ = v.h − δ  if v ∈ FL,
v.h′ = v.h + δ  if v ∈ FR,    (25.5)
v.h′ = v.h      otherwise.
The following lemma shows that these changes achieve the three criteria
above.
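A Python sketch of this relabeling step, assuming labels are stored in lists hl and hr indexed like the weight matrix (the function name `relabel` is ours):

```python
def relabel(w, hl, hr, FL, FR):
    """One relabeling step: compute delta as in equation (25.4) and
    shift the labels of the forest vertices as in equation (25.5).
    w[i][j] is the weight of edge (l_i, r_j); FL and FR are sets of
    indices of the forest vertices in L and R.  Modifies hl, hr in
    place and returns delta."""
    n = len(w)
    delta = min(hl[i] + hr[j] - w[i][j]
                for i in FL for j in range(n) if j not in FR)
    for i in FL:
        hl[i] -= delta          # subtract delta on the L side
    for j in FR:
        hr[j] += delta          # add delta on the R side
    return delta
```

Edges with both endpoints in the forest are unaffected, since δ is subtracted on one side and added on the other, while at least one edge from FL to R − FR drops into the equality subgraph.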
Lemma 25.15
Let h be a feasible vertex labeling for the complete bipartite graph G
with equality subgraph Gh, and let M be a matching for Gh and F be a breadth-first forest being constructed for the directed equality subgraph
GM,h. Then, the labeling h′ in equation (25.5) is a feasible vertex labeling for G with the following properties:
1. If ( u, v) is an edge in the breadth-first forest F for GM,h, then ( u, v) ∈ EM,h′.
2. If ( l, r) belongs to the matching M for Gh, then ( r, l) ∈ EM,h′.
3. There exist vertices l ∈ FL and r ∈ R − FR such that ( l, r) ∉
EM,h but ( l, r) ∈ EM,h′.
Proof We first show that h′ is a feasible vertex labeling for G. Because h is a feasible vertex labeling, we have l. h + r. h ≥ w( l, r) for all l ∈ L and r
∈ R. In order for h′ to not be a feasible vertex labeling, we would need l. h′ + r. h′ < l. h + r. h for some l ∈ L and r ∈ R. The only way this could occur would be for some l ∈ FL and r ∈ R − FR. In this instance, the amount of the decrease equals δ, so that l. h′ + r. h′ = l. h − δ + r. h. By equation (25.4), we have that l. h− δ+ r. h ≥ w( l, r) for any l ∈ FL and r ∈
R− FR, so that l. h′+ r. h′ ≥ w( l, r). For all other edges, we have l. h′ + r. h′ ≥
l. h+ r. h ≥ w( l, r). Thus, h′ is a feasible vertex labeling.
Now we show that each of the three desired properties holds:
1. If l ∈ FL and r ∈ FR, then we have l.h′ + r.h′ = l.h + r.h because δ
is subtracted from the label of l and added to the label of r.
Therefore, if an edge belongs to F for the directed graph GM,h, it also belongs to GM,h′.
2. We claim that at the time the Hungarian algorithm computes the
new feasible vertex labeling h′, for every edge ( l, r) ∈ M, we have l ∈ FL if and only if r ∈ FR. To see why, consider a matched vertex r and let ( l, r) ∈ M. First suppose that r ∈ FR, so that the search discovered r and enqueued it. When r was removed from
the queue, l was discovered, so l ∈ FL. Now suppose that r ∉
FR, so r is undiscovered. We will show that l ∉ FL. The only edge in GM,h that enters l is ( r, l), and since r is undiscovered, the search has not taken this edge; if l ∈ FL, it is not because of
the edge ( r, l). The only other way that a vertex in L can be in FL
is if it is a root of the search, but only unmatched vertices in L
are roots and l is matched. Thus, l ∉ FL, and the claim is proved.
We already saw that l ∈ FL and r ∈ FR implies l. h′ + r. h′ = l. h +
r.h. For the opposite case, when l ∈ L − FL and r ∈ R − FR, we have that l.h′ = l.h and r.h′ = r.h, so that again l.h′ + r.h′ = l.h +
r. h. Thus, if edge ( l, r) is in the matching M for the equality graph Gh, then ( r, l) ∈ EM,h′.
3. Let ( l, r) be an edge not in Eh such that l ∈ FL, r ∈ R − FR, and δ = l. h + r. h − w( l, r). By the definition of δ, there is at least one such edge. Then, we have
l. h′ + r. h′= l. h − δ + r. h
= l. h − ( l. h + r. h − w( l, r)) + r. h
= w( l, r),
and thus ( l, r) ∈ Eh′. Since ( l, r) is not in Eh, it is not in the matching M, so that in EM,h′ it must be directed from L to R.
Thus, ( l, r) ∈ EM,h′.
▪
It is possible for an edge to belong to EM,h but not to EM,h′. By Lemma 25.15, any such edge belongs neither to the matching M nor to
the breadth-first forest F at the time that the new feasible vertex labeling
h′ is computed. (See Exercise 25.3-3.)
Going back to Figure 25.6(d), the queue became empty before an M-
augmenting path was found. Figure 25.7 shows the next steps taken by the algorithm. The value of δ = 1 is achieved by the edge ( l 5, r 3) because
in Figure 25.4(a), l 5. h + r 3. h − w( l 5, r 3) = 6 + 0 − 5 = 1. In Figure
25.7(a), the values of l 3. h, l 4. h, l 5. h, and l 7. h have decreased by 1 and
the values of r 2. h and r 7. h have increased by 1 because these vertices are in F. As a result, the edges ( l 1, r 2) and ( l 6, r 7) leave GM,h and the edge ( l 5, r 3) enters. Figure 25.7(b) shows the new directed equality subgraph GM,h. With edge ( l 5, r 3) now in GM,h, Figure 25.7(c) shows that this edge is added to the breadth-first forest F, and r 3 is added to the queue.
Parts (c)–(f) show the breadth-first forest continuing to be built until in
part (f), the queue once again becomes empty after vertex l 2, which has
no edges leaving, is removed. Again, the algorithm must update the
feasible vertex labeling and the directed equality subgraph. Now the
value of δ = 1 is achieved by three edges: ( l 1, r 6), ( l 5, r 6), and ( l 7, r 6).
As Figure 25.8 shows in parts (a) and (b), these edges enter GM,h, and edge ( l 6, r 3) leaves. Part (c) shows that edge ( l 1, r 6) is added to the breadth-first forest. (Either of edges ( l 5, r 6) or ( l 7, r 6) could have been added instead.) Because r 6 is unmatched, the search has found the M-
augmenting path 〈( l 5, r 3), ( r 3, l 1), ( l 1, r 6)〉, highlighted in orange.
Figure 25.9(a) shows GM,h after the matching M has been updated by taking its symmetric difference with the M-augmenting path. The
Hungarian algorithm starts its last breadth-first search, with vertex l 7 as
the only root. The search proceeds as shown in parts (b)–(h) of the
figure, until the queue becomes empty after removing l 4. This time, we
find that δ = 2, achieved by the five edges ( l 2, r 5), ( l 3, r 1), ( l 4, r 5), ( l 5, r 1), and ( l 5, r 5), each of which enters GM,h. Figure 25.10(a) shows the
results of decreasing the feasible vertex label of each vertex in FL by 2
and increasing the feasible vertex label of each vertex in F R by 2, and
Figure 25.10(b) shows the resulting directed equality subgraph GM,h.
Part (c) shows that edge ( l 3, r 1) is added to the breadth-first forest.
Since r 1 is an unmatched vertex, the search terminates, having found the
M-augmenting path 〈( l 7, r 7), ( r 7, l 3), ( l 3, r 1)〉, highlighted in orange. If r 1 had been matched, vertex r 5 would also have been added to the breadth-first forest, with any of l 2, l 4, or l 5 as its parent.
Figure 25.7 Updating the feasible vertex labeling and the directed equality subgraph GM,h when the queue becomes empty before finding an M-augmenting path. (a) With δ = 1, the values of l 3. h, l 4. h, l 5. h, and l 7. h decreased by 1 and r 2. h and r 7. h increased by 1. Edges ( l 1, r 2) and ( l 6, r 7) leave GM,h, and edge ( l 5, r 3) enters. These changes are highlighted in yellow. (b) The resulting directed equality subgraph GM,h. (c)–(f) With edge ( l 5, r 3) added to the breadth-first forest and r 3 added to the queue, the breadth-first search continues until the queue once again becomes empty in part (f).
After updating the matching M, the algorithm arrives at the perfect
matching shown for the equality subgraph Gh in Figure 25.11. By
Theorem 25.14, the edges in M form an optimal solution to the original
assignment problem given in the matrix. Here, the weights of edges ( l 1,
r 6), ( l 2, r 4), ( l 3, r 1), ( l 4, r 2), ( l 5, r 3), ( l 6, r 5), and ( l 7, r 7) sum to 65, which is the maximum weight of any matching.
The weight of the maximum-weight matching equals the sum of all
the feasible vertex labels. These problems—maximizing the weight of a
matching and minimizing the sum of the feasible vertex labels—are
“duals” of each other, in a similar vein to how the value of a maximum
flow equals the capacity of a minimum cut. Section 29.3 explores duality in more depth.
Figure 25.8 Another update to the feasible vertex labeling and directed equality subgraph GM,h because the queue became empty before finding an M-augmenting path. (a) With δ = 1, the values of l 1. h, l 2. h, l 3. h, l 4. h, l 5. h, and l 7. h decrease by 1, and r 2. h, r 3. h, r 4. h, and r 7. h increase by 1. Edge ( l 6, r 3) leaves GM,h, and edges ( l 1, r 6), ( l 5, r 6) and ( l 7, r 6) enter. (b) The resulting directed equality subgraph GM,h. (c) With edge ( l 1, r 6) added to the breadth-first forest and r 6
unmatched, the search terminates, having found the M-augmenting path 〈( l 5, r 3), ( r 3, l 1), ( l 1, r 6)〉, highlighted in orange in parts (b) and (c).
The procedure HUNGARIAN on page 737 and its subroutine FIND-
AUGMENTING-PATH on page 738 follow the steps we have just seen.
The third property in Lemma 25.15 ensures that in line 23 of FIND-
AUGMENTING-PATH the queue Q is nonempty. The pseudocode
uses the attribute π to indicate predecessor vertices in the breadth-first
forest. Instead of coloring vertices, as in the BFS procedure on page
556, the search puts the discovered vertices into the sets FL and FR.
Because the Hungarian algorithm does not need breadth-first distances,
the pseudocode omits the d attribute computed by the BFS procedure.
Figure 25.9 (a) The new matching M and the new directed equality subgraph GM,h after updating the matching in Figure 25.8 with the M-augmenting path in Figure 25.8 parts (b) and (c). (b)–(h) Successive versions of the breadth-first forest F in a new breadth-first search with root l 7. After the vertex l 4 in part (h) has been removed from the queue, the queue becomes empty before the search discovers an unmatched vertex in R.
Now, let’s see why the Hungarian algorithm runs in O(n⁴) time, where n = |V|/2 and |E| = n² in the original graph G. (Below we outline how to reduce the running time to O(n³).) You can go through the pseudocode of HUNGARIAN to verify that lines 1–6 and 11 take
O(n²) time. The while loop of lines 7–10 iterates at most n times, since each iteration increases the size of the matching M by 1. Each test in
line 7 can take constant time by just checking whether |M| < n, each update of M in line 9 takes O(n) time, and the updates in line 10 take O(n²) time.
To achieve the O(n⁴) time bound, it remains to show that each call of
FIND-AUGMENTING-PATH runs in O(n³) time. Let’s call each
execution of lines 10–22 a growth step. Ignoring the growth steps, you
can verify that FIND-AUGMENTING-PATH is a breadth-first search.
With the sets FL and FR represented appropriately, the breadth-first search takes O(V + E) = O(n²) time. Within a call of FIND-AUGMENTING-PATH, at most n growth steps can occur, since each
growth step is guaranteed to discover at least one vertex in R. Since there are at most n² edges in GM,h, the for loop of lines 16–22 iterates at most n² times per call of FIND-AUGMENTING-PATH. The
bottleneck is lines 10 and 15, which take O(n²) time per growth step, so that FIND-AUGMENTING-PATH takes O(n³) time.
Figure 25.10 Updating the feasible vertex labeling and directed equality subgraph GM,h. (a) Here, δ = 2, so the values of l 1. h, l 2. h, l 3. h, l 4. h, l 5. h, and l 7. h decreased by 2, and the values of r 2. h, r 3. h, r 4. h, r 6. h, and r 7. h increased by 2. Edges ( l 2, r 5), ( l 3, r 1), ( l 4, r 5), ( l 5, r 1), and ( l 5, r 5) enter GM,h. (b) The resulting directed graph GM,h. (c) With edge ( l 3, r 1) added to the breadth-first forest and r 1 unmatched, the search terminates, having found the M-augmenting path 〈( l 7, r 7), ( r 7, l 3), ( l 3, r 1)〉, highlighted in orange in parts (b) and (c).
Exercise 25.3-5 asks you to show that reconstructing the directed
equality subgraph GM,h in line 15 is actually unnecessary, so that its
cost can be eliminated. Reducing the cost of computing δ in line 10 to
O(n) takes a little more effort and is the subject of Problem 25-2. With these changes, each call of FIND-AUGMENTING-PATH takes O(n²)
time, so that the Hungarian algorithm runs in O(n³) time.
Figure 25.11 The final matching, shown for the equality subgraph Gh with blue edges and blue entries in the matrix. The weights of the edges in the matching sum to 65, which is the maximum for any matching in the original complete bipartite graph G, as well as the sum of all the final feasible vertex labels.
HUNGARIAN(G)
 1  for each vertex l ∈ L
 2      l.h = max {w(l, r) : r ∈ R}    // from equation (25.1)
 3  for each vertex r ∈ R
 4      r.h = 0                         // from equation (25.2)
 5  let M be any matching in Gh (such as the matching returned by
        GREEDY-BIPARTITE-MATCHING)
 6  from G, M, and h, form the equality subgraph Gh
        and the directed equality subgraph GM,h
 7  while M is not a perfect matching in Gh
 8      P = FIND-AUGMENTING-PATH(GM,h)
 9      M = M ⊕ P
10      update the equality subgraph Gh
            and the directed equality subgraph GM,h
11  return M
FIND-AUGMENTING-PATH(GM,h)
 1  Q = Ø
 2  FL = Ø
 3  FR = Ø
 4  for each unmatched vertex l ∈ L
 5      l.π = NIL
 6      ENQUEUE(Q, l)
 7      FL = FL ∪ {l}              // forest F starts with unmatched vertices in L
 8  repeat
 9      if Q is empty              // ran out of vertices to search from?
10          δ = min {l.h + r.h − w(l, r) : l ∈ FL and r ∈ R − FR}
11          for each vertex l ∈ FL
12              l.h = l.h − δ      // relabel according to equation (25.5)
13          for each vertex r ∈ FR
14              r.h = r.h + δ      // relabel according to equation (25.5)
15          from G, M, and h, form a new directed equality graph GM,h
16          for each new edge (l, r) in GM,h    // continue search with new edges
17              if r ∉ FR
18                  r.π = l        // discover r, add it to F
19                  if r is unmatched
20                      an M-augmenting path has been found
                            (exit the repeat loop)
21                  else ENQUEUE(Q, r)          // can search from r later
22                  FR = FR ∪ {r}
23      u = DEQUEUE(Q)             // search from u
24      for each neighbor v of u in GM,h
25          if v ∈ L
26              v.π = u
27              FL = FL ∪ {v}      // discover v, add it to F
28              ENQUEUE(Q, v)      // can search from v later
29          elseif v ∉ FR          // v ∈ R, do same as lines 18–22
30              v.π = u
31              if v is unmatched
32                  an M-augmenting path has been found
                        (exit the repeat loop)
33              else ENQUEUE(Q, v)
34              FR = FR ∪ {v}
35  until an M-augmenting path has been found
36  using the predecessor attributes π, construct an M-augmenting path
        P by tracing back from the unmatched vertex in R
37  return P
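Putting the pieces together, here is a compact Python sketch of the whole algorithm (ours, not a line-by-line transcription of the pseudocode: it grows one alternating tree per unmatched vertex of L and folds the relabeling of lines 10–14 into a slack array, in the spirit of Problem 25-2):

```python
def hungarian(w):
    """Maximum-weight perfect matching on a complete bipartite graph.
    w is an n-by-n matrix with w[i][j] the weight of edge (l_i, r_j).
    Returns match_l, where match_l[i] = j means l_i is matched to r_j."""
    n = len(w)
    hl = [max(row) for row in w]      # default labeling (25.1)
    hr = [0] * n                      # default labeling (25.2)
    match_l = [None] * n
    match_r = [None] * n
    for l0 in range(n):               # one augmentation per root
        in_fl = [False] * n
        in_fl[l0] = True
        in_fr = [False] * n
        parent_r = [None] * n         # predecessor of r_j in the tree
        # slack[j] = min over l in FL of l.h + r_j.h - w(l, r_j)
        slack = [hl[l0] + hr[j] - w[l0][j] for j in range(n)]
        slack_l = [l0] * n            # which l attains slack[j]
        while True:
            delta = min(slack[j] for j in range(n) if not in_fr[j])
            if delta > 0:             # stuck: relabel as in (25.4)/(25.5)
                for i in range(n):
                    if in_fl[i]:
                        hl[i] -= delta
                for j in range(n):
                    if in_fr[j]:
                        hr[j] += delta
                    else:
                        slack[j] -= delta
            # some equality edge now reaches an undiscovered r
            j = next(j for j in range(n) if not in_fr[j] and slack[j] == 0)
            parent_r[j] = slack_l[j]
            if match_r[j] is None:    # unmatched: augment along the path
                while j is not None:
                    i = parent_r[j]
                    j_next = match_l[i]
                    match_l[i], match_r[j] = j, i
                    j = j_next
                break
            in_fr[j] = True           # matched: bring its partner into FL
            i = match_r[j]
            in_fl[i] = True
            for jj in range(n):       # update slacks with the new l
                s = hl[i] + hr[jj] - w[i][jj]
                if not in_fr[jj] and s < slack[jj]:
                    slack[jj] = s
                    slack_l[jj] = i
    return match_l
```

For small n, the result can be checked against brute force over all n! permutations.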
Exercises
25.3-1
The FIND-AUGMENTING-PATH procedure checks in two places
(lines 19 and 31) whether a vertex it discovers in R is unmatched. Show
how to rewrite the pseudocode so that it checks for an unmatched
vertex in R in only one place. What is the downside of doing so?
25.3-2
Show that for any bipartite graph, the GREEDY-BIPARTITE-
MATCHING procedure on page 726 returns a matching at least half
the size of a maximum matching.
25.3-3
Show that if an edge (l, r) belongs to the directed equality subgraph GM,h but is not a member of GM,h′, where h′ is given by equation (25.5), then l ∈ L − FL and r ∈ FR at the time that h′ is computed.
25.3-4
At line 29 in the FIND-AUGMENTING-PATH procedure, it has
already been established that v ∈ R. This line checks to see whether v is already discovered by testing whether v ∈ FR. Why doesn’t the procedure need to check whether v is already discovered for the case when v ∈ L, in lines 26–28?
25.3-5
Professor Hrabosky asserts that the directed equality subgraph GM,h
must be constructed and maintained by the Hungarian algorithm, so
that line 6 of HUNGARIAN and line 15 of FIND-AUGMENTING-
PATH are required. Argue that the professor is incorrect by showing
how to determine whether an edge belongs to EM,h without explicitly
constructing GM,h.
25.3-6
How can you modify the Hungarian algorithm to find a matching of
vertices in L to vertices in R that minimizes, rather than maximizes, the sum of the edge weights in the matching?
25.3-7
How can an assignment problem with | L| ≠ | R| be modified so that the
Hungarian algorithm solves it?
Problems
25-1 Perfect matchings in a regular bipartite graph
a. Problem 20-3 asked about Euler tours in directed graphs. Prove that a
connected, undirected graph G = ( V, E) has an Euler tour—a cycle
traversing each edge exactly once, though it may visit a vertex multiple times—if and only if the degree of every vertex in V is even.
b. Assuming that G is connected, undirected, and every vertex in V has even degree, give an O( E)-time algorithm to find an Euler tour of G, as in Problem 20-3(b).
c. Exercise 25.1-6 states that if G = ( V, E) is a d-regular bipartite graph, then it contains d disjoint perfect matchings. Suppose that d is an
exact power of 2. Give an algorithm to find all d disjoint perfect
matchings in a d-regular bipartite graph in Θ( E lg d) time.
25-2 Reducing the running time of the Hungarian algorithm to O(n³)
In this problem, you will show how to reduce the running time of the
Hungarian algorithm from O(n⁴) to O(n³) by showing how to reduce the running time of the FIND-AUGMENTING-PATH procedure
from O(n³) to O(n²). Exercise 25.3-5 demonstrates that line 6 of HUNGARIAN and line 15 of FIND-AUGMENTING-PATH are
unnecessary. Now you will show how to reduce the running time of each
execution of line 10 in FIND-AUGMENTING-PATH to O(n).
For each vertex r ∈ R − FR, define a new attribute r.σ, where

r.σ = min {l.h + r.h − w(l, r) : l ∈ FL}.

That is, r.σ indicates how close r is to being adjacent to some vertex l ∈
FL in the directed equality subgraph GM,h. Initially, before placing any vertices into FL, set r.σ to ∞ for all r ∈ R.
a. Show how to compute δ in line 10 in O( n) time, based on the σ
attribute.
b. Show how to update all the σ attributes in O( n) time after δ has been computed.
c. Show that updating all the σ attributes when FL changes takes O( n 2) time per call of FIND-AUGMENTING-PATH.
d. Conclude that the HUNGARIAN procedure can be implemented to run in O( n 3) time.
25-3 Other matching problems
The Hungarian algorithm finds a maximum-weight perfect matching in
a complete bipartite graph. It is possible to use the Hungarian
algorithm to solve problems in other graphs by modifying the input
graph, running the Hungarian algorithm, and then possibly modifying
the output. Show how to solve the following matching problems in this
manner.
a. Give an algorithm to find a maximum-weight matching in a weighted
bipartite graph that is not necessarily complete and with all edge
weights positive.
b. Redo part (a), but with edge weights allowed to also be 0 or negative.
c. A cycle cover in a directed graph, not necessarily bipartite, is a set of edge-disjoint directed cycles such that each vertex lies on at most one
cycle. Given nonnegative edge weights w( u, v), let C be the set of edges in a cycle cover, and define w( C) = ∑( u,v)∈ C w( u, v) to be the weight of the cycle cover. Give an algorithm to find a maximum-weight cycle
cover.
25-4 Fractional matchings
It is possible to define a fractional matching. Given a graph G = ( V, E), we define a fractional matching x as a function x : E → [0, 1] (real numbers between 0 and 1, inclusive) such that for every vertex u ∈ V,
we have ∑( u,v)∈ E x( u, v) ≤ 1. The value of a fractional matching is ∑( u, v)∈ E x( u, v). The definition of a fractional matching is identical to that of a matching, except that a matching has the additional constraint that
x( u, v) ∈ {0, 1} for all edges ( u, v) ∈ E. Given a graph, we let M*
denote a maximum matching and x* denote a fractional matching with
maximum value.
a. Argue that, for any bipartite graph, we must have ∑( u, v)∈ E x*( u, v) ≥
| M*|.
b. Prove that, for any bipartite graph, we must have ∑(u,v)∈E x*(u, v) ≤
|M*|. (Hint: Give an algorithm that converts a fractional matching
with an integer value to a matching.) Conclude that the maximum
value of a fractional matching in a bipartite graph is the same as the
size of the maximum cardinality matching.
c. We can define a fractional matching in a weighted graph in the same
manner: the value of the matching is now ∑( u, v)∈ E w( u, v) x( u, v).
Extend the results of the previous parts to show that in a weighted
bipartite graph, the maximum value of a weighted fractional matching
is equal to the value of a maximum weighted matching.
d. In a general graph, the analogous results do not necessarily hold.
Give an example of a small graph that is not bipartite for which the
fractional matching with maximum value is not a maximum
matching.
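One classic such example, sketched here in Python under the assumption of unit edge weights, is the triangle K_3: setting x(u, v) = 1/2 on every edge gives a valid fractional matching of value 3/2, while any two triangle edges share a vertex, so an integral matching contains at most one edge:

```python
from itertools import combinations

# Triangle K3: three vertices, all pairs joined by an edge.
vertices = [0, 1, 2]
edges = list(combinations(vertices, 2))   # [(0, 1), (0, 2), (1, 2)]

# Fractional matching: x(e) = 1/2 on every edge.
x = {e: 0.5 for e in edges}

# Degree constraint: the sum of x over the edges at each vertex is at most 1.
for u in vertices:
    assert sum(val for e, val in x.items() if u in e) <= 1

fractional_value = sum(x.values())        # 3/2

# Maximum integral matching by brute force: a subset of edges is a
# matching when no vertex is repeated.
def is_matching(subset):
    used = [v for e in subset for v in e]
    return len(used) == len(set(used))

max_matching_size = max(len(s) for r in range(len(edges) + 1)
                        for s in combinations(edges, r) if is_matching(s))

print(fractional_value, max_matching_size)   # 1.5 1
```

Here the fractional optimum (3/2) strictly exceeds the integral optimum (1), which cannot happen in a bipartite graph by parts (a) and (b).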
25-5 Computing vertex labels
You are given a complete bipartite graph G = (V, E) with edge weights w(l, r) for all (l, r) ∈ E. You are also given a maximum-weight perfect matching M* for G. You wish to compute a feasible vertex labeling h such that M* is a perfect matching in the equality subgraph G_h. That is, you want to compute a labeling h of vertices such that

l.h + r.h ≥ w(l, r) for all (l, r) ∈ E,    (25.6)
l.h + r.h = w(l, r) for all (l, r) ∈ M*.   (25.7)

(Requirement (25.6) holds for all edges, and the stronger requirement (25.7) holds for all edges in M*.) Give an algorithm to compute the feasible vertex labeling h, and prove that it is correct. (Hint: Use the similarity between conditions (25.6) and (25.7) and some of the properties of shortest paths proved in Chapter 22, in particular the triangle inequality (Lemma 22.10) and the convergence property (Lemma 22.14).)

Chapter notes
Matching algorithms have a long history and have been central to many
breakthroughs in algorithm design and analysis. The book by Lovász
and Plummer [306] is an excellent reference on matching problems, and the chapter on matching in the book by Ahuja, Magnanti and Orlin [10]
also has extensive references.
The Hopcroft-Karp algorithm is by Hopcroft and Karp [224].
Madry [308] gave an Õ(E^{10/7})-time algorithm, which is asymptotically faster than Hopcroft-Karp for sparse graphs.
Corollary 25.4 is due to Berge [53], and it also holds in graphs that are not bipartite. Matching in general graphs requires more complicated
algorithms. The first polynomial-time algorithm, running in O( V 4) time, is due to Edmonds [130] (in a paper that also introduced the notion of a polynomial-time algorithm). Like the bipartite case, this
algorithm also uses augmenting paths, although the algorithm for
finding augmenting paths in general graphs is more involved than the
one for bipartite graphs. Subsequently, several O(√V E)-time algorithms appeared, including ones by Gabow and Tarjan [168] as part of an algorithm for weighted matching and a simpler one by Gabow [164].
The Hungarian algorithm is described in the book by Bondy and
Murty [67] and is based on work by Kuhn [273] and Munkres [337].
Kuhn adopted the name “Hungarian algorithm” because the algorithm derived from work by the Hungarian mathematicians D. Kőnig and J. Egerváry. The algorithm is an early example of a primal-dual algorithm.
A faster algorithm that runs in O(√V E lg(VW)) time, where the edge weights are integers from 0 to W, was given by Gabow and Tarjan [167], and an algorithm with the same time bound for maximum-weight matching in general graphs was given by Duan, Pettie, and Su [127].
The stable-marriage problem was first defined and analyzed by Gale
and Shapley [169]. The stable-marriage problem has numerous variants.
The books by Gusfield and Irving [203], Knuth [266], and Manlove
[313] serve as excellent sources for cataloging and solving them.
1 The definition of a complete bipartite graph differs from the definition of complete graph given on page 1167 because in a bipartite graph, there are no edges between vertices in L and no edges between vertices in R.
2 Although marriage norms are changing, it’s traditional to view the stable-marriage problem through the lens of heterosexual marriage.
This part contains a selection of algorithmic topics that extend and
complement earlier material in this book. Some chapters introduce new
models of computation such as circuits or parallel computers. Others
cover specialized domains such as matrices or number theory. The last
two chapters discuss some of the known limitations to the design of
efficient algorithms and introduce techniques for coping with those
limitations.
Chapter 26 presents an algorithmic model for parallel computing based on task-parallel computing, and more specifically, fork-join
parallelism. The chapter introduces the basics of the model, showing
how to quantify parallelism in terms of the measures of work and span.
It then investigates several interesting fork-join algorithms, including
algorithms for matrix multiplication and merge sorting.
An algorithm that receives its input over time, rather than having the
entire input available at the start, is called an “online” algorithm.
Chapter 27 examines techniques used in online algorithms, starting with the “toy” problem of how long to wait for an elevator before taking the
stairs. It then studies the “move-to-front” heuristic for maintaining a
linked list and finishes with the online version of the caching problem
we saw back in Section 15.4. The analyses of these online algorithms are remarkable in that they prove that these algorithms, which do not know
their future inputs, perform within a constant factor of optimal
algorithms that know the future inputs.
Chapter 28 studies efficient algorithms for operating on matrices. It presents two general methods—LU decomposition and LUP
decomposition—for solving linear equations by Gaussian elimination in
O( n 3) time. It also shows that matrix inversion and matrix
multiplication can be performed equally fast. The chapter concludes by
showing how to compute a least-squares approximate solution when a
set of linear equations has no exact solution.
Chapter 29 studies how to model problems as linear programs, where the goal is to maximize or minimize an objective, given limited resources
and competing constraints. Linear programming arises in a variety of
practical application areas. The chapter also addresses the concept of “duality,” which, by establishing that a maximization problem and a minimization problem have the same objective value, helps to show that
solutions to each are optimal.
Chapter 30 studies operations on polynomials and shows how to use
a well-known signal-processing technique—the fast Fourier transform
(FFT)—to multiply two degree- n polynomials in O( n lg n) time. It also derives a parallel circuit to compute the FFT.
Chapter 31 presents number-theoretic algorithms. After reviewing elementary number theory, it presents Euclid’s algorithm for computing
greatest common divisors. Next, it studies algorithms for solving
modular linear equations and for raising one number to a power
modulo another number. Then, it explores an important application of
number-theoretic algorithms: the RSA public-key cryptosystem. This
cryptosystem can be used not only to encrypt messages so that an
adversary cannot read them, but also to provide digital signatures. The
chapter finishes with the Miller-Rabin randomized primality test, which
enables finding large primes efficiently—an essential requirement for the
RSA system.
Chapter 32 studies the problem of finding all occurrences of a given pattern string in a given text string, a problem that arises frequently in
text-editing programs. After examining the naive approach, the chapter
presents an elegant approach due to Rabin and Karp. Then, after
showing an efficient solution based on finite automata, the chapter
presents the Knuth-Morris-Pratt algorithm, which modifies the
automaton-based algorithm to save space by cleverly preprocessing the
pattern. The chapter finishes by studying suffix arrays, which can not
only find a pattern in a text string, but can do quite a bit more, such as
finding the longest repeated substring in a text and finding the longest
common substring appearing in two texts.
Chapter 33 examines three algorithms within the expansive field of machine learning. Machine-learning algorithms are designed to take in
vast amounts of data, devise hypotheses about patterns in the data, and
test these hypotheses. The chapter starts with k-means clustering, which
groups data elements into k classes based on how similar they are to each other. It then shows how to use the technique of multiplicative
weights to make predictions accurately based on a set of “experts” of
varying quality. Perhaps surprisingly, even without knowing which
experts are reliable and which are not, you can predict almost as
accurately as the most reliable expert. The chapter finishes with gradient
descent, an optimization technique that finds a local minimum value for
a function. Gradient descent has many applications, including finding
parameter settings for many machine-learning models.
Chapter 34 concerns NP-complete problems. Many interesting
computational problems are NP-complete, but no polynomial-time
algorithm is known for solving any of them. This chapter presents
techniques for determining when a problem is NP-complete, using them
to prove several classic problems NP-complete: determining whether a
graph has a hamiltonian cycle (a cycle that includes every vertex),
determining whether a boolean formula is satisfiable (whether there
exists an assignment of boolean values to its variables that causes the
formula to evaluate to TRUE), and determining whether a given set of
numbers has a subset that adds up to a given target value. The chapter
also proves that the famous traveling-salesperson problem (find a
shortest route that starts and ends at the same location and visits each
of a set of locations once) is NP-complete.
Chapter 35 shows how to find approximate solutions to NP-
complete problems efficiently by using approximation algorithms. For
some NP-complete problems, approximate solutions that are near
optimal are quite easy to produce, but for others even the best
approximation algorithms known work progressively more poorly as the
problem size increases. Then, there are some problems for which investing increasing amounts of computation time yields increasingly
better approximate solutions. This chapter illustrates these possibilities
with the vertex-cover problem (unweighted and weighted versions), an
optimization version of 3-CNF satisfiability, the traveling-salesperson
problem, the set-covering problem, and the subset-sum problem.
The vast majority of algorithms in this book are serial algorithms
suitable for running on a uniprocessor computer that executes only one
instruction at a time. This chapter extends our algorithmic model to
encompass parallel algorithms, where multiple instructions can execute
simultaneously. Specifically, we’ll explore the elegant model of task-
parallel algorithms, which are amenable to algorithmic design and
analysis. Our study focuses on fork-join parallel algorithms, the most
basic and best understood kind of task-parallel algorithm. Fork-join
parallel algorithms can be expressed cleanly using simple linguistic
extensions to ordinary serial code. Moreover, they can be implemented
efficiently in practice.
Parallel computers—computers with multiple processing units—are
ubiquitous. Handheld, laptop, desktop, and cloud machines are all
multicore computers, or simply, multicores, containing multiple
processing “cores.” Each processing core is a full-fledged processor that
can directly access any location in a common shared memory.
Multicores can be aggregated into larger systems, such as clusters, by
using a network to interconnect them. These multicore clusters usually
have a distributed memory, where one multicore’s memory cannot be
accessed directly by a processor in another multicore. Instead, the
processor must explicitly send a message over the cluster network to a
processor in the remote multicore to request any data it requires. The
most powerful clusters are supercomputers, comprising many thousands
of multicores. But since shared-memory programming tends to be
conceptually easier than distributed-memory programming, and
multicore machines are widely available, this chapter focuses on parallel algorithms for multicores.
One approach to programming multicores is thread parallelism. This
processor-centric parallel-programming model employs a software
abstraction of “virtual processors,” or threads that share a common
memory. Each thread maintains its own program counter and can
execute code independently of the other threads. The operating system
loads a thread onto a processing core for execution and switches it out
when another thread needs to run.
Unfortunately, programming a shared-memory parallel computer
using threads tends to be difficult and error-prone. One reason is that it
can be complicated to dynamically partition the work among the
threads so that each thread receives approximately the same load. For
any but the simplest of applications, the programmer must use complex
communication protocols to implement a scheduler that load-balances
the work.
Task-parallel programming
The difficulty of thread programming has led to the creation of task-
parallel platforms, which provide a layer of software on top of threads
to coordinate, schedule, and manage the processors of a multicore.
Some task-parallel platforms are built as runtime libraries, but others
provide full-fledged parallel languages with compiler and runtime
support.
Task-parallel programming allows parallelism to be specified in a
“processor-oblivious” fashion, where the programmer identifies what
computational tasks may run in parallel but does not indicate which
thread or processor performs the task. Thus, the programmer is freed
from worrying about communication protocols, load balancing, and
other vagaries of thread programming. The task-parallel platform
contains a scheduler, which automatically load-balances the tasks across
the processors, thereby greatly simplifying the programmer’s chore.
Task-parallel algorithms provide a natural extension to ordinary serial
algorithms, allowing performance to be reasoned about mathematically
using “work/span analysis.”
Although the functionality of task-parallel environments is still evolving
and increasing, almost all support fork-join parallelism, which is
typically embodied in two linguistic features: spawning and parallel loops. Spawning allows a subroutine to be “forked”: executed like a
subroutine call, except that the caller can continue to execute while the
spawned subroutine computes its result. A parallel loop is like an
ordinary for loop, except that multiple iterations of the loop can execute
at the same time.
Fork-join parallel algorithms employ spawning and parallel loops to
describe parallelism. A key aspect of this parallel model, inherited from
the task-parallel model but different from the thread model, is that the
programmer does not specify which tasks in a computation must run in
parallel, only which tasks may run in parallel. The underlying runtime
system uses threads to load-balance the tasks across the processors. This
chapter investigates parallel algorithms described in the fork-join
model, as well as how the underlying runtime system can schedule task-
parallel computations (which include fork-join computations)
efficiently.
Fork-join parallelism offers several important advantages:
The fork-join programming model is a simple extension of the
familiar serial programming model used in most of this book. To
describe a fork-join parallel algorithm, the pseudocode in this
book needs just three added keywords: parallel, spawn, and sync.
Deleting these parallel keywords from the parallel pseudocode
results in ordinary serial pseudocode for the same problem, which
we call the “serial projection” of the parallel algorithm.
The underlying task-parallel model provides a theoretically clean
way to quantify parallelism based on the notions of “work” and
“span.”
Spawning allows many divide-and-conquer algorithms to be
parallelized naturally. Moreover, just as serial divide-and-conquer
algorithms lend themselves to analysis using recurrences, so do
parallel algorithms in the fork-join model.
The fork-join programming model is faithful to how multicore
programming has been evolving in practice. A growing number of
multicore environments support one variant or another of fork-
join parallel programming, including Cilk [290, 291, 383, 396], Habanero-Java [466], the Java Fork-Join Framework [279], OpenMP [81], Task Parallel Library [289], Threading Building Blocks [376], and X10 [82].
Section 26.1 introduces parallel pseudocode, shows how the
execution of a task-parallel computation can be modeled as a directed
acyclic graph, and presents the metrics of work, span, and parallelism,
which you can use to analyze parallel algorithms. Section 26.2
investigates how to multiply matrices in parallel, and Section 26.3
tackles the tougher problem of designing an efficient parallel merge sort.
26.1 The basics of fork-join parallelism
Our exploration of parallel programming begins with the problem of
computing Fibonacci numbers recursively in parallel. We’ll look at a
straightforward serial Fibonacci calculation, which, although inefficient,
serves as a good illustration of how to express parallelism in
pseudocode.
Recall that the Fibonacci numbers are defined by equation (3.31) on
page 69:
To calculate the n th Fibonacci number recursively, you could use the ordinary serial algorithm in the procedure FIB on the facing page. You
would not really want to compute large Fibonacci numbers this way,
because this computation does needless repeated work, but parallelizing
it can be instructive.
FIB(n)
1  if n ≤ 1
2      return n
3  else x = FIB(n − 1)
4      y = FIB(n − 2)
5      return x + y
To analyze this algorithm, let T(n) denote the running time of FIB(n). Since FIB(n) contains two recursive calls plus a constant amount of extra work, we obtain the recurrence

T(n) = T(n − 1) + T(n − 2) + Θ(1).
This recurrence has solution T(n) = Θ(F_n), which we can establish by using the substitution method (see Section 4.3). To show that T(n) = O(F_n), we’ll adopt the inductive hypothesis that T(n) ≤ aF_n − b, where a > 1 and b > 0 are constants. Substituting, we obtain

T(n) ≤ (aF_{n−1} − b) + (aF_{n−2} − b) + Θ(1)
     = a(F_{n−1} + F_{n−2}) − 2b + Θ(1)
     = aF_n − 2b + Θ(1)
     ≤ aF_n − b,
if we choose b large enough to dominate the upper-bound constant in the Θ(1) term. We can then choose a large enough to upper-bound the Θ(1) base case for small n. To show that T(n) = Ω(F_n), we use the inductive hypothesis T(n) ≥ aF_n − b. Substituting and following reasoning similar to the asymptotic upper-bound argument, we establish this hypothesis by choosing b smaller than the lower-bound constant in the Θ(1) term and a small enough to lower-bound the Θ(1) base case for small n. Theorem 3.1 on page 56 then establishes that T(n) = Θ(F_n), as desired. Since F_n = Θ(φ^n), where φ = (1 + √5)/2 is the golden ratio, by equation (3.34) on page 69, it follows that

T(n) = Θ(φ^n).

Thus this procedure is a particularly slow way to compute Fibonacci numbers, since it runs in exponential time. (See Problem 31-3 on page 954 for faster ways.)
Let’s see why the algorithm is inefficient. Figure 26.1 shows the tree of recursive procedure instances created when computing F_6 with the FIB procedure. The call to FIB(6) recursively calls FIB(5) and then FIB(4). But the call to FIB(5) also results in a call to FIB(4). Both instances of FIB(4) return the same result (F_4 = 3). Since the FIB
procedure does not memoize (recall the definition of “memoize” from
page 368), the second call to FIB(4) replicates the work that the first call
performs, which is wasteful.
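The repeated work is easy to quantify: the number of procedure instances c(n) obeys the same recurrence as the running time, c(n) = c(n − 1) + c(n − 2) + 1, so it grows like F_n. A small Python sketch (the helper names here are ours, not the book’s):

```python
from functools import lru_cache

def fib_counted(n):
    """Naive recursive Fibonacci; returns (value, number of procedure instances)."""
    if n <= 1:
        return n, 1
    x, cx = fib_counted(n - 1)
    y, cy = fib_counted(n - 2)
    return x + y, cx + cy + 1

value, calls = fib_counted(6)
print(value, calls)   # 8 25: the invocation tree of Figure 26.1 has 25 nodes

# With memoization, only the n + 1 distinct instances are ever computed.
@lru_cache(maxsize=None)
def fib_memo(n):
    return n if n <= 1 else fib_memo(n - 1) + fib_memo(n - 2)

print(fib_memo(6))    # 8
```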
Figure 26.1 The invocation tree for FIB(6). Each node in the tree represents a procedure instance whose children are the procedure instances it calls during its execution. Since each instance of FIB with the same argument does the same work to produce the same result, the inefficiency of this algorithm for computing the Fibonacci numbers can be seen by the vast number of repeated calls to compute the same thing. The portion of the tree shaded blue appears in task-parallel form in Figure 26.2.
Although the FIB procedure is a poor way to compute Fibonacci
numbers, it can help us warm up to parallelism concepts. Perhaps the
most basic concept to understand is that if two parallel tasks operate
on entirely different data, then—absent other interference—they each
produce the same outcomes when executed at the same time as when
they run serially one after the other. Within FIB ( n), for example, the
two recursive calls in line 3 to FIB ( n − 1) and in line 4 to FIB ( n − 2)
can safely execute in parallel because the computation performed by one in no way affects the other.
Parallel keywords
The P-FIB procedure on the next page computes Fibonacci numbers,
but using the parallel keywords spawn and sync to indicate parallelism in
the pseudocode.
If the keywords spawn and sync are deleted from P-FIB, the resulting
pseudocode text is identical to FIB (other than renaming the procedure
in the header and in the two recursive calls). We define the serial
projection 1 of a parallel algorithm to be the serial algorithm that results from ignoring the parallel directives, which in this case can be done by
omitting the keywords spawn and sync. For parallel for loops, which
we’ll see later on, we omit the keyword parallel. Indeed, our parallel
pseudocode possesses the elegant property that its serial projection is
always ordinary serial pseudocode to solve the same problem.
P-FIB(n)
1  if n ≤ 1
2      return n
3  else x = spawn P-FIB(n − 1)  // don’t wait for subroutine to return
4      y = P-FIB(n − 2)         // in parallel with spawned subroutine
5      sync                     // wait for spawned subroutine to finish
6      return x + y
Semantics of parallel keywords
Spawning occurs when the keyword spawn precedes a procedure call, as
in line 3 of P-FIB. The semantics of a spawn differs from an ordinary
procedure call in that the procedure instance that executes the spawn—
the parent—may continue to execute in parallel with the spawned
subroutine—its child—instead of waiting for the child to finish, as
would happen in a serial execution. In this case, while the spawned child
is computing P-FIB ( n − 1), the parent may go on to compute P-FIB
( n−2) in line 4 in parallel with the spawned child. Since the P-FIB
procedure is recursive, these two subroutine calls themselves create nested parallelism, as do their children, thereby creating a potentially
vast tree of subcomputations, all executing in parallel.
The keyword spawn does not say, however, that a procedure must
execute in parallel with its spawned children, only that it may. The parallel keywords express the logical parallelism of the computation,
indicating which parts of the computation may proceed in parallel. At
runtime, it is up to a scheduler to determine which subcomputations
actually run in parallel by assigning them to available processors as the
computation unfolds. We’ll discuss the theory behind task-parallel
schedulers shortly (on page 759).
A procedure cannot safely use the values returned by its spawned
children until after it executes a sync statement, as in line 5. The
keyword sync indicates that the procedure must wait as necessary for all
its spawned children to finish before proceeding to the statement after
the sync—the “join” of a fork-join parallel computation. The P-FIB
procedure requires a sync before the return statement in line 6 to avoid
the anomaly that would occur if x and y were summed before P-FIB ( n
− 1) had finished and its return value had been assigned to x. In
addition to explicit join synchronization provided by the sync statement,
it is convenient to assume that every procedure executes a sync implicitly
before it returns, thus ensuring that all children finish before their
parent finishes.
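As a rough illustration of these semantics (one Python thread per spawn and no scheduler, so this is a sketch of the meaning rather than an efficient implementation), P-FIB might be mimicked like this:

```python
import threading

def p_fib(n):
    if n <= 1:
        return n
    result = {}
    # spawn: start the child and keep executing without waiting for it
    child = threading.Thread(target=lambda: result.update(x=p_fib(n - 1)))
    child.start()
    y = p_fib(n - 2)   # computed in parallel with the spawned child
    child.join()       # sync: wait for the spawned child to finish
    return result['x'] + y

print(p_fib(10))   # 55
```

Real task-parallel platforms do not create one operating-system thread per spawn; as the text describes, a scheduler maps the logical tasks onto a fixed pool of threads.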
A graph model for parallel execution
It helps to view the execution of a parallel computation—the dynamic
stream of runtime instructions executed by processors under the
direction of a parallel program—as a directed acyclic graph G = ( V, E), called a (parallel) trace.2 Conceptually, the vertices in V are executed instructions, and the edges in E represent dependencies between
instructions, where ( u, v) ∈ E means that the parallel program required instruction u to execute before instruction v.
It’s sometimes inconvenient, especially if we want to focus on the
parallel structure of a computation, for a vertex of a trace to represent
only one executed instruction. Consequently, if a chain of instructions
contains no parallel or procedural control (no spawn, sync, procedure call, or return—via either an explicit return statement or the return that
happens implicitly upon reaching the end of a procedure), we group the
entire chain into a single strand. As an example, Figure 26.2 shows the trace that results from computing P-FIB(4) in the portion of Figure 26.1
shaded blue. Strands do not include instructions that involve parallel or
procedural control. These control dependencies must be represented as
edges in the trace.
When a parent procedure calls a child, the trace contains an edge (u, v) from the strand u in the parent that executes the call to the first strand v of the called child, as illustrated in Figure 26.2 by the edge from the orange strand in P-FIB(4) to the blue strand in P-FIB(2).
When the last strand v′ in the child returns, the trace contains an edge
( v′, u′) to the strand u′, where u′ is the successor strand of u in the parent, as with the edge from the white strand in P-FIB(2) to the white
strand in P-FIB(4).
Figure 26.2 The trace of P-FIB(4) corresponding to the shaded portion of Figure 26.1. Each circle represents one strand, with blue circles representing any instructions executed in the part of the procedure (instance) up to the spawn of P-FIB ( n − 1) in line 3; orange circles representing the instructions executed in the part of the procedure that calls P-FIB ( n − 2) in line 4 up to the sync in line 5, where it suspends until the spawn of P-FIB ( n − 1) returns; and white circles representing the instructions executed in the part of the procedure after the sync, where it sums x and y, up to the point where it returns the result. Strands belonging to the same procedure are grouped into a rounded rectangle, blue for spawned procedures and tan for called procedures. Assuming that each strand takes unit time, the work is 17 time units, since there are 17 strands, and the span is 8 time units, since the critical path—shown with blue edges—
contains 8 strands.
When the parent spawns a child, however, the trace is a little
different. The edge ( u, v) goes from parent to child as with a call, such as the edge from the blue strand in P-FIB(4) to the blue strand in P-FIB(3), but the trace contains another edge ( u, u′) as well, indicating that u’s successor strand u′ can continue to execute while v is executing.
The edge from the blue strand in P-FIB(4) to the orange strand in P-
FIB(4) illustrates one such edge. As with a call, there is an edge from the
last strand v′ in the child, but with a spawn, it no longer goes to u’s successor. Instead, the edge is ( v′, x), where x is the strand immediately following the sync in the parent that ensures that the child has finished,
as with the edge from the white strand in P-FIB(3) to the white strand
in P-FIB(4).
You can figure out what parallel control created a particular trace. If a strand has two successors, one of them must have been spawned, and
if a strand has multiple predecessors, the predecessors joined because of
a sync statement. Thus, in the general case, the set V forms the set of
strands, and the set E of directed edges represents dependencies between
strands induced by parallel and procedural control. If G contains a
directed path from strand u to strand v, we say that the two strands are (logically) in series. If there is no path in G either from u to v or from v to u, the strands are (logically) in parallel.
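Checking whether two strands are in series or in parallel thus reduces to dag reachability. A minimal sketch, assuming the trace is given as an adjacency list over strand names (a hypothetical four-strand trace, not the one in Figure 26.2):

```python
def reachable(adj, u, v):
    """Iterative depth-first search: is there a directed path from u to v?"""
    stack, seen = [u], set()
    while stack:
        w = stack.pop()
        if w == v:
            return True
        if w not in seen:
            seen.add(w)
            stack.extend(adj.get(w, []))
    return False

def in_series(adj, u, v):
    # Strands are in series iff one can reach the other; otherwise in parallel.
    return reachable(adj, u, v) or reachable(adj, v, u)

# Toy trace: strand 0 spawns 1 and continues as 2; both join at 3.
trace = {0: [1, 2], 1: [3], 2: [3]}

print(in_series(trace, 0, 3))   # True  (path 0 -> 1 -> 3)
print(in_series(trace, 1, 2))   # False (logically in parallel)
```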
A fork-join parallel trace can be pictured as a dag of strands
embedded in an invocation tree of procedure instances. For example,
Figure 26.1 shows the invocation tree for FIB(6), which also serves as the invocation tree for P-FIB(6), the edges between procedure instances
now representing either calls or spawns. Figure 26.2 zooms in on the subtree that is shaded blue, showing the strands that constitute each
procedure instance in P-FIB(4). All directed edges connecting strands
run either within a procedure or along undirected edges of the
invocation tree in Figure 26.1. (More general task-parallel traces that are not fork-join traces may contain some directed edges that do not
run along the undirected tree edges.)
Our analyses generally assume that parallel algorithms execute on an
ideal parallel computer, which consists of a set of processors and a sequentially consistent shared memory. To understand sequential
consistency, you first need to know that memory is accessed by load
instructions, which copy data from a location in the memory to a
register within a processor, and by store instructions, which copy data from a processor register to a location in the memory. A single line of
pseudocode can entail several such instructions. For example, the line x
= y + z could result in load instructions to fetch each of y and z from memory into a processor, an instruction to add them together inside the
processor, and a store instruction to place the result x back into
memory. In a parallel computer, several processors might need to load
or store at the same time. Sequential consistency means that even if
multiple processors attempt to access the memory simultaneously, the
shared memory behaves as if exactly one instruction from one of the
processors is executed at a time, even though the actual transfer of data
may happen at the same time. It is as if the instructions were executed one at a time sequentially according to some global linear order among
all the processors that preserves the individual orders in which each
processor executes its own instructions.
For task-parallel computations, which are scheduled onto processors
automatically by a runtime system, the sequentially consistent shared
memory behaves as if a parallel computation’s executed instructions
were executed one by one in the order of a topological sort (see Section
20.4) of its trace. That is, you can reason about the execution by
imagining that the individual instructions (not generally the strands,
which may aggregate many instructions) are interleaved in some linear
order that preserves the partial order of the trace. Depending on
scheduling, the linear order could vary from one run of the program to
the next, but the behavior of any execution is always as if the
instructions executed serially in a linear order consistent with the
dependencies within the trace.
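Whether a particular interleaving is “as if serial” in this sense is exactly the condition that it be a topological sort of the trace, which is straightforward to check. A minimal sketch with a hypothetical four-instruction trace:

```python
def consistent_with_trace(order, edges):
    """True if the linear order respects every dependency (u executes before v)."""
    position = {instr: i for i, instr in enumerate(order)}
    return all(position[u] < position[v] for u, v in edges)

# Dependencies of a tiny trace: a before b and c; b and c before d.
deps = [('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd')]

print(consistent_with_trace(['a', 'b', 'c', 'd'], deps))   # True
print(consistent_with_trace(['a', 'c', 'b', 'd'], deps))   # True  (b and c may interleave either way)
print(consistent_with_trace(['b', 'a', 'c', 'd'], deps))   # False (b runs before its dependency a)
```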
In addition to making assumptions about semantics, the ideal
parallel-computer model makes some performance assumptions.
Specifically, it assumes that each processor in the machine has equal
computing power, and it ignores the cost of scheduling. Although this
last assumption may sound optimistic, it turns out that for algorithms
with sufficient “parallelism” (a term we’ll define precisely a little later),
the overhead of scheduling is generally minimal in practice.
Performance measures
We can gauge the theoretical efficiency of a task-parallel algorithm
using work/span analysis, which is based on two metrics: “work” and
“span.” The work of a task-parallel computation is the total time to execute the entire computation on one processor. In other words, the
work is the sum of the times taken by each of the strands. If each strand
takes unit time, the work is just the number of vertices in the trace. The
span is the fastest possible time to execute the computation on an
unlimited number of processors, which corresponds to the sum of the
times taken by the strands along a longest path in the trace, where
“longest” means that each strand is weighted by its execution time. Such

a longest path is called the critical path of the trace, and thus the span is
the weight of the longest (weighted) path in the trace. (Section 22.2,
pages 617–619 shows how to find a critical path in a dag G = ( V, E) in Θ( V + E) time.) For a trace in which each strand takes unit time, the span equals the number of strands on the critical path. For example, the
trace of Figure 26.2 has 17 vertices in all and 8 vertices on its critical path, so that if each strand takes unit time, its work is 17 time units and
its span is 8 time units.
The actual running time of a task-parallel computation depends not
only on its work and its span, but also on how many processors are
available and how the scheduler allocates strands to processors. To
denote the running time of a task-parallel computation on P processors,
we subscript by P. For example, we might denote the running time of an
algorithm on P processors by TP. The work is the running time on a
single processor, or T 1. The span is the running time if we could run
each strand on its own processor—in other words, if we had an
unlimited number of processors—and so we denote the span by T∞.
The work and span provide lower bounds on the running time TP of
a task-parallel computation on P processors:
In one step, an ideal parallel computer with P processors can do
at most P units of work, and thus in TP time, it can perform at
most P TP work. Since the total work to do is T 1, we have P TP ≥
T1. Dividing by P yields the work law:
TP ≥ T1/P.    (26.2)
A P-processor ideal parallel computer cannot run any faster than
a machine with an unlimited number of processors. Looked at
another way, a machine with an unlimited number of processors
can emulate a P-processor machine by using just P of its
processors. Thus, the span law follows:
TP ≥ T∞.    (26.3)
We define the speedup of a computation on P processors by the ratio T 1/ TP, which says how many times faster the computation runs on P
processors than on one processor. By the work law, we have TP ≥ T 1/ P, which implies that T 1/ TP ≤ P. Thus, the speedup on a P-processor ideal parallel computer can be at most P. When the speedup is linear in the
number of processors, that is, when T 1/ TP = Θ( P), the computation exhibits linear speedup. Perfect linear speedup occurs when T 1/ TP = P.
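The two laws can be combined into a concrete bound on achievable speedup. The sketch below uses illustrative helper names (not from the text) and the work and span of P-FIB(4) from Figure 26.2:

```python
# Sketch: the work law (TP >= T1/P) and span law (TP >= Tinf) combine
# into a lower bound on the P-processor running time, which in turn
# caps the speedup T1/TP at min(P, T1/Tinf).
def running_time_lower_bound(t1, tinf, p):
    return max(t1 / p, tinf)          # work law and span law together

def best_speedup(t1, tinf, p):
    return t1 / running_time_lower_bound(t1, tinf, p)

t1, tinf = 17, 8                      # work and span of P-FIB(4) (Figure 26.2)
print(best_speedup(t1, tinf, 2))      # 2.0: the work law binds
print(best_speedup(t1, tinf, 64))     # 2.125: capped by the parallelism T1/Tinf
```

With 2 processors the work law is the binding constraint, so perfect linear speedup is still possible; with 64 processors the span law takes over and the speedup can never exceed the parallelism 17/8.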
The ratio T 1/ T∞ of the work to the span gives the parallelism of the parallel computation. We can view the parallelism from three
perspectives. As a ratio, the parallelism denotes the average amount of
work that can be performed in parallel for each step along the critical
path. As an upper bound, the parallelism gives the maximum possible
speedup that can be achieved on any number of processors. Perhaps
most important, the parallelism provides a limit on the possibility of
attaining perfect linear speedup. Specifically, once the number of
processors exceeds the parallelism, the computation cannot possibly
achieve perfect linear speedup. To see this last point, suppose that P >
T 1/ T∞, in which case the span law implies that the speedup satisfies T 1/ TP ≤ T 1/ T∞ < P. Moreover, if the number P of processors in the ideal parallel computer greatly exceeds the parallelism—that is, if P ≫
T 1/ T∞—then T 1/ TP ≪ P, so that the speedup is much less than the number of processors. In other words, if the number of processors
exceeds the parallelism, adding even more processors makes the
speedup less perfect.
As an example, consider the computation P-FIB(4) in Figure 26.2,
and assume that each strand takes unit time. Since the work is T 1 = 17
and the span is T∞ = 8, the parallelism is T 1/ T∞ = 17/8 = 2.125.
Consequently, achieving much more than double the performance is
impossible, no matter how many processors execute the computation.
For larger input sizes, however, we’ll see that P-FIB ( n) exhibits
substantial parallelism.
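The counts in this example can be reproduced by a short recurrence. Assuming (as the trace of Figure 26.2 suggests) that each internal instance of P-FIB contributes three unit-time strands, one before the spawn, one around the call in line 4, and one after the sync, while each base case contributes one strand:

```python
# Sketch: unit-time strand counts for P-FIB(n), modeling the trace of
# Figure 26.2.  The critical path threads down the spawned
# P-FIB(n-1) branch and back up through the sync strands.
from functools import lru_cache

@lru_cache(maxsize=None)
def work(n):        # T1(n): total number of strands in the trace
    return 1 if n < 2 else 3 + work(n - 1) + work(n - 2)

@lru_cache(maxsize=None)
def span(n):        # Tinf(n): strands on a longest (critical) path
    return 1 if n < 2 else 1 + max(span(n - 1) + 1, span(n - 2) + 2)

print(work(4), span(4))          # 17 8, matching the counts in the text
print(work(25) / span(25))       # parallelism grows like phi^n / n
```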
We define the (parallel) slackness of a task-parallel computation
executed on an ideal parallel computer with P processors to be the ratio
( T 1/ T∞)/ P = T 1/( P T∞), which is the factor by which the parallelism of the computation exceeds the number of processors in the machine.
Restating the bounds on speedup, if the slackness is less than 1, perfect
linear speedup is impossible, because T 1/( P T∞) < 1 and the span law imply that T 1/ TP ≤ T 1/ T∞ < P. Indeed, as the slackness decreases from 1 and approaches 0, the speedup of the computation diverges further
and further from perfect linear speedup. If the slackness is less than 1,
additional parallelism in an algorithm can have a great impact on its
execution efficiency. If the slackness is greater than 1, however, the work
per processor is the limiting constraint. We’ll see that as the slackness
increases from 1, a good scheduler can achieve closer and closer to
perfect linear speedup. But once the slackness is much greater than 1,
the advantage of additional parallelism shows diminishing returns.
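These regimes can be made concrete with a tiny numeric sketch (the numbers are illustrative, not from the text):

```python
# Sketch: parallel slackness (T1/Tinf)/P for a few machine sizes.
# With work T1 = 10_000 and span Tinf = 100, the parallelism is 100,
# so the slackness drops below 1 once P exceeds 100 processors.
def slackness(t1, tinf, p):
    return t1 / (p * tinf)

t1, tinf = 10_000, 100               # parallelism T1/Tinf = 100
for p in (10, 100, 1000):
    print(p, slackness(t1, tinf, p)) # 10.0, then 1.0, then 0.1
```

At 10 processors the slackness is 10 and near-perfect linear speedup is plausible; at 1000 processors the slackness is 0.1 and the speedup is capped at a tenth of the machine.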
Scheduling
Good performance depends on more than just minimizing the work and
span. The strands must also be scheduled efficiently onto the processors
of the parallel machine. Our fork-join parallel-programming model
provides no way for a programmer to specify which strands to execute
on which processors. Instead, we rely on the runtime system’s scheduler
to map the dynamically unfolding computation to individual processors.
In practice, the scheduler maps the strands to static threads, and the
operating system schedules the threads on the processors themselves.
But this extra level of indirection is unnecessary for our understanding
of scheduling. We can just imagine that the scheduler maps strands to
processors directly.
A task-parallel scheduler must schedule the computation without
knowing in advance when procedures will be spawned or when they will
finish—that is, it must operate online. Moreover, a good scheduler
operates in a distributed fashion, where the threads implementing the
scheduler cooperate to load-balance the computation. Provably good
online, distributed schedulers exist, but analyzing them is complicated.
Instead, to keep our analysis simple, we’ll consider an online centralized
scheduler that knows the global state of the computation at any
moment.
In particular, we’ll analyze greedy schedulers, which assign as many
strands to processors as possible in each time step, never leaving a
processor idle if there is work that can be done. We’ll classify each step
of a greedy scheduler as follows:
Complete step: At least P strands are ready to execute, meaning that all strands on which they depend have finished execution. A
greedy scheduler assigns any P of the ready strands to the
processors, completely utilizing all the processor resources.
Incomplete step: Fewer than P strands are ready to execute. A greedy scheduler assigns each ready strand to its own processor,
leaving some processors idle for the step, but executing all the
ready strands.
The work law tells us that the fastest running time TP that we can
hope for on P processors must be at least T 1/ P. The span law tells us that the fastest possible running time must be at least T∞. The following
theorem shows that greedy scheduling is provably good in that it
achieves the sum of these two lower bounds as an upper bound.
Theorem 26.1
On an ideal parallel computer with P processors, a greedy scheduler
executes a task-parallel computation with work T1 and span T∞ in time
TP ≤ T1/P + T∞.    (26.4)
Proof Without loss of generality, assume that each strand takes unit
time. (If necessary, replace each longer strand by a chain of unit-time
strands.) We’ll consider complete and incomplete steps separately.
In each complete step, the P processors together perform a total of P
work. Thus, if the number of complete steps is k, the total work
executing all the complete steps is kP. Since the greedy scheduler
doesn’t execute any strand more than once and only T 1 work needs to
be performed, it follows that kP ≤ T 1, from which we can conclude that the number k of complete steps is at most T 1/ P.

Now, let’s consider an incomplete step. Let G be the trace for the
entire computation, let G′ be the subtrace of G that has yet to be executed at the start of the incomplete step, and let G″ be the subtrace
remaining to be executed after the incomplete step. Consider the set R
of strands that are ready at the beginning of the incomplete step, where
| R| < P. By definition, if a strand is ready, all its predecessors in trace G
have executed. Thus the predecessors of strands in R do not belong to
G′. A longest path in G′ must necessarily start at a strand in R, since every other strand in G′ has a predecessor and thus could not start a longest path. Because the greedy scheduler executes all ready strands
during the incomplete step, the strands of G″ are exactly those in G′
minus the strands in R. Consequently, the length of a longest path in G″
must be 1 less than the length of a longest path in G′. In other words,
every incomplete step decreases the span of the trace remaining to be
executed by 1. Hence, the number of incomplete steps can be at most
T∞. Since each step is either complete or incomplete, the theorem follows.
▪
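Theorem 26.1 can be checked empirically with a toy greedy scheduler over a unit-time trace. The dag encoding below (a dict mapping each strand to its predecessors) is an illustrative choice, not from the text:

```python
# Sketch: simulate a greedy scheduler on a unit-time trace and verify
# the bound of Theorem 26.1: TP <= T1/P + Tinf.
def greedy_schedule(deps, p):
    done, steps = set(), 0
    while len(done) < len(deps):
        ready = [s for s in deps if s not in done
                 and all(d in done for d in deps[s])]
        done.update(ready[:p])   # complete step: run P strands; incomplete: run all
        steps += 1
    return steps

# Diamond-shaped trace a -> {b, c} -> d: work T1 = 4, span Tinf = 3.
deps = {'a': [], 'b': ['a'], 'c': ['a'], 'd': ['b', 'c']}
t1, tinf, p = 4, 3, 2
tp = greedy_schedule(deps, p)
print(tp)                       # 3 steps: {a}, {b, c}, {d}
assert tp <= t1 / p + tinf      # Theorem 26.1 holds
```

On one processor the same simulation takes exactly T1 = 4 steps, as the work law demands.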
The following corollary shows that a greedy scheduler always
performs well.
Corollary 26.2
The running time TP of any task-parallel computation scheduled by a
greedy scheduler on a P-processor ideal parallel computer is within a factor of 2 of optimal.
Proof Let T*P be the running time produced by an optimal scheduler on a machine with P processors, and let T1 and T∞ be the work and span of the computation, respectively. Since the work and span laws—inequalities (26.2) and (26.3)—give T*P ≥ max {T1/P, T∞}, Theorem 26.1
implies that
TP ≤ T1/P + T∞
≤ 2 · max {T1/P, T∞}
≤ 2T*P.
▪
The next corollary shows that, in fact, a greedy scheduler achieves
near-perfect linear speedup on any task-parallel computation as the
slackness grows.
Corollary 26.3
Let TP be the running time of a task-parallel computation produced by
a greedy scheduler on an ideal parallel computer with P processors, and
let T1 and T∞ be the work and span of the computation, respectively.
Then, if P ≪ T 1/ T∞, or equivalently, the parallel slackness is much greater than 1, we have TP ≈ T 1/ P, a speedup of approximately P.
Proof If we suppose that P ≪ T 1/ T∞, then it follows that T∞ ≪ T 1/ P, and hence Theorem 26.1 gives TP ≤ T 1/ P + T∞ ≈ T 1/ P. Since the work law (26.2) dictates that TP ≥ T 1/ P, we conclude that TP ≈ T 1/ P, which is a speedup of T 1/ TP ≈ P.
▪
The ≪ symbol denotes “much less,” but how much is “much less”?
As a rule of thumb, a slackness of at least 10—that is, 10 times more
parallelism than processors—generally suffices to achieve good speedup.
Then, the span term in the greedy bound, inequality (26.4), is less than
10% of the work-per-processor term, which is good enough for most
engineering situations. For example, if a computation runs on only 10 or
100 processors, it doesn’t make sense to value parallelism of, say
1,000,000, over parallelism of 10,000, even with the factor of 100
difference. As Problem 26-2 shows, sometimes reducing extreme
parallelism yields algorithms that are better with respect to other
concerns and which still scale up well on reasonable numbers of
processors.
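A quick calculation under the greedy bound of Theorem 26.1 shows why a slackness of 10 is a sensible rule of thumb (the numbers are illustrative):

```python
# Sketch: with slackness 10, the span term Tinf in the greedy bound
# T1/P + Tinf is only 10% of the work-per-processor term T1/P, so the
# guaranteed running time is within about 10% of the work-law optimum.
t1, tinf, p = 1_000_000, 1_000, 100       # parallelism 1000, slackness 10
work_term = t1 / p                        # 10000.0
print(tinf / work_term)                   # 0.1: span term is 10% of work term
print((work_term + tinf) / work_term)     # 1.1: within 10% of T1/P
```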
Analyzing parallel algorithms
We now have all the tools we need to analyze parallel algorithms using
work/span analysis, allowing us to bound an algorithm’s running time
on any number of processors. Analyzing the work is relatively
straightforward, since it amounts to nothing more than analyzing the
running time of an ordinary serial algorithm, namely, the serial
projection of the parallel algorithm. You should already be familiar
with analyzing work, since that is what most of this textbook is about!
Analyzing the span is the new thing that parallelism engenders, but it’s
generally no harder once you get the hang of it. Let’s investigate the
basic ideas using the P-FIB program.
Analyzing the work T 1( n) of P-FIB ( n) poses no hurdles, because we’ve already done it. The serial projection of P-FIB is effectively the
original FIB procedure, and hence, we have T1(n) = T(n) = Θ(ϕ^n) from equation (26.1).
Figure 26.3 illustrates how to analyze the span. If two traces are joined in series, their spans add to form the span of their composition,
whereas if they are joined in parallel, the span of their composition is
the maximum of the spans of the two traces. As it turns out, the trace of
any fork-join parallel computation can be built up from single strands
by series-parallel composition.
Figure 26.3 Series-parallel composition of parallel traces. (a) When two traces are joined in series, the work of the composition is the sum of their work, and the span of the composition is the sum of their spans. (b) When two traces are joined in parallel, the work of the composition remains the sum of their work, but the span of the composition is only the maximum of their spans.
Armed with an understanding of series-parallel composition, we can
analyze the span of P-FIB ( n). The spawned call to P-FIB ( n − 1) in line
3 runs in parallel with the call to P-FIB ( n − 2) in line 4. Hence, we can
express the span of P-FIB ( n) as the recurrence
T∞( n) = max { T∞( n − 1), T∞( n − 2)} + Θ(1)
= T∞( n − 1) + Θ(1),
which has solution T∞( n) = Θ( n). (The second equality above follows from the first because P-FIB ( n − 1) uses P-FIB ( n − 2) in its computation, so that the span of P-FIB ( n − 1) must be at least as large
as the span of P-FIB ( n − 2).)
The parallelism of P-FIB(n) is T1(n)/T∞(n) = Θ(ϕ^n/n), which grows dramatically as n gets large. Thus, Corollary 26.3 tells us that on even
the largest parallel computers, a modest value for n suffices to achieve
near perfect linear speedup for P-FIB ( n), because this procedure
exhibits considerable parallel slackness.
Parallel loops
Many algorithms contain loops for which all the iterations can operate
in parallel. Although the spawn and sync keywords can be used to
parallelize such loops, it is more convenient to specify directly that the
iterations of such loops can run in parallel. Our pseudocode provides
this functionality via the parallel keyword, which precedes the for
keyword in a for loop statement.
As an example, consider the problem of multiplying a square n × n
matrix A = (aij) by an n-vector x = (xj). The resulting n-vector y = (yi) is given by the equation
yi = ∑_{j=1}^{n} aij xj
for i = 1, 2, … , n. The P-MAT-VEC procedure performs matrix-vector
multiplication (actually, y = y + Ax) by computing all the entries of y in parallel. The parallel for keywords in line 1 of P-MAT-VEC indicate
that the n iterations of the loop body, which includes a serial for loop,
may be run in parallel. The initialization y = 0, if desired, should be
performed before calling the procedure (and can be done with a parallel for loop).
P-MAT-VEC(A, x, y, n)
1  parallel for i = 1 to n    // parallel loop
2      for j = 1 to n         // serial loop
3          yi = yi + aij xj
Compilers for fork-join parallel programs can implement parallel for
loops in terms of spawn and sync by using recursive spawning. For
example, for the parallel for loop in lines 1–3, a compiler can generate
the auxiliary subroutine P-MAT-VEC-RECURSIVE and call P-MAT-
VEC-RECURSIVE ( A, x, y, n, 1, n) in the place where the loop would be in the compiled code. As Figure 26.4 illustrates, this procedure recursively spawns the first half of the iterations of the loop to execute
in parallel (line 5) with the second half of the iterations (line 6) and then
executes a sync (line 7), thereby creating a binary tree of parallel
execution. Each leaf represents a base case, which is the serial for loop
of lines 2–3.
P-MAT-VEC-RECURSIVE(A, x, y, n, i, i′)
1  if i == i′                 // just one iteration to do?
2      for j = 1 to n         // mimic P-MAT-VEC serial loop
3          yi = yi + aij xj
4  else mid = ⌊(i + i′)/2⌋    // parallel divide-and-conquer
5      spawn P-MAT-VEC-RECURSIVE(A, x, y, n, i, mid)
6      P-MAT-VEC-RECURSIVE(A, x, y, n, mid + 1, i′)
7      sync
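In a real language, the recursive spawning can be mimicked directly. Here is a sketch with Python threads standing in for spawn and sync, with 0-based half-open ranges replacing the pseudocode's 1-based indices:

```python
# Sketch: recursive spawning for matrix-vector multiply.  A thread
# plays the role of "spawn"; joining it plays the role of "sync".
# The iterations write disjoint entries of y, so there is no race.
import threading

def mat_vec_recursive(A, x, y, lo, hi):
    if hi - lo == 1:                        # base case: one iteration i = lo
        for j in range(len(x)):
            y[lo] += A[lo][j] * x[j]
    else:
        mid = (lo + hi) // 2
        t = threading.Thread(target=mat_vec_recursive,
                             args=(A, x, y, lo, mid))   # "spawn" first half
        t.start()
        mat_vec_recursive(A, x, y, mid, hi)             # parent runs second half
        t.join()                                        # "sync"

A = [[1, 2], [3, 4]]
x = [1, 1]
y = [0, 0]
mat_vec_recursive(A, x, y, 0, len(A))
print(y)    # [3, 7]
```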
To calculate the work T 1( n) of P-MAT-VEC on an n× n matrix, simply compute the running time of its serial projection, which comes
from replacing the parallel for loop in line 1 with an ordinary for loop.
The running time of the resulting serial pseudocode is Θ(n^2), which
means that T1(n) = Θ(n^2). This analysis seems to ignore the overhead for recursive spawning in implementing the parallel loops, however.
Indeed, the overhead of recursive spawning does increase the work of a
parallel loop compared with that of its serial projection, but not
asymptotically. To see why, observe that since the tree of recursive
procedure instances is a full binary tree, the number of internal nodes is
one less than the number of leaves (see Exercise B.5-3 on page 1175).
Each internal node performs constant work to divide the iteration
range, and each leaf corresponds to a base case, which takes at least
constant time (Θ( n) time in this case). Thus, by amortizing the overhead
of recursive spawning over the work of the iterations in the leaves, we
see that the overall work increases by at most a constant factor.
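The full-binary-tree argument can be verified with a quick count over the same splitting rule the pseudocode uses:

```python
# Sketch: count internal nodes and leaves of the recursion tree that
# recursive spawning builds over the index range [lo, hi).  In a full
# binary tree, internal = leaves - 1, so the constant spawning work at
# internal nodes is dominated by the Theta(n) work at the leaves.
def count_nodes(lo, hi):
    if hi - lo == 1:
        return (0, 1)                     # (internal, leaves): one base case
    mid = (lo + hi) // 2
    li, ll = count_nodes(lo, mid)
    ri, rl = count_nodes(mid, hi)
    return (li + ri + 1, ll + rl)

for n in (8, 1000):
    internal, leaves = count_nodes(0, n)
    print(internal, leaves)               # internal is always leaves - 1
```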
Figure 26.4 A trace for the computation of P-MAT-VEC-RECURSIVE ( A, x, y, 8, 1, 8). The two numbers within each rounded rectangle give the values of the last two parameters ( i and i′ in the procedure header) in the invocation (spawn, in blue, or call, in tan) of the procedure. The blue circles represent strands corresponding to the part of the procedure up to the spawn of P-MAT-VEC-RECURSIVE in line 5. The orange circles represent strands corresponding to the
part of the procedure that calls P-MAT-VEC-RECURSIVE in line 6 up to the sync in line 7, where it suspends until the spawned subroutine in line 5 returns. The white circles represent strands corresponding to the (negligible) part of the procedure after the sync up to the point where it returns.
To reduce the overhead of recursive spawning, task-parallel
platforms sometimes coarsen the leaves of the recursion by executing
several iterations in a single leaf, either automatically or under
programmer control. This optimization comes at the expense of
reducing the parallelism. If the computation has sufficient parallel
slackness, however, near-perfect linear speedup won’t be sacrificed.
Although recursive spawning doesn’t affect the work of a parallel
loop asymptotically, we must take it into account when analyzing the
span. Consider a parallel loop with n iterations in which the i th iteration has span iter∞( i). Since the depth of recursion is logarithmic in the number of iterations, the parallel loop’s span is
T∞( n) = Θ(lg n) + max { iter∞( i) : 1 ≤ i ≤ n}.
For example, let’s compute the span of the doubly nested loops in
lines 1–3 of P-MAT-VEC. The span for the parallel for loop control is
Θ(lg n). For each iteration of the outer parallel loop, the inner serial for
loop contains n iterations of line 3. Since each iteration takes constant
time, the total span for the inner serial for loop is Θ( n), no matter which
iteration of the outer parallel for loop it’s in. Thus, taking the maximum
over all iterations of the outer loop and adding in the Θ(lg n) for loop
control yields an overall span of T∞(n) = Θ(n) + Θ(lg n) = Θ(n) for the procedure. Since the work is Θ(n^2), the parallelism is Θ(n^2)/Θ(n) = Θ(n).
(Exercise 26.1-7 asks you to provide an implementation with even more
parallelism.)
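The span formula for parallel loops can be evaluated directly. In the sketch below, the Θ(lg n) loop-control term is modeled as exactly ⌈lg n⌉, an illustrative choice of constant:

```python
# Sketch: span of a parallel for with recursive spawning is the
# Theta(lg n) spawning depth plus the largest single-iteration span.
import math

def parallel_loop_span(iter_spans):
    n = len(iter_spans)
    return math.ceil(math.log2(n)) + max(iter_spans)

# P-MAT-VEC's outer loop: each of n iterations runs an inner serial
# loop of span Theta(n), so the total is Theta(n) + Theta(lg n) = Theta(n).
n = 1024
print(parallel_loop_span([n] * n))   # 1034 = 10 + 1024
```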
Race conditions
A parallel algorithm is deterministic if it always does the same thing on
the same input, no matter how the instructions are scheduled on the
multicore computer. It is nondeterministic if its behavior might vary
from run to run when the input is the same. A parallel algorithm that is
intended to be deterministic may nevertheless act nondeterministically,
however, if it contains a difficult-to-diagnose bug called a “determinacy
race.”
Famous race bugs include the one in the Therac-25 radiation therapy
machine, which killed three people and injured several others, and the
one behind the Northeast Blackout of 2003, which left over 50 million people in the United States
without power. These pernicious bugs are notoriously hard to find. You
can run tests in the lab for days without a failure, only to discover that
your software sporadically crashes in the field, sometimes with dire
consequences.
A determinacy race occurs when two logically parallel instructions
access the same memory location and at least one of the instructions
modifies the value stored in the location. The toy procedure RACE-
EXAMPLE on the following page illustrates a determinacy race. After
initializing x to 0 in line 1, RACE-EXAMPLE creates two parallel
strands, each of which increments x in line 3. Although it might seem
that a call of RACE-EXAMPLE should always print the value 2 (its
serial projection certainly does), it could instead print the value 1. Let’s
see how this anomaly might occur.
When a processor increments x, the operation is not indivisible, but
is composed of a sequence of instructions:
1. Load x from memory into one of the processor’s registers.
2. Increment the value in the register.
3. Store the value in the register back into x in memory.
Figure 26.5 Illustration of the determinacy race in RACE-EXAMPLE. (a) A trace showing the dependencies among individual instructions. The processor registers are r1 and r2. Instructions unrelated to the race, such as the implementation of loop control, are omitted. (b) An execution sequence that elicits the bug, showing the values of x in memory and registers r1 and r2 for each step in the execution sequence.
RACE-EXAMPLE()
1  x = 0
2  parallel for i = 1 to 2
3      x = x + 1    // determinacy race
4  print x
Figure 26.5(a) illustrates a trace representing the execution of RACE-EXAMPLE, with the strands broken down to individual instructions.
Recall that since an ideal parallel computer supports sequential
consistency, you can view the parallel execution of a parallel algorithm
as an interleaving of instructions that respects the dependencies in the
trace. Part (b) of the figure shows the values in an execution of the
computation that elicits the anomaly. The value x is kept in memory, and r 1 and r 2 are processor registers. In step 1, one of the processors sets x to 0. In steps 2 and 3, processor 1 loads x from memory into its register r 1 and increments it, producing the value 1 in r 1. At that point, processor 2 comes into the picture, executing instructions 4–6. Processor
2 loads x from memory into register r 2; increments it, producing the value 1 in r 2; and then stores this value into x, setting x to 1. Now, processor 1 resumes with step 7, storing the value 1 in r 1 into x, which leaves the value of x unchanged. Therefore, step 8 prints the value 1, rather than the value 2 that the serial projection would print.
Let’s recap what happened. By sequential consistency, the effect of
the parallel execution is as if the executed instructions of the two
processors are interleaved. If processor 1 executes all its instructions
before processor 2, a trivial interleaving, the value 2 is printed.
Conversely, if processor 2 executes all its instructions before processor 1,
the value 2 is still printed. When the instructions of the two processors
interleave nontrivially, however, it is possible, as in this example
execution, that one of the updates to x is lost, resulting in the value 1
being printed.
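The lost update can be replayed deterministically by simulating the two interleavings explicitly. This is a toy model of the three-instruction increment, not real threading:

```python
# Sketch: replay schedules of load/incr/store steps for two simulated
# processors sharing x.  The racy schedule interleaves the two
# increments so that one update to x is lost.
def run(schedule):
    mem = {'x': 0}
    regs = {1: None, 2: None}
    for proc, op in schedule:
        if op == 'load':
            regs[proc] = mem['x']       # load x into the processor's register
        elif op == 'incr':
            regs[proc] += 1             # increment the register
        elif op == 'store':
            mem['x'] = regs[proc]       # store the register back into x
    return mem['x']

serial = [(1, 'load'), (1, 'incr'), (1, 'store'),
          (2, 'load'), (2, 'incr'), (2, 'store')]
racy   = [(1, 'load'), (1, 'incr'),
          (2, 'load'), (2, 'incr'), (2, 'store'),
          (1, 'store')]
print(run(serial), run(racy))   # 2 1: the racy schedule loses an update
```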
Of course, many executions do not elicit the bug. That’s the problem
with determinacy races. Generally, most instruction orderings produce
correct results, such as those in which the instructions on the left branch
execute before the instructions on the right branch, or vice versa. But some orderings generate improper results when the instructions
interleave. Consequently, races can be extremely hard to test for. Your
program may fail, but you may be unable to reliably reproduce the
failure in subsequent tests, confounding your attempts to locate the bug
in your code and fix it. Task-parallel programming environments often
provide race-detection productivity tools to help you isolate race bugs.
Many parallel programs in the real world are intentionally
nondeterministic. They contain determinacy races, but they mitigate the
dangers of nondeterminism through the use of mutual-exclusion locks
and other methods of synchronization. For our purposes, however, we’ll
insist on an absence of determinacy races in the algorithms we develop.
Nondeterministic programs are indeed interesting, but nondeterministic
programming is a more advanced topic and unnecessary for a wide
swath of interesting parallel algorithms.
To ensure that algorithms are deterministic, any two strands that
operate in parallel should be mutually noninterfering: they only read, and do not modify, any memory locations accessed by both of them.
Consequently, in a parallel for construct, such as the outer loop of P-
MAT-VEC, we want all the iterations of the body, including any code
an iteration executes in subroutines, to be mutually noninterfering. And
between a spawn and its corresponding sync, we want the code executed
by the spawned child and the code executed by the parent to be
mutually noninterfering, once again including invoked subroutines.
As an example of how easy it is to write code with unintentional