The goal is to find a perfect matching M* (see Exercises 25.1-5 and 25.1-6) whose edges have the maximum total weight over all perfect matchings. That is, letting w(M) = ∑_{(l,r) ∈ M} w(l, r) denote the total weight of the edges in matching M, we want to find a perfect matching M* such that

w(M*) = max {w(M) : M is a perfect matching} .

We call finding such a maximum-weight perfect matching the

assignment problem. A solution to the assignment problem is a perfect

matching that maximizes the total utility. Like the stable-marriage

problem, the assignment problem finds a matching that is “good,” but

with a different definition of good: maximizing total value rather than

achieving stability.

Although you could enumerate all n! perfect matchings to solve the

assignment problem, an algorithm known as the Hungarian algorithm

solves it much faster. This section will prove an O(n⁴) time bound, and Problem 25-2 asks you to refine the algorithm to reduce the running time to O(n³). Instead of working with the complete bipartite graph G, the Hungarian algorithm works with a subgraph of G called the

“equality subgraph.” The equality subgraph, which is defined below,

changes over time and has the beneficial property that any perfect

matching in the equality subgraph is also an optimal solution to the

assignment problem.
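To make the problem concrete, the n!-enumeration approach just mentioned can be written directly. The sketch below (the function name and the small weight matrix are our own, not from the text) represents the weights as an n × n matrix with w[i][j] the weight of edge (li, rj), so a perfect matching is a permutation:

```python
from itertools import permutations

def assignment_brute_force(w):
    """Maximize sum of w[i][p[i]] over all n! perfect matchings.

    w is an n x n matrix; w[i][j] is the weight of edge (l_i, r_j).
    Returns (best_weight, best_permutation).
    """
    n = len(w)
    best = max(permutations(range(n)),
               key=lambda p: sum(w[i][p[i]] for i in range(n)))
    return sum(w[i][best[i]] for i in range(n)), best

# A 3 x 3 instance: matching l0->r1, l1->r2, l2->r0 has weight 9 + 8 + 7 = 24.
w = [[1, 9, 2],
     [3, 4, 8],
     [7, 5, 6]]
print(assignment_brute_force(w))  # (24, (1, 2, 0))
```

This takes Θ(n · n!) time, which is exactly what the Hungarian algorithm improves upon.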

The equality subgraph depends on assigning an attribute h to each

vertex. We call h the label of a vertex, and we say that h is a feasible vertex labeling of G if

l.h + r.h ≥ w(l, r) for all l ∈ L and r ∈ R.

A feasible vertex labeling always exists, such as the default vertex labeling given by

l.h = max {w(l, r) : r ∈ R} for each vertex l ∈ L ,    (25.1)
r.h = 0 for each vertex r ∈ R .    (25.2)

Given a feasible vertex labeling h, the equality subgraph Gh = ( V, Eh) of G consists of the same vertices as G and the subset of edges

Eh = {( l, r) ∈ E : l. h + r. h = w( l, r)}.
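In code, the equality subgraph is easy to extract from a weight matrix and a labeling. The following sketch (function and variable names are our own) also checks that the labeling is feasible before selecting the tight edges:

```python
def equality_subgraph_edges(w, hl, hr):
    """Edges (i, j) with hl[i] + hr[j] == w[i][j], given a feasible labeling.

    w is the n x n weight matrix; hl and hr hold the labels of L and R.
    """
    n = len(w)
    assert all(hl[i] + hr[j] >= w[i][j] for i in range(n) for j in range(n)), \
        "labeling must be feasible"
    return [(i, j) for i in range(n) for j in range(n)
            if hl[i] + hr[j] == w[i][j]]

# Default labeling: l.h = max weight of an incident edge, r.h = 0.
w = [[1, 9, 2],
     [3, 4, 8],
     [7, 5, 6]]
hl = [max(row) for row in w]   # [9, 8, 7]
hr = [0, 0, 0]
print(equality_subgraph_edges(w, hl, hr))  # [(0, 1), (1, 2), (2, 0)]
```

On this particular matrix the default labeling's equality subgraph already happens to contain a perfect matching, illustrating the theorem that follows; in general more work is needed.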

The following theorem ties together a perfect matching in an

equality subgraph and an optimal solution to the assignment problem.

Theorem 25.14

Let G = (V, E), where V = L ∪ R, be a complete bipartite graph in which each edge (l, r) ∈ E has weight w(l, r). Let h be a feasible vertex labeling of G and Gh be the equality subgraph of G. If Gh contains a perfect matching M*, then M* is an optimal solution to the assignment problem on G.

Proof If Gh contains a perfect matching M*, then because Gh and G have the same set of vertices, M* is also a perfect matching in G. Because each edge of M* belongs to Gh and each vertex has exactly one incident edge from any perfect matching, we have

w(M*) = ∑_{(l,r) ∈ M*} w(l, r) = ∑_{(l,r) ∈ M*} (l.h + r.h) = ∑_{v ∈ V} v.h .

Letting M be any perfect matching in G, we have

w(M) = ∑_{(l,r) ∈ M} w(l, r) ≤ ∑_{(l,r) ∈ M} (l.h + r.h) = ∑_{v ∈ V} v.h .    (25.3)

Thus, we have

w(M) ≤ ∑_{v ∈ V} v.h = w(M*),

so that M* is a maximum-weight perfect matching in G.

The goal now becomes finding a perfect matching in an equality

subgraph. Which equality subgraph? It does not matter! We have free

rein to not only choose an equality subgraph, but to change which

equality subgraph we choose as we go along. We just need to find some

perfect matching in some equality subgraph.

To understand the equality subgraph better, consider again the proof

of Theorem 25.14 and, in the second half, let M be any matching. The

proof is still valid, in particular, inequality (25.3): the weight of any

matching is always at most the sum of the vertex labels. If we choose any

set of vertex labels that define an equality subgraph, then a maximum-

cardinality matching in this equality subgraph has total value at most

the sum of the vertex labels. If the set of vertex labels is the “right” one,

then it will have total value equal to w( M*), and a maximum-cardinality matching in the equality subgraph is also a maximum-weight perfect

matching. The Hungarian algorithm repeatedly modifies the matching

and the vertex labels in order to achieve this goal.

The Hungarian algorithm starts with any feasible vertex labeling h

and any matching M in the equality subgraph Gh. It repeatedly finds an

M-augmenting path P in Gh and, using Lemma 25.1, updates the matching to be M ⊕ P, thereby incrementing the size of the matching.

As long as there is some equality subgraph that contains an M-

augmenting path, the size of the matching can increase, until a perfect

matching is achieved.
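The update M ⊕ P can be sketched with Python's set symmetric-difference operator. The example below (illustrative names and a tiny hand-made path, not from the text) flips one matched edge and two unmatched edges, growing the matching by one:

```python
def augment(M, P):
    """Update a matching with an augmenting path via symmetric difference.

    M is a set of (l, r) edges; P is the path's edges as (l, r) pairs
    (ignoring direction), alternating unmatched and matched edges.
    """
    return M ^ set(P)

# Path l0 - r0 - l1 - r1: unmatched (0,0), matched (1,0), unmatched (1,1).
M = {(1, 0)}
P = [(0, 0), (1, 0), (1, 1)]
print(augment(M, P))  # {(0, 0), (1, 1)}: one more edge than M
```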

Four questions arise:

1. What initial feasible vertex labeling should the algorithm start

with? Answer: the default vertex labeling given by equations

(25.1) and (25.2).

2. What initial matching in Gh should the algorithm start with?

Short answer: any matching, even an empty matching, but a

greedy maximal matching works well.

3. If an M-augmenting path exists in Gh, how to find it? Short answer: use a variant of breadth-first search similar to the

second phase of the procedure used in the Hopcroft-Karp

algorithm to find a maximal set of shortest M-augmenting paths.

4. What if the search for an M-augmenting path fails? Short

answer: update the feasible vertex labeling to bring in at least one

new edge.

We’ll elaborate on the short answers using the example that starts in

Figure 25.4. Here, L = {l1, l2, … , l7} and R = {r1, r2, … , r7}. The edge weights appear in the matrix shown in part (a), where the weight w(li, rj) appears in row i and column j. The feasible vertex labels, given by the default vertex labeling, appear to the left of and above the matrix. Matrix entries in red indicate edges (li, rj) for which li.h + rj.h = w(li, rj), that is, edges in the equality subgraph Gh appearing in part (b) of the figure.

Greedy maximal bipartite matching

There are several ways to implement a greedy method to find a maximal bipartite matching. The procedure GREEDY-BIPARTITE-MATCHING shows one. Edges in Figure 25.4(b) highlighted in blue indicate the initial greedy maximal matching in Gh. Exercise 25.3-2 asks you to show that the GREEDY-BIPARTITE-MATCHING procedure returns a matching that is at least half the size of a maximum matching.

GREEDY-BIPARTITE-MATCHING(G)
1  M = Ø
2  for each vertex l ∈ L
3      if l has an unmatched neighbor in R
4          choose any such unmatched neighbor r ∈ R
5          M = M ∪ {(l, r)}
6  return M
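A minimal Python rendering of the same greedy scan might look like this (the adjacency-list representation is our choice, not the text's):

```python
def greedy_bipartite_matching(adj, n_r):
    """Greedy maximal matching; adj[l] lists the neighbors of l in R.

    Mirrors GREEDY-BIPARTITE-MATCHING: scan L once, matching each vertex
    to any currently unmatched neighbor.
    """
    matched_r = [False] * n_r
    M = []
    for l, neighbors in enumerate(adj):
        for r in neighbors:
            if not matched_r[r]:
                matched_r[r] = True
                M.append((l, r))
                break
    return M

# l0 grabs r0 first, starving l1 (whose only neighbor is r0): the greedy
# matching has 2 edges, while the maximum matching (l0-r1, l1-r0, l2-r2)
# has 3 -- consistent with the at-least-half guarantee of Exercise 25.3-2.
adj = [[0, 1], [0], [0, 2]]
print(greedy_bipartite_matching(adj, 3))  # [(0, 0), (2, 2)]
```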


Figure 25.4 The start of the Hungarian algorithm. (a) The matrix of edge weights for a bipartite graph with L = { l 1, l 2, … , l 7}. The value in row i and column j indicates w( li, rj). Feasible vertex labels appear above and next to the matrix. Red entries correspond to edges in the equality subgraph. (b) The equality subgraph Gh. Edges highlighted in blue belong to the initial greedy maximal matching M. Blue vertices are matched, and tan vertices are unmatched. (c) The directed equality subgraph GM,h created from Gh by directing edges in M from R to L and all other edges from L to R.

Finding an M-augmenting path in Gh

To find an M-augmenting path in the equality subgraph Gh with a matching M, the Hungarian algorithm first creates the directed equality

subgraph GM,h from Gh, just as the Hopcroft-Karp algorithm creates GM from G. As in the Hopcroft-Karp algorithm, you can think of an

M-augmenting path as starting from an unmatched vertex in L, ending

at an unmatched vertex in R, taking unmatched edges from L to R, and taking matched edges from R to L. Thus, GM,h = (V, EM,h), where

EM,h = {(l, r) : l ∈ L, r ∈ R, and (l, r) ∈ Eh − M}    (edges from L to R)
     ∪ {(r, l) : r ∈ R, l ∈ L, and (l, r) ∈ M}    (edges from R to L).
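This orientation step is mechanical. Here is a sketch (vertex encoding and names are our own) that directs matched equality edges from R to L and all other equality edges from L to R:

```python
def directed_equality_subgraph(eq_edges, M):
    """Direct the equality-subgraph edges: matched edges run R -> L,
    all other equality edges run L -> R.

    Vertices are encoded as ('L', i) or ('R', j); eq_edges and M are
    collections of (i, j) pairs.
    """
    matched = set(M)
    arcs = []
    for (i, j) in eq_edges:
        if (i, j) in matched:
            arcs.append((('R', j), ('L', i)))
        else:
            arcs.append((('L', i), ('R', j)))
    return arcs

eq_edges = [(0, 1), (1, 2), (2, 0)]
M = [(1, 2)]
print(directed_equality_subgraph(eq_edges, M))
# [(('L', 0), ('R', 1)), (('R', 2), ('L', 1)), (('L', 2), ('R', 0))]
```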

Because an M-augmenting path in the directed equality subgraph GM,h is also an M-augmenting path in the equality subgraph Gh, it suffices to find M-augmenting paths in GM,h. Figure 25.4(c) shows the directed equality subgraph GM,h corresponding to the equality subgraph Gh

and matching M from part (b) of the figure.

With the directed equality subgraph GM,h in hand, the Hungarian

algorithm searches for an M-augmenting path from any unmatched

vertex in L to any unmatched vertex in R. Any exhaustive graph-search

method suffices. Here, we’ll use breadth-first search, starting from all

the unmatched vertices in L (just as the Hopcroft-Karp algorithm does

when creating the dag H), but stopping upon first discovering some

unmatched vertex in R. Figure 25.5 shows the idea. To start from all the unmatched vertices in L, initialize the first-in, first-out queue with all the unmatched vertices in L, rather than just one source vertex. Unlike

the dag H in the Hopcroft-Karp algorithm, here each vertex needs just

one predecessor, so that the breadth-first search creates a breadth-first

forest F = ( VF, EF). Each unmatched vertex in L is a root in F.

In Figure 25.5(g), the breadth-first search has found the M-

augmenting path 〈( l 4, r 2), ( r 2, l 1), ( l 1, r 3), ( r 3, l 6), ( l 6, r 5)〉. Figure

25.6(a) shows the new matching created by taking the symmetric

difference of the matching M in Figure 25.5(a) with this M-augmenting path.

When the search for an M-augmenting path fails

Having updated the matching M from an M-augmenting path, the

Hungarian algorithm updates the directed equality subgraph GM,h

according to the new matching and then starts a new breadth-first

search from all the unmatched vertices in L. Figure 25.6 shows the start of this process, picking up from Figure 25.5.

In Figure 25.6(d), the queue contains vertices l 4 and l 3. Neither of these vertices has an edge that leaves it, however, so that once these

vertices are removed from the queue, the queue becomes empty. The

Image 849

search terminates at this point, before discovering an unmatched vertex

in R to yield an M-augmenting path. Whenever this situation occurs, the most recently discovered vertices must belong to L. Why? Whenever

an unmatched vertex in R is discovered, the search has found an M-

augmenting path, and when a matched vertex in R is discovered, it has

an unvisited neighbor in L, which the search can then discover.

Recall that we have the freedom to work with any equality subgraph.

We can change the directed equality subgraph “on the fly,” as long as we

do not counteract the work already done. The Hungarian algorithm

updates the feasible vertex labeling h to fulfill the following criteria:

1. No edge in the breadth-first forest F leaves the directed equality

subgraph.

2. No edge in the matching M leaves the directed equality

subgraph.

3. At least one edge (l, r), where l ∈ L ∩ VF and r ∈ R − VF, goes into Eh, and hence into EM,h. Therefore, at least one vertex in R will be newly discovered.

Thus, at least one new edge enters the directed equality subgraph, and

any edge that leaves the directed equality subgraph belongs to neither

the matching M nor the breadth-first forest F. Newly discovered vertices in R are enqueued, but their distances are not necessarily 1 greater than

the distances of the most recently discovered vertices in L.

Image 850

Figure 25.5 Finding an M-augmenting path in GM,h by breadth-first search. (a) The directed equality subgraph GM,h from Figure 25.4(c). (b)–(g) Successive versions of the breadth-first forest F, shown as the vertices at each distance from the roots—the unmatched vertices in L

are discovered. In parts (b)–(f), the layer of vertices closest to the bottom of the figure are those in the first-in, first-out queue. For example, in part (b), the queue contains the roots 〈 l 4, l 5, l 7〉, and in part (e), the queue contains 〈 r 3, r 4〉, at distance 3 from the roots. In part (g), the unmatched vertex r 5 is discovered, so the breadth-first search terminates. The path 〈( l 4, r 2), ( r 2, l 1), ( l 1, r 3), ( r 3, l 6), ( l 6, r 5)〉, highlighted in orange in parts (a) and (g), is an M-augmenting path. Taking its symmetric difference with the matching M yields a new matching with one more edge than M.

Image 851

Image 852

Figure 25.6 (a) The new matching M and the new directed equality subgraph GM.h after updating the matching in Figure 25.5(a) with the M-augmenting path in Figure 25.5(g). (b)–(d) Successive versions of the breadth-first forest F in a new breadth-first search with roots l 5 and l 7. After the vertices l 4 and l 3 in part (d) have been removed from the queue, the queue becomes empty before the search can discover an unmatched vertex in R.

To update the feasible vertex labeling, the Hungarian algorithm first computes the value

δ = min {l.h + r.h − w(l, r) : l ∈ FL and r ∈ R − FR} ,    (25.4)

where FL = L ∩ VF and FR = R ∩ VF denote the vertices in the breadth-first forest F that belong to L and R, respectively. That is, δ is the smallest difference by which an edge incident on a vertex in FL missed being in the current equality subgraph Gh. The Hungarian algorithm then creates a new feasible vertex labeling, say h′, by subtracting δ from l.h for all vertices l ∈ FL and adding δ to r.h for all vertices r ∈ FR:

l.h′ = l.h − δ if l ∈ FL, and l.h′ = l.h otherwise ,
r.h′ = r.h + δ if r ∈ FR, and r.h′ = r.h otherwise .    (25.5)
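The relabeling step can be sketched directly from these two equations. In the snippet below (names and the 2 × 2 example are our own), FL and FR are the sets of discovered L- and R-indices when the queue empties:

```python
def relabel(w, hl, hr, FL, FR):
    """One relabeling step: compute delta per equation (25.4) and shift
    the labels per equation (25.5), in place.

    Assumes FL is nonempty and some r in R lies outside FR, as holds in
    the algorithm whenever the queue empties before a perfect matching.
    """
    n = len(w)
    delta = min(hl[l] + hr[r] - w[l][r]
                for l in FL for r in range(n) if r not in FR)
    for l in FL:
        hl[l] -= delta          # subtract delta from discovered L-labels
    for r in FR:
        hr[r] += delta          # add delta to discovered R-labels
    return delta

w = [[1, 9], [3, 4]]
hl, hr = [9, 4], [0, 0]
print(relabel(w, hl, hr, FL={0, 1}, FR={1}))  # delta = 1
print(hl, hr)  # [8, 3] [0, 1]; edge (l1, r0) now enters the equality subgraph
```

Note that edges with both endpoints discovered keep their label sum unchanged, so the forest built so far survives the relabeling.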

The following lemma shows that these changes achieve the three criteria

above.

Lemma 25.15

Let h be a feasible vertex labeling for the complete bipartite graph G

with equality subgraph Gh, and let M be a matching for Gh and F be a breadth-first forest being constructed for the directed equality subgraph

GM,h. Then, the labeling h′ in equation (25.5) is a feasible vertex labeling for G with the following properties:

1. If (u, v) is an edge in the breadth-first forest F for GM,h, then (u, v) ∈ EM,h′.

2. If (l, r) belongs to the matching M for Gh, then (r, l) ∈ EM,h′.

3. There exist vertices l ∈ FL and r ∈ R − FR such that (l, r) ∉ EM,h but (l, r) ∈ EM,h′.

Proof We first show that h′ is a feasible vertex labeling for G. Because h is a feasible vertex labeling, we have l.h + r.h ≥ w(l, r) for all l ∈ L and r ∈ R. In order for h′ not to be a feasible vertex labeling, we would need l.h′ + r.h′ < l.h + r.h for some l ∈ L and r ∈ R. The only way this could occur would be for some l ∈ FL and r ∈ R − FR. In this instance, the amount of the decrease equals δ, so that l.h′ + r.h′ = l.h − δ + r.h. By equation (25.4), we have l.h − δ + r.h ≥ w(l, r) for any l ∈ FL and r ∈ R − FR, so that l.h′ + r.h′ ≥ w(l, r). For all other edges, we have l.h′ + r.h′ ≥ l.h + r.h ≥ w(l, r). Thus, h′ is a feasible vertex labeling.

Now we show that each of the three desired properties holds:

1. If l ∈ FL and r ∈ FR, then we have l.h′ + r.h′ = l.h + r.h because δ is subtracted from the label of l and added to the label of r. Therefore, if an edge belongs to F for the directed graph GM,h, it also belongs to GM,h′.

2. We claim that at the time the Hungarian algorithm computes the new feasible vertex labeling h′, for every edge (l, r) ∈ M, we have l ∈ FL if and only if r ∈ FR. To see why, consider a matched vertex r and let (l, r) ∈ M. First suppose that r ∈ FR, so that the search discovered r and enqueued it. When r was removed from the queue, l was discovered, so l ∈ FL. Now suppose that r ∉ FR, so that r is undiscovered. We will show that l ∉ FL. The only edge in GM,h that enters l is (r, l), and since r is undiscovered, the search has not taken this edge; if l ∈ FL, it is not because of the edge (r, l). The only other way that a vertex in L can be in FL is if it is a root of the search, but only unmatched vertices in L are roots, and l is matched. Thus, l ∉ FL, and the claim is proved.

We already saw that l ∈ FL and r ∈ FR implies l.h′ + r.h′ = l.h + r.h. For the opposite case, when l ∈ L − FL and r ∈ R − FR, we have l.h′ = l.h and r.h′ = r.h, so that again l.h′ + r.h′ = l.h + r.h. Thus, if edge (l, r) is in the matching M for the equality subgraph Gh, then (r, l) ∈ EM,h′.

3. Let (l, r) be an edge not in Eh such that l ∈ FL, r ∈ R − FR, and δ = l.h + r.h − w(l, r). By the definition of δ, there is at least one such edge. Then we have

l.h′ + r.h′ = l.h − δ + r.h
            = l.h − (l.h + r.h − w(l, r)) + r.h
            = w(l, r),

and thus (l, r) ∈ Eh′. Since (l, r) is not in Eh, it is not in the matching M, so that in EM,h′ it must be directed from L to R. Thus, (l, r) ∈ EM,h′.

It is possible for an edge to belong to EM,h but not to EM,h′. By Lemma 25.15, any such edge belongs neither to the matching M nor to

the breadth-first forest F at the time that the new feasible vertex labeling

h′ is computed. (See Exercise 25.3-3.)

Going back to Figure 25.6(d), the queue became empty before an M-

augmenting path was found. Figure 25.7 shows the next steps taken by the algorithm. The value δ = 1 is achieved by the edge (l5, r3) because, in Figure 25.4(a), l5.h + r3.h − w(l5, r3) = 6 + 0 − 5 = 1. In Figure 25.7(a), the values of l3.h, l4.h, l5.h, and l7.h have decreased by 1 and the values of r2.h and r7.h have increased by 1 because these vertices are in F. As a result, the edges (l1, r2) and (l6, r7) leave GM,h and the edge (l5, r3) enters. Figure 25.7(b) shows the new directed equality subgraph GM,h. With edge (l5, r3) now in GM,h, Figure 25.7(c) shows that this edge is added to the breadth-first forest F, and r3 is added to the queue.

Parts (c)–(f) show the breadth-first forest continuing to be built until in

part (f), the queue once again becomes empty after vertex l 2, which has

no edges leaving, is removed. Again, the algorithm must update the

feasible vertex labeling and the directed equality subgraph. Now the

value of δ = 1 is achieved by three edges: ( l 1, r 6), ( l 5, r 6), and ( l 7, r 6).

As Figure 25.8 shows in parts (a) and (b), these edges enter GM,h, and edge ( l 6, r 3) leaves. Part (c) shows that edge ( l 1, r 6) is added to the breadth-first forest. (Either of edges ( l 5, r 6) or ( l 7, r 6) could have been added instead.) Because r 6 is unmatched, the search has found the M-

augmenting path 〈( l 5, r 3), ( r 3, l 1), ( l 1, r 6)〉, highlighted in orange.

Figure 25.9(a) shows GM,h after the matching M has been updated by taking its symmetric difference with the M-augmenting path. The

Hungarian algorithm starts its last breadth-first search, with vertex l 7 as

the only root. The search proceeds as shown in parts (b)–(h) of the

figure, until the queue becomes empty after removing l 4. This time, we

find that δ = 2, achieved by the five edges ( l 2, r 5), ( l 3, r 1), ( l 4, r 5), ( l 5, r 1), and ( l 5, r 5), each of which enters GM,h. Figure 25.10(a) shows the


results of decreasing the feasible vertex label of each vertex in FL by 2 and increasing the feasible vertex label of each vertex in FR by 2, and

Figure 25.10(b) shows the resulting directed equality subgraph GM,h.

Part (c) shows that edge ( l 3, r 1) is added to the breadth-first forest.

Since r 1 is an unmatched vertex, the search terminates, having found the

M-augmenting path 〈( l 7, r 7), ( r 7, l 3), ( l 3, r 1)〉, highlighted in orange. If r 1 had been matched, vertex r 5 would also have been added to the breadth-first forest, with any of l 2, l 4, or l 5 as its parent.

Figure 25.7 Updating the feasible vertex labeling and the directed equality subgraph GM,h when the queue becomes empty before finding an M-augmenting path. (a) With δ = 1, the values of l 3. h, l 4. h, l 5. h, and l 7. h decreased by 1 and r 2. h and r 7. h increased by 1. Edges ( l 1, r 2) and ( l 6, r 7) leave GM,h, and edge ( l 5, r 3) enters. These changes are highlighted in yellow. (b) The resulting directed equality subgraph GM,h. (c)–(f) With edge ( l 5, r 3) added to the breadth-first forest and r 3 added to the queue, the breadth-first search continues until the queue once again becomes empty in part (f).

After updating the matching M, the algorithm arrives at the perfect

matching shown for the equality subgraph Gh in Figure 25.11. By


Theorem 25.14, the edges in M form an optimal solution to the original

assignment problem given in the matrix. Here, the weights of edges ( l 1,

r 6), ( l 2, r 4), ( l 3, r 1), ( l 4, r 2), ( l 5, r 3), ( l 6, r 5), and ( l 7, r 7) sum to 65, which is the maximum weight of any matching.

The weight of the maximum-weight matching equals the sum of all

the feasible vertex labels. These problems—maximizing the weight of a

matching and minimizing the sum of the feasible vertex labels—are

“duals” of each other, in a similar vein to how the value of a maximum

flow equals the capacity of a minimum cut. Section 29.3 explores duality in more depth.

Figure 25.8 Another update to the feasible vertex labeling and directed equality subgraph GM,h because the queue became empty before finding an M-augmenting path. (a) With δ = 1, the values of l 1. h, l 2. h, l 3. h, l 4. h, l 5. h, and l 7. h decrease by 1, and r 2. h, r 3. h, r 4. h, and r 7. h increase by 1. Edge ( l 6, r 3) leaves GM,h, and edges ( l 1, r 6), ( l 5, r 6) and ( l 7, r 6) enter. (b) The resulting directed equality subgraph GM,h. (c) With edge ( l 1, r 6) added to the breadth-first forest and r 6

unmatched, the search terminates, having found the M-augmenting path 〈( l 5, r 3), ( r 3, l 1), ( l 1, r 6)〉, highlighted in orange in parts (b) and (c).

The Hungarian algorithm

The procedure HUNGARIAN on page 737 and its subroutine FIND-

AUGMENTING-PATH on page 738 follow the steps we have just seen.

The third property in Lemma 25.15 ensures that in line 23 of FIND-

AUGMENTING-PATH the queue Q is nonempty. The pseudocode

uses the attribute π to indicate predecessor vertices in the breadth-first

forest. Instead of coloring vertices, as in the BFS procedure on page

556, the search puts the discovered vertices into the sets FL and FR.

Because the Hungarian algorithm does not need breadth-first distances,

the pseudocode omits the d attribute computed by the BFS procedure.


Figure 25.9 (a) The new matching M and the new directed equality subgraph GM,h after updating the matching in Figure 25.8 with the M-augmenting path in Figure 25.8 parts (b) and (c). (b)–(h) Successive versions of the breadth-first forest F in a new breadth-first search with root l 7. After the vertex l 4 in part (h) has been removed from the queue, the queue becomes empty before the search discovers an unmatched vertex in R.

Now, let’s see why the Hungarian algorithm runs in O(n⁴) time, where n = |V|/2 and |E| = n² in the original graph G. (Below, we outline how to reduce the running time to O(n³).) You can go through the pseudocode of HUNGARIAN to verify that lines 1–6 and 11 take O(n²) time. The while loop of lines 7–10 iterates at most n times, since each iteration increases the size of the matching M by 1. Each test in line 7 can take constant time by just checking whether |M| < n, each update of M in line 9 takes O(n) time, and the updates in line 10 take O(n²) time.

To achieve the O(n⁴) time bound, it remains to show that each call of FIND-AUGMENTING-PATH runs in O(n³) time. Let’s call each execution of lines 10–22 a growth step. Ignoring the growth steps, you can verify that FIND-AUGMENTING-PATH is a breadth-first search. With the sets FL and FR represented appropriately, the breadth-first search takes O(V + E) = O(n²) time. Within a call of FIND-AUGMENTING-PATH, at most n growth steps can occur, since each growth step is guaranteed to discover at least one vertex in R. Since there are at most n² edges in GM,h, the for loop of lines 16–22 iterates at most n² times per call of FIND-AUGMENTING-PATH. The bottleneck is lines 10 and 15, which take O(n²) time per growth step, so that FIND-AUGMENTING-PATH takes O(n³) time.


Figure 25.10 Updating the feasible vertex labeling and directed equality subgraph GM,h. (a) Here, δ = 2, so the values of l 1. h, l 2. h, l 3. h, l 4. h, l 5. h, and l 7. h decreased by 2, and the values of r 2. h, r 3. h, r 4. h, r 6. h, and r 7. h increased by 2. Edges ( l 2, r 5), ( l 3, r 1), ( l 4, r 5), ( l 5, r 1), and ( l 5, r 5) enter GM,h. (b) The resulting directed graph GM,h. (c) With edge ( l 3, r 1) added to the breadth-first forest and r 1 unmatched, the search terminates, having found the M-augmenting path 〈( l 7, r 7), ( r 7, l 3), ( l 3, r 1)〉, highlighted in orange in parts (b) and (c).

Exercise 25.3-5 asks you to show that reconstructing the directed

equality subgraph GM,h in line 15 is actually unnecessary, so that its

cost can be eliminated. Reducing the cost of computing δ in line 10 to O(n) takes a little more effort and is the subject of Problem 25-2. With these changes, each call of FIND-AUGMENTING-PATH takes O(n²) time, so that the Hungarian algorithm runs in O(n³) time.


Figure 25.11 The final matching, shown for the equality subgraph Gh with blue edges and blue entries in the matrix. The weights of the edges in the matching sum to 65, which is the maximum for any matching in the original complete bipartite graph G, as well as the sum of all the final feasible vertex labels.

HUNGARIAN(G)
1  for each vertex l ∈ L
2      l.h = max {w(l, r) : r ∈ R}    // from equation (25.1)
3  for each vertex r ∈ R
4      r.h = 0                        // from equation (25.2)
5  let M be any matching in Gh (such as the matching returned by GREEDY-BIPARTITE-MATCHING)
6  from G, M, and h, form the equality subgraph Gh and the directed equality subgraph GM,h
7  while M is not a perfect matching in Gh
8      P = FIND-AUGMENTING-PATH(GM,h)
9      M = M ⊕ P
10     update the equality subgraph Gh and the directed equality subgraph GM,h
11 return M

FIND-AUGMENTING-PATH(GM,h)
1  Q = Ø
2  FL = Ø
3  FR = Ø
4  for each unmatched vertex l ∈ L
5      l.π = NIL
6      ENQUEUE(Q, l)
7      FL = FL ∪ {l}                  // forest F starts with unmatched vertices in L
8  repeat
9      if Q is empty                  // ran out of vertices to search from?
10         δ = min {l.h + r.h − w(l, r) : l ∈ FL and r ∈ R − FR}
11         for each vertex l ∈ FL
12             l.h = l.h − δ          // relabel according to equation (25.5)
13         for each vertex r ∈ FR
14             r.h = r.h + δ          // relabel according to equation (25.5)
15         from G, M, and h, form a new directed equality subgraph GM,h
16         for each new edge (l, r) in GM,h    // continue search with new edges
17             if r ∉ FR
18                 r.π = l            // discover r, add it to F
19                 if r is unmatched
20                     an M-augmenting path has been found (exit the repeat loop)
21                 else ENQUEUE(Q, r) // can search from r later
22                 FR = FR ∪ {r}
23     u = DEQUEUE(Q)                 // search from u
24     for each neighbor v of u in GM,h
25         if v ∈ L
26             v.π = u
27             FL = FL ∪ {v}          // discover v, add it to F
28             ENQUEUE(Q, v)          // can search from v later
29         elseif v ∉ FR              // v ∈ R; do the same as lines 18–22
30             v.π = u
31             if v is unmatched
32                 an M-augmenting path has been found (exit the repeat loop)
33             else ENQUEUE(Q, v)
34             FR = FR ∪ {v}
35 until an M-augmenting path has been found
36 using the predecessor attributes π, construct an M-augmenting path P by tracing back from the unmatched vertex in R
37 return P
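Putting the pieces together, here is a compact runnable sketch of the whole algorithm. It follows the logic of HUNGARIAN and FIND-AUGMENTING-PATH but is not a line-by-line transcription: it recomputes equality edges on the fly rather than materializing GM,h (in the spirit of Exercise 25.3-5), and for brevity it re-enqueues all of FL after a relabeling instead of tracking only the new edges:

```python
from collections import deque

def hungarian(w):
    """Maximum-weight perfect matching on an n x n weight matrix.

    Returns (total_weight, match_l), where match_l[l] = r for each l.
    """
    n = len(w)
    hl = [max(row) for row in w]          # default labeling, equation (25.1)
    hr = [0] * n                          # default labeling, equation (25.2)
    match_l = [-1] * n                    # match_l[l] = r, or -1 if unmatched
    match_r = [-1] * n                    # match_r[r] = l, or -1 if unmatched

    def find_augmenting_path():
        FL, FR = set(), set()
        parent = {}                       # parent[r] = l that discovered r
        Q = deque(l for l in range(n) if match_l[l] == -1)
        FL.update(Q)                      # roots: unmatched vertices of L
        while True:
            while Q:
                l = Q.popleft()
                for r in range(n):        # equality edges out of l
                    if r not in FR and hl[l] + hr[r] == w[l][r]:
                        parent[r] = l
                        FR.add(r)
                        if match_r[r] == -1:
                            return r, parent          # augmenting path found
                        FL.add(match_r[r])            # follow the matched edge
                        Q.append(match_r[r])
            # queue empty: relabel per equations (25.4) and (25.5)
            delta = min(hl[l] + hr[r] - w[l][r]
                        for l in FL for r in range(n) if r not in FR)
            for l in FL:
                hl[l] -= delta
            for r in FR:
                hr[r] += delta
            Q.extend(FL)                  # re-scan forest for new equality edges

    for _ in range(n):                    # n augmentations give a perfect matching
        r, parent = find_augmenting_path()
        while r != -1:                    # flip the edges along the path
            l = parent[r]
            match_r[r], match_l[l], r = l, r, match_l[l]
    return sum(w[l][match_l[l]] for l in range(n)), match_l

print(hungarian([[1, 9, 2],
                 [3, 4, 8],
                 [7, 5, 6]]))  # (24, [1, 2, 0])
```

The invariant from Lemma 25.15 (matched edges and forest edges survive a relabeling) is what makes the simple "re-enqueue FL" restart safe, at some cost in constant factors compared with the book's growth steps.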

Exercises

25.3-1

The FIND-AUGMENTING-PATH procedure checks in two places

(lines 19 and 31) whether a vertex it discovers in R is unmatched. Show

how to rewrite the pseudocode so that it checks for an unmatched

vertex in R in only one place. What is the downside of doing so?

25.3-2

Show that for any bipartite graph, the GREEDY-BIPARTITE-

MATCHING procedure on page 726 returns a matching at least half

the size of a maximum matching.

25.3-3

Show that if an edge (l, r) belongs to the directed equality subgraph GM,h but is not a member of GM,h′, where h′ is given by equation (25.5), then l ∈ L − FL and r ∈ FR at the time that h′ is computed.

25.3-4

At line 29 in the FIND-AUGMENTING-PATH procedure, it has already been established that v ∈ R. This line checks whether v has already been discovered by testing whether v ∈ FR. Why doesn’t the procedure need to check whether v has already been discovered in the case when v ∈ L, in lines 26–28?

25.3-5

Professor Hrabosky asserts that the directed equality subgraph GM,h

must be constructed and maintained by the Hungarian algorithm, so

that line 6 of HUNGARIAN and line 15 of FIND-AUGMENTING-

PATH are required. Argue that the professor is incorrect by showing

how to determine whether an edge belongs to EM,h without explicitly

constructing GM,h.

25.3-6

How can you modify the Hungarian algorithm to find a matching of

vertices in L to vertices in R that minimizes, rather than maximizes, the sum of the edge weights in the matching?

25.3-7

How can an assignment problem with | L| ≠ | R| be modified so that the

Hungarian algorithm solves it?

Problems

25-1 Perfect matchings in a regular bipartite graph

a. Problem 20-3 asked about Euler tours in directed graphs. Prove that a

connected, undirected graph G = ( V, E) has an Euler tour—a cycle

traversing each edge exactly once, though it may visit a vertex multiple times—if and only if the degree of every vertex in V is even.

b. Assuming that G is connected, undirected, and every vertex in V has even degree, give an O( E)-time algorithm to find an Euler tour of G, as in Problem 20-3(b).

c. Exercise 25.1-6 states that if G = ( V, E) is a d-regular bipartite graph, then it contains d disjoint perfect matchings. Suppose that d is an

exact power of 2. Give an algorithm to find all d disjoint perfect

matchings in a d-regular bipartite graph in Θ( E lg d) time.

25-2 Reducing the running time of the Hungarian algorithm to O(n³)
In this problem, you will show how to reduce the running time of the Hungarian algorithm from O(n⁴) to O(n³) by showing how to reduce the running time of the FIND-AUGMENTING-PATH procedure from O(n³) to O(n²). Exercise 25.3-5 demonstrates that line 6 of HUNGARIAN and line 15 of FIND-AUGMENTING-PATH are unnecessary. Now you will show how to reduce the running time of each execution of line 10 in FIND-AUGMENTING-PATH to O(n).

For each vertex r ∈ R − FR, define a new attribute r.σ, where

r.σ = min {l.h + r.h − w(l, r) : l ∈ FL} .

That is, r.σ indicates how close r is to being adjacent to some vertex l ∈ FL in the directed equality subgraph GM,h. Initially, before placing any vertices into FL, set r.σ to ∞ for all r ∈ R.

a. Show how to compute δ in line 10 in O(n) time, based on the σ attribute.

b. Show how to update all the σ attributes in O(n) time after δ has been computed.

c. Show that updating all the σ attributes when FL changes takes O(n²) time per call of FIND-AUGMENTING-PATH.

d. Conclude that the HUNGARIAN procedure can be implemented to run in O(n³) time.

25-3 Other matching problems

The Hungarian algorithm finds a maximum-weight perfect matching in

a complete bipartite graph. It is possible to use the Hungarian

algorithm to solve problems in other graphs by modifying the input

graph, running the Hungarian algorithm, and then possibly modifying

the output. Show how to solve the following matching problems in this

manner.

a. Give an algorithm to find a maximum-weight matching in a weighted

bipartite graph that is not necessarily complete and with all edge

weights positive.

b. Redo part (a), but with edge weights allowed to also be 0 or negative.

c. A cycle cover in a directed graph, not necessarily bipartite, is a set of edge-disjoint directed cycles such that each vertex lies on at most one cycle. Given nonnegative edge weights w(u, v), let C be the set of edges in a cycle cover, and define w(C) = ∑_{(u,v) ∈ C} w(u, v) to be the weight of the cycle cover. Give an algorithm to find a maximum-weight cycle cover.

25-4 Fractional matchings

It is possible to define a fractional matching. Given a graph G = (V, E), we define a fractional matching x as a function x : E → [0, 1] (real numbers between 0 and 1, inclusive) such that for every vertex u ∈ V, we have ∑_{(u,v) ∈ E} x(u, v) ≤ 1. The value of a fractional matching is ∑_{(u,v) ∈ E} x(u, v). The definition of a fractional matching is identical to that of a matching, except that a matching has the additional constraint that x(u, v) ∈ {0, 1} for all edges (u, v) ∈ E. Given a graph, we let M* denote a maximum matching and x* denote a fractional matching with maximum value.

denote a maximum matching and x* denote a fractional matching with

maximum value.


a. Argue that, for any bipartite graph, we must have ∑_{(u,v) ∈ E} x*(u, v) ≥ |M*|.

b. Prove that, for any bipartite graph, we must have ∑_{(u,v) ∈ E} x*(u, v) ≤ |M*|. (Hint: Give an algorithm that converts a fractional matching with an integer value to a matching.) Conclude that the maximum value of a fractional matching in a bipartite graph is the same as the size of a maximum-cardinality matching.

c. We can define a fractional matching in a weighted graph in the same manner: the value of the matching is now ∑_{(u,v) ∈ E} w(u, v) x(u, v).

Extend the results of the previous parts to show that in a weighted

bipartite graph, the maximum value of a weighted fractional matching

is equal to the value of a maximum weighted matching.

d. In a general graph, the analogous results do not necessarily hold.

Give an example of a small graph that is not bipartite for which the

fractional matching with maximum value is not a maximum

matching.

25-5 Computing vertex labels

You are given a complete bipartite graph G = ( V, E) with edge weights w( l, r) for all ( l, r) ∈ E. You are also given a maximum-weight perfect matching M* for G. You wish to compute a feasible vertex labeling h such that M* is a perfect matching in the equality subgraph Gh. That is, you want to compute a labeling h of vertices such that

l. h + r. h ≥ w( l, r) for all ( l, r) ∈ E,  (25.6)

l. h + r. h = w( l, r) for all ( l, r) ∈ M*.  (25.7)

(Requirement (25.6) holds for all edges, and the stronger requirement (25.7) holds for all edges in M*.) Give an algorithm to compute the feasible vertex labeling h, and prove that it is correct. ( Hint: Use the similarity between conditions (25.6) and (25.7) and some of the properties of shortest paths proved in Chapter 22, in particular the triangle inequality (Lemma 22.10) and the convergence property (Lemma 22.14).)
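Although the algorithm itself is the exercise, the two conditions are easy to verify mechanically. Below is a hedged Python sketch (all names are my own, not the book's) that checks feasibility (25.6) on every edge and tightness (25.7) on every edge of M*, which is exactly what places M* inside the equality subgraph Gh.

```python
def certifies(h, w, L, R, M_star, eps=1e-9):
    """Check that labeling h is feasible -- h[l] + h[r] >= w(l, r) for
    every edge -- and tight (h[l] + h[r] == w(l, r)) on every edge of
    M*, so that M* lies in the equality subgraph G_h."""
    feasible = all(h[l] + h[r] >= w[l, r] - eps for l in L for r in R)
    tight = all(abs(h[l] + h[r] - w[l, r]) <= eps for (l, r) in M_star)
    return feasible and tight
```

Note that the default labeling (h[l] = max over r of w(l, r) and h[r] = 0) is always feasible but is generally not tight on M*, which is why the problem is nontrivial.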


Chapter notes

Matching algorithms have a long history and have been central to many

breakthroughs in algorithm design and analysis. The book by Lovász

and Plummer [306] is an excellent reference on matching problems, and the chapter on matching in the book by Ahuja, Magnanti and Orlin [10]

also has extensive references.

The Hopcroft-Karp algorithm is by Hopcroft and Karp [224].

Madry [308] gave an Õ( E 10/7)-time algorithm, which is asymptotically faster than Hopcroft-Karp for sparse graphs.

Corollary 25.4 is due to Berge [53], and it also holds in graphs that are not bipartite. Matching in general graphs requires more complicated

algorithms. The first polynomial-time algorithm, running in O( V 4) time, is due to Edmonds [130] (in a paper that also introduced the notion of a polynomial-time algorithm). Like the bipartite case, this

algorithm also uses augmenting paths, although the algorithm for

finding augmenting paths in general graphs is more involved than the

one for bipartite graphs. Subsequently, several O(√V E)-time algorithms appeared, including ones by Gabow and Tarjan [168] as part of an algorithm for weighted matching and a simpler one by Gabow [164].

The Hungarian algorithm is described in the book by Bondy and

Murty [67] and is based on work by Kuhn [273] and Munkres [337].

Kuhn adopted the name “Hungarian algorithm” because the algorithm

derived from work by the Hungarian mathematicians D. Kőnig and J.

Egerváry. The algorithm is an early example of a primal-dual algorithm.

A faster algorithm that runs in O(√V E lg( VW)) time, where the edge weights are integers from 0 to W, was given by Gabow and Tarjan [167], and an algorithm with the same time bound for maximum-weight matching in general graphs was given by Duan, Pettie, and Su [127].

The stable-marriage problem was first defined and analyzed by Gale

and Shapley [169]. The stable-marriage problem has numerous variants.

The books by Gusfield and Irving [203], Knuth [266], and Manlove

[313] serve as excellent sources for cataloging and solving them.

1 The definition of a complete bipartite graph differs from the definition of complete graph given on page 1167 because in a bipartite graph, there are no edges between vertices in L and no edges between vertices in R.

2 Although marriage norms are changing, it’s traditional to view the stable-marriage problem through the lens of heterosexual marriage.

Part VII Selected Topics

Introduction

This part contains a selection of algorithmic topics that extend and

complement earlier material in this book. Some chapters introduce new

models of computation such as circuits or parallel computers. Others

cover specialized domains such as matrices or number theory. The last

two chapters discuss some of the known limitations to the design of

efficient algorithms and introduce techniques for coping with those

limitations.

Chapter 26 presents an algorithmic model for parallel computing based on task-parallel computing, and more specifically, fork-join

parallelism. The chapter introduces the basics of the model, showing

how to quantify parallelism in terms of the measures of work and span.

It then investigates several interesting fork-join algorithms, including

algorithms for matrix multiplication and merge sorting.

An algorithm that receives its input over time, rather than having the

entire input available at the start, is called an “online” algorithm.

Chapter 27 examines techniques used in online algorithms, starting with the “toy” problem of how long to wait for an elevator before taking the

stairs. It then studies the “move-to-front” heuristic for maintaining a

linked list and finishes with the online version of the caching problem

we saw back in Section 15.4. The analyses of these online algorithms are remarkable in that they prove that these algorithms, which do not know

their future inputs, perform within a constant factor of optimal

algorithms that know the future inputs.

Chapter 28 studies efficient algorithms for operating on matrices. It presents two general methods—LU decomposition and LUP

decomposition—for solving linear equations by Gaussian elimination in

O( n 3) time. It also shows that matrix inversion and matrix

multiplication can be performed equally fast. The chapter concludes by

showing how to compute a least-squares approximate solution when a

set of linear equations has no exact solution.

Chapter 29 studies how to model problems as linear programs, where the goal is to maximize or minimize an objective, given limited resources

and competing constraints. Linear programming arises in a variety of

practical application areas. The chapter also addresses the concept of

“duality” which, by establishing that a maximization problem and

minimization problem have the same objective value, helps to show that

solutions to each are optimal.

Chapter 30 studies operations on polynomials and shows how to use

a well-known signal-processing technique—the fast Fourier transform

(FFT)—to multiply two degree- n polynomials in O( n lg n) time. It also derives a parallel circuit to compute the FFT.

Chapter 31 presents number-theoretic algorithms. After reviewing elementary number theory, it presents Euclid’s algorithm for computing

greatest common divisors. Next, it studies algorithms for solving

modular linear equations and for raising one number to a power

modulo another number. Then, it explores an important application of

number-theoretic algorithms: the RSA public-key cryptosystem. This

cryptosystem can be used not only to encrypt messages so that an

adversary cannot read them, but also to provide digital signatures. The

chapter finishes with the Miller-Rabin randomized primality test, which

enables finding large primes efficiently—an essential requirement for the

RSA system.

Chapter 32 studies the problem of finding all occurrences of a given pattern string in a given text string, a problem that arises frequently in

text-editing programs. After examining the naive approach, the chapter

presents an elegant approach due to Rabin and Karp. Then, after

showing an efficient solution based on finite automata, the chapter

presents the Knuth-Morris-Pratt algorithm, which modifies the

automaton-based algorithm to save space by cleverly preprocessing the

pattern. The chapter finishes by studying suffix arrays, which can not

only find a pattern in a text string, but can do quite a bit more, such as

finding the longest repeated substring in a text and finding the longest

common substring appearing in two texts.

Chapter 33 examines three algorithms within the expansive field of machine learning. Machine-learning algorithms are designed to take in

vast amounts of data, devise hypotheses about patterns in the data, and

test these hypotheses. The chapter starts with k-means clustering, which

groups data elements into k classes based on how similar they are to each other. It then shows how to use the technique of multiplicative

weights to make predictions accurately based on a set of “experts” of

varying quality. Perhaps surprisingly, even without knowing which

experts are reliable and which are not, you can predict almost as

accurately as the most reliable expert. The chapter finishes with gradient

descent, an optimization technique that finds a local minimum value for

a function. Gradient descent has many applications, including finding

parameter settings for many machine-learning models.

Chapter 34 concerns NP-complete problems. Many interesting

computational problems are NP-complete, but no polynomial-time

algorithm is known for solving any of them. This chapter presents

techniques for determining when a problem is NP-complete, using them

to prove several classic problems NP-complete: determining whether a

graph has a hamiltonian cycle (a cycle that includes every vertex),

determining whether a boolean formula is satisfiable (whether there

exists an assignment of boolean values to its variables that causes the

formula to evaluate to TRUE), and determining whether a given set of

numbers has a subset that adds up to a given target value. The chapter

also proves that the famous traveling-salesperson problem (find a

shortest route that starts and ends at the same location and visits each

of a set of locations once) is NP-complete.

Chapter 35 shows how to find approximate solutions to NP-

complete problems efficiently by using approximation algorithms. For

some NP-complete problems, approximate solutions that are near

optimal are quite easy to produce, but for others even the best

approximation algorithms known work progressively more poorly as the

problem size increases. Then, there are some problems for which investing increasing amounts of computation time yields increasingly

better approximate solutions. This chapter illustrates these possibilities

with the vertex-cover problem (unweighted and weighted versions), an

optimization version of 3-CNF satisfiability, the traveling-salesperson

problem, the set-covering problem, and the subset-sum problem.

26 Parallel Algorithms

The vast majority of algorithms in this book are serial algorithms

suitable for running on a uniprocessor computer that executes only one

instruction at a time. This chapter extends our algorithmic model to

encompass parallel algorithms, where multiple instructions can execute

simultaneously. Specifically, we’ll explore the elegant model of task-

parallel algorithms, which are amenable to algorithmic design and

analysis. Our study focuses on fork-join parallel algorithms, the most

basic and best understood kind of task-parallel algorithm. Fork-join

parallel algorithms can be expressed cleanly using simple linguistic

extensions to ordinary serial code. Moreover, they can be implemented

efficiently in practice.

Parallel computers—computers with multiple processing units—are

ubiquitous. Handheld, laptop, desktop, and cloud machines are all

multicore computers, or simply, multicores, containing multiple

processing “cores.” Each processing core is a full-fledged processor that

can directly access any location in a common shared memory.

Multicores can be aggregated into larger systems, such as clusters, by

using a network to interconnect them. These multicore clusters usually

have a distributed memory, where one multicore’s memory cannot be

accessed directly by a processor in another multicore. Instead, the

processor must explicitly send a message over the cluster network to a

processor in the remote multicore to request any data it requires. The

most powerful clusters are supercomputers, comprising many thousands

of multicores. But since shared-memory programming tends to be

conceptually easier than distributed-memory programming, and

multicore machines are widely available, this chapter focuses on parallel algorithms for multicores.

One approach to programming multicores is thread parallelism. This

processor-centric parallel-programming model employs a software

abstraction of “virtual processors,” or threads that share a common

memory. Each thread maintains its own program counter and can

execute code independently of the other threads. The operating system

loads a thread onto a processing core for execution and switches it out

when another thread needs to run.

Unfortunately, programming a shared-memory parallel computer

using threads tends to be difficult and error-prone. One reason is that it

can be complicated to dynamically partition the work among the

threads so that each thread receives approximately the same load. For

any but the simplest of applications, the programmer must use complex

communication protocols to implement a scheduler that load-balances

the work.

Task-parallel programming

The difficulty of thread programming has led to the creation of task-

parallel platforms, which provide a layer of software on top of threads

to coordinate, schedule, and manage the processors of a multicore.

Some task-parallel platforms are built as runtime libraries, but others

provide full-fledged parallel languages with compiler and runtime

support.

Task-parallel programming allows parallelism to be specified in a

“processor-oblivious” fashion, where the programmer identifies what

computational tasks may run in parallel but does not indicate which

thread or processor performs the task. Thus, the programmer is freed

from worrying about communication protocols, load balancing, and

other vagaries of thread programming. The task-parallel platform

contains a scheduler, which automatically load-balances the tasks across

the processors, thereby greatly simplifying the programmer’s chore.

Task-parallel algorithms provide a natural extension to ordinary serial

algorithms, allowing performance to be reasoned about mathematically

using “work/span analysis.”

Fork-join parallelism

Although the functionality of task-parallel environments is still evolving

and increasing, almost all support fork-join parallelism, which is

typically embodied in two linguistic features: spawning and parallel loops. Spawning allows a subroutine to be “forked”: executed like a

subroutine call, except that the caller can continue to execute while the

spawned subroutine computes its result. A parallel loop is like an

ordinary for loop, except that multiple iterations of the loop can execute

at the same time.

Fork-join parallel algorithms employ spawning and parallel loops to

describe parallelism. A key aspect of this parallel model, inherited from

the task-parallel model but different from the thread model, is that the

programmer does not specify which tasks in a computation must run in

parallel, only which tasks may run in parallel. The underlying runtime

system uses threads to load-balance the tasks across the processors. This

chapter investigates parallel algorithms described in the fork-join

model, as well as how the underlying runtime system can schedule task-

parallel computations (which include fork-join computations)

efficiently.

Fork-join parallelism offers several important advantages:

The fork-join programming model is a simple extension of the

familiar serial programming model used in most of this book. To

describe a fork-join parallel algorithm, the pseudocode in this

book needs just three added keywords: parallel, spawn, and sync.

Deleting these parallel keywords from the parallel pseudocode

results in ordinary serial pseudocode for the same problem, which

we call the “serial projection” of the parallel algorithm.

The underlying task-parallel model provides a theoretically clean

way to quantify parallelism based on the notions of “work” and

“span.”

Spawning allows many divide-and-conquer algorithms to be

parallelized naturally. Moreover, just as serial divide-and-conquer

algorithms lend themselves to analysis using recurrences, so do

parallel algorithms in the fork-join model.


The fork-join programming model is faithful to how multicore

programming has been evolving in practice. A growing number of

multicore environments support one variant or another of fork-

join parallel programming, including Cilk [290, 291, 383, 396], Habanero-Java [466], the Java Fork-Join Framework [279], OpenMP [81], Task Parallel Library [289], Threading Building Blocks [376], and X10 [82].

Section 26.1 introduces parallel pseudocode, shows how the

execution of a task-parallel computation can be modeled as a directed

acyclic graph, and presents the metrics of work, span, and parallelism,

which you can use to analyze parallel algorithms. Section 26.2

investigates how to multiply matrices in parallel, and Section 26.3

tackles the tougher problem of designing an efficient parallel merge sort.

26.1 The basics of fork-join parallelism

Our exploration of parallel programming begins with the problem of

computing Fibonacci numbers recursively in parallel. We’ll look at a

straightforward serial Fibonacci calculation, which, although inefficient,

serves as a good illustration of how to express parallelism in

pseudocode.

Recall that the Fibonacci numbers are defined by equation (3.31) on page 69:

F 0 = 0,
F 1 = 1,
Fn = Fn−1 + Fn−2  for n ≥ 2.

To calculate the n th Fibonacci number recursively, you could use the ordinary serial algorithm in the procedure FIB on the facing page. You

would not really want to compute large Fibonacci numbers this way,

because this computation does needless repeated work, but parallelizing

it can be instructive.

FIB( n)

1  if n ≤ 1
2      return n
3  else x = FIB( n − 1)
4      y = FIB( n − 2)
5      return x + y
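For reference, the pseudocode transcribes directly into Python (an illustration of mine, not the book's code):

```python
def fib(n):
    # Direct transcription of FIB: exponential time, since both
    # subproblems are recomputed from scratch at every level.
    if n <= 1:
        return n
    x = fib(n - 1)
    y = fib(n - 2)
    return x + y
```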

To analyze this algorithm, let T ( n) denote the running time of FIB

( n). Since FIB ( n) contains two recursive calls plus a constant amount of extra work, we obtain the recurrence

T ( n) = T ( n − 1) + T ( n − 2) + Θ(1).

This recurrence has solution T ( n) = Θ( Fn), which we can establish by using the substitution method (see Section 4.3). To show that T ( n) = O( Fn), we'll adopt the inductive hypothesis that T ( n) ≤ aFn − b, where a > 1 and b > 0 are constants. Substituting, we obtain

T ( n) ≤ ( aFn−1 − b) + ( aFn−2 − b) + Θ(1)
      = a( Fn−1 + Fn−2) − 2 b + Θ(1)
      ≤ aFn − b,

if we choose b large enough to dominate the upper-bound constant in

the Θ(1) term. We can then choose a large enough to upper-bound the

Θ(1) base case for small n. To show that T ( n) = Ω( Fn), we use the inductive hypothesis T ( n) ≥ aFn − b (with different constants a and b). Substituting and following reasoning similar to the asymptotic upper-bound argument, we

establish this hypothesis by choosing b smaller than the lower-bound

constant in the Θ(1) term and a small enough to lower-bound the Θ(1)

base case for small n. Theorem 3.1 on page 56 then establishes that T ( n) = Θ( Fn), as desired. Since Fn = Θ( ϕn), where ϕ = (1 + √5)/2 is the golden ratio, by equation (3.34) on page 69, it follows that T ( n) = Θ( ϕn). Thus this procedure is a particularly slow way to compute Fibonacci numbers, since it runs in exponential time. (See Problem 31-3 on page 954 for faster ways.)

Let’s see why the algorithm is inefficient. Figure 26.1 shows the tree of recursive procedure instances created when computing F 6 with the

FIB procedure. The call to FIB(6) recursively calls FIB(5) and then

FIB(4). But, the call to FIB(5) also results in a call to FIB(4). Both

instances of FIB(4) return the same result ( F 4 = 3). Since the FIB

procedure does not memoize (recall the definition of “memoize” from

page 368), the second call to FIB(4) replicates the work that the first call

performs, which is wasteful.
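One way to see the scale of the repetition is to instrument the recursion and count invocations per argument. The sketch below (my own instrumentation, not from the text) does this with Python's Counter:

```python
from collections import Counter

calls = Counter()

def fib_counted(n):
    calls[n] += 1              # record every invocation, per argument
    if n <= 1:
        return n
    return fib_counted(n - 1) + fib_counted(n - 2)

fib_counted(6)
# calls[4] is 2: FIB(4) is computed once under FIB(5) and once
# directly under FIB(6) -- exactly the repetition visible in Figure 26.1.
```

For the full run of FIB(6), only 7 distinct arguments occur, yet 25 calls are made in total; the gap between those two numbers is the wasted work.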

Figure 26.1 The invocation tree for FIB(6). Each node in the tree represents a procedure instance whose children are the procedure instances it calls during its execution. Since each instance of FIB with the same argument does the same work to produce the same result, the inefficiency of this algorithm for computing the Fibonacci numbers can be seen by the vast number of repeated calls to compute the same thing. The portion of the tree shaded blue appears in task-parallel form in Figure 26.2.

Although the FIB procedure is a poor way to compute Fibonacci

numbers, it can help us warm up to parallelism concepts. Perhaps the

most basic concept to understand is that if two parallel tasks operate

on entirely different data, then—absent other interference—they each

produce the same outcomes when executed at the same time as when

they run serially one after the other. Within FIB ( n), for example, the

two recursive calls in line 3 to FIB ( n − 1) and in line 4 to FIB ( n − 2)

can safely execute in parallel because the computation performed by one in no way affects the other.

Parallel keywords

The P-FIB procedure on the next page computes Fibonacci numbers,

but using the parallel keywords spawn and sync to indicate parallelism in

the pseudocode.

If the keywords spawn and sync are deleted from P-FIB, the resulting

pseudocode text is identical to FIB (other than renaming the procedure

in the header and in the two recursive calls). We define the serial

projection 1 of a parallel algorithm to be the serial algorithm that results from ignoring the parallel directives, which in this case can be done by

omitting the keywords spawn and sync. For parallel for loops, which

we’ll see later on, we omit the keyword parallel. Indeed, our parallel

pseudocode possesses the elegant property that its serial projection is

always ordinary serial pseudocode to solve the same problem.

P-FIB( n)

1  if n ≤ 1
2      return n
3  else x = spawn P-FIB( n − 1)  // don't wait for subroutine to return
4      y = P-FIB( n − 2)         // in parallel with spawned subroutine
5      sync                      // wait for spawned subroutine to finish
6      return x + y
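The spawn/sync semantics can be mimicked in ordinary Python with one thread per spawn. This is an illustrative sketch only: a real task-parallel runtime multiplexes tasks onto a small pool of threads rather than creating a thread per spawn, and CPython threads do not give true CPU parallelism, so the point here is the structure, not the speedup.

```python
import threading

def p_fib(n):
    if n <= 1:
        return n
    result = {}
    child = threading.Thread(
        target=lambda: result.update(x=p_fib(n - 1)))
    child.start()          # "spawn": the child runs concurrently
    y = p_fib(n - 2)       # the parent continues alongside the child
    child.join()           # "sync": wait for the spawned child to finish
    return result["x"] + y
```

Reading result["x"] only after join() mirrors the rule that a procedure cannot safely use a spawned child's return value until after a sync.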

Semantics of parallel keywords

Spawning occurs when the keyword spawn precedes a procedure call, as

in line 3 of P-FIB. The semantics of a spawn differs from an ordinary

procedure call in that the procedure instance that executes the spawn—

the parent—may continue to execute in parallel with the spawned

subroutine—its child—instead of waiting for the child to finish, as

would happen in a serial execution. In this case, while the spawned child

is computing P-FIB ( n − 1), the parent may go on to compute P-FIB

( n−2) in line 4 in parallel with the spawned child. Since the P-FIB

procedure is recursive, these two subroutine calls themselves create nested parallelism, as do their children, thereby creating a potentially

vast tree of subcomputations, all executing in parallel.

The keyword spawn does not say, however, that a procedure must

execute in parallel with its spawned children, only that it may. The parallel keywords express the logical parallelism of the computation,

indicating which parts of the computation may proceed in parallel. At

runtime, it is up to a scheduler to determine which subcomputations

actually run in parallel by assigning them to available processors as the

computation unfolds. We’ll discuss the theory behind task-parallel

schedulers shortly (on page 759).

A procedure cannot safely use the values returned by its spawned

children until after it executes a sync statement, as in line 5. The

keyword sync indicates that the procedure must wait as necessary for all

its spawned children to finish before proceeding to the statement after

the sync—the “join” of a fork-join parallel computation. The P-FIB

procedure requires a sync before the return statement in line 6 to avoid

the anomaly that would occur if x and y were summed before P-FIB ( n

− 1) had finished and its return value had been assigned to x. In

addition to explicit join synchronization provided by the sync statement,

it is convenient to assume that every procedure executes a sync implicitly

before it returns, thus ensuring that all children finish before their

parent finishes.

A graph model for parallel execution

It helps to view the execution of a parallel computation—the dynamic

stream of runtime instructions executed by processors under the

direction of a parallel program—as a directed acyclic graph G = ( V, E), called a (parallel) trace.2 Conceptually, the vertices in V are executed instructions, and the edges in E represent dependencies between

instructions, where ( u, v) ∈ E means that the parallel program required instruction u to execute before instruction v.

It’s sometimes inconvenient, especially if we want to focus on the

parallel structure of a computation, for a vertex of a trace to represent

only one executed instruction. Consequently, if a chain of instructions

contains no parallel or procedural control (no spawn, sync, procedure call, or return—via either an explicit return statement or the return that

happens implicitly upon reaching the end of a procedure), we group the

entire chain into a single strand. As an example, Figure 26.2 shows the trace that results from computing P-FIB(4) in the portion of Figure 26.1

shaded blue. Strands do not include instructions that involve parallel or

procedural control. These control dependencies must be represented as

edges in the trace.

When a parent procedure calls a child, the trace contains an edge ( u,

v) from the strand u in the parent that executes the call to the first strand v of the spawned child, as illustrated in Figure 26.2 by the edge from the orange strand in P-FIB(4) to the blue strand in P-FIB(2).

When the last strand v′ in the child returns, the trace contains an edge

( v′, u′) to the strand u′, where u′ is the successor strand of u in the parent, as with the edge from the white strand in P-FIB(2) to the white

strand in P-FIB(4).


Figure 26.2 The trace of P-FIB(4) corresponding to the shaded portion of Figure 26.1. Each circle represents one strand, with blue circles representing any instructions executed in the part of the procedure (instance) up to the spawn of P-FIB ( n − 1) in line 3; orange circles representing the instructions executed in the part of the procedure that calls P-FIB ( n − 2) in line 4 up to the sync in line 5, where it suspends until the spawn of P-FIB ( n − 1) returns; and white circles representing the instructions executed in the part of the procedure after the sync, where it sums x and y, up to the point where it returns the result. Strands belonging to the same procedure are grouped into a rounded rectangle, blue for spawned procedures and tan for called procedures. Assuming that each strand takes unit time, the work is 17 time units, since there are 17 strands, and the span is 8 time units, since the critical path—shown with blue edges—

contains 8 strands.

When the parent spawns a child, however, the trace is a little

different. The edge ( u, v) goes from parent to child as with a call, such as the edge from the blue strand in P-FIB(4) to the blue strand in P-FIB(3), but the trace contains another edge ( u, u′) as well, indicating that u’s successor strand u′ can continue to execute while v is executing.

The edge from the blue strand in P-FIB(4) to the orange strand in P-

FIB(4) illustrates one such edge. As with a call, there is an edge from the

last strand v′ in the child, but with a spawn, it no longer goes to u’s successor. Instead, the edge is ( v′, x), where x is the strand immediately following the sync in the parent that ensures that the child has finished,

as with the edge from the white strand in P-FIB(3) to the white strand

in P-FIB(4).

You can figure out what parallel control created a particular trace. If a strand has two successors, one of them must have been spawned, and

if a strand has multiple predecessors, the predecessors joined because of

a sync statement. Thus, in the general case, the set V forms the set of

strands, and the set E of directed edges represents dependencies between

strands induced by parallel and procedural control. If G contains a

directed path from strand u to strand v, we say that the two strands are (logically) in series. If there is no path in G either from u to v or from v to u, the strands are (logically) in parallel.
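These series/parallel relations are just reachability questions on the trace dag, and can be decided by depth-first search. A minimal sketch (the helper names are mine, not the book's):

```python
def reachable(adj, u, v):
    # Iterative DFS: is there a directed path from u to v in the trace?
    stack, seen = [u], set()
    while stack:
        s = stack.pop()
        if s == v:
            return True
        if s not in seen:
            seen.add(s)
            stack.extend(adj.get(s, ()))
    return False

def relation(adj, u, v):
    # Two strands are in series if either reaches the other; otherwise
    # no path exists in either direction and they are in parallel.
    if reachable(adj, u, v) or reachable(adj, v, u):
        return "series"
    return "parallel"
```

In a diamond-shaped trace a → b, a → c, b → d, c → d, strands a and d are in series, while b and c are in parallel and so may execute simultaneously.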

A fork-join parallel trace can be pictured as a dag of strands

embedded in an invocation tree of procedure instances. For example,

Figure 26.1 shows the invocation tree for FIB(6), which also serves as the invocation tree for P-FIB(6), the edges between procedure instances

now representing either calls or spawns. Figure 26.2 zooms in on the subtree that is shaded blue, showing the strands that constitute each

procedure instance in P-FIB(4). All directed edges connecting strands

run either within a procedure or along undirected edges of the

invocation tree in Figure 26.1. (More general task-parallel traces that are not fork-join traces may contain some directed edges that do not

run along the undirected tree edges.)

Our analyses generally assume that parallel algorithms execute on an

ideal parallel computer, which consists of a set of processors and a sequentially consistent shared memory. To understand sequential

consistency, you first need to know that memory is accessed by load

instructions, which copy data from a location in the memory to a

register within a processor, and by store instructions, which copy data from a processor register to a location in the memory. A single line of

pseudocode can entail several such instructions. For example, the line x

= y + z could result in load instructions to fetch each of y and z from memory into a processor, an instruction to add them together inside the

processor, and a store instruction to place the result x back into

memory. In a parallel computer, several processors might need to load

or store at the same time. Sequential consistency means that even if

multiple processors attempt to access the memory simultaneously, the

shared memory behaves as if exactly one instruction from one of the

processors is executed at a time, even though the actual transfer of data

may happen at the same time. It is as if the instructions were executed one at a time sequentially according to some global linear order among

all the processors that preserves the individual orders in which each

processor executes its own instructions.

For task-parallel computations, which are scheduled onto processors

automatically by a runtime system, the sequentially consistent shared

memory behaves as if a parallel computation’s executed instructions

were executed one by one in the order of a topological sort (see Section

20.4) of its trace. That is, you can reason about the execution by

imagining that the individual instructions (not generally the strands,

which may aggregate many instructions) are interleaved in some linear

order that preserves the partial order of the trace. Depending on

scheduling, the linear order could vary from one run of the program to

the next, but the behavior of any execution is always as if the

instructions executed serially in a linear order consistent with the

dependencies within the trace.

In addition to making assumptions about semantics, the ideal

parallel-computer model makes some performance assumptions.

Specifically, it assumes that each processor in the machine has equal

computing power, and it ignores the cost of scheduling. Although this

last assumption may sound optimistic, it turns out that for algorithms

with sufficient “parallelism” (a term we’ll define precisely a little later),

the overhead of scheduling is generally minimal in practice.

Performance measures

We can gauge the theoretical efficiency of a task-parallel algorithm

using work/span analysis, which is based on two metrics: “work” and

“span.” The work of a task-parallel computation is the total time to execute the entire computation on one processor. In other words, the

work is the sum of the times taken by each of the strands. If each strand

takes unit time, the work is just the number of vertices in the trace. The

span is the fastest possible time to execute the computation on an

unlimited number of processors, which corresponds to the sum of the

times taken by the strands along a longest path in the trace, where

“longest” means that each strand is weighted by its execution time. Such


a longest path is called the critical path of the trace, and thus the span is

the weight of the longest (weighted) path in the trace. (Section 22.2,

pages 617–619 shows how to find a critical path in a dag G = ( V, E) in Θ( V + E) time.) For a trace in which each strand takes unit time, the span equals the number of strands on the critical path. For example, the

trace of Figure 26.2 has 17 vertices in all and 8 vertices on its critical path, so that if each strand takes unit time, its work is 17 time units and

its span is 8 time units.
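To make these definitions concrete, the work and span of a unit-time trace can be computed directly from its dag by dynamic programming over a topological order. The sketch below is illustrative only: the dag encoding (each strand maps to the list of strands that depend on it) and the function name are our own, not from the text.

```python
from collections import deque

def work_and_span(dag):
    """Work and span of a trace whose strands each take unit time.
    dag maps each strand to the list of strands that depend on it."""
    work = len(dag)                      # work = total number of strands
    indeg = {u: 0 for u in dag}
    for u in dag:
        for v in dag[u]:
            indeg[v] += 1
    queue = deque(u for u in dag if indeg[u] == 0)
    longest = {u: 1 for u in dag}        # longest path ending at u, in strands
    while queue:
        u = queue.popleft()
        for v in dag[u]:
            longest[v] = max(longest[v], longest[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    span = max(longest.values())         # strands on a critical path
    return work, span
```

For the diamond-shaped trace a → {b, c} → d, this reports work 4 and span 3.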

The actual running time of a task-parallel computation depends not

only on its work and its span, but also on how many processors are

available and how the scheduler allocates strands to processors. To

denote the running time of a task-parallel computation on P processors,

we subscript by P. For example, we might denote the running time of an

algorithm on P processors by TP. The work is the running time on a

single processor, or T 1. The span is the running time if we could run

each strand on its own processor—in other words, if we had an

unlimited number of processors—and so we denote the span by T∞.

The work and span provide lower bounds on the running time TP of

a task-parallel computation on P processors:

In one step, an ideal parallel computer with P processors can do at most P units of work, and thus in TP time, it can perform at most P · TP work. Since the total work to do is T1, we have P · TP ≥ T1. Dividing by P yields the work law:

TP ≥ T1/P.    (26.2)

A P-processor ideal parallel computer cannot run any faster than a machine with an unlimited number of processors. Looked at another way, a machine with an unlimited number of processors can emulate a P-processor machine by using just P of its processors. Thus, the span law follows:

TP ≥ T∞.    (26.3)

We define the speedup of a computation on P processors by the ratio T1/TP, which says how many times faster the computation runs on P processors than on one processor. By the work law, we have TP ≥ T1/P, which implies that T1/TP ≤ P. Thus, the speedup on a P-processor ideal parallel computer can be at most P. When the speedup is linear in the number of processors, that is, when T1/TP = Θ(P), the computation exhibits linear speedup. Perfect linear speedup occurs when T1/TP = P.

The ratio T 1/ T∞ of the work to the span gives the parallelism of the parallel computation. We can view the parallelism from three

perspectives. As a ratio, the parallelism denotes the average amount of

work that can be performed in parallel for each step along the critical

path. As an upper bound, the parallelism gives the maximum possible

speedup that can be achieved on any number of processors. Perhaps

most important, the parallelism provides a limit on the possibility of

attaining perfect linear speedup. Specifically, once the number of

processors exceeds the parallelism, the computation cannot possibly

achieve perfect linear speedup. To see this last point, suppose that P > T1/T∞, in which case the span law implies that the speedup satisfies T1/TP ≤ T1/T∞ < P. Moreover, if the number P of processors in the ideal parallel computer greatly exceeds the parallelism—that is, if P ≫ T1/T∞—then T1/TP ≪ P, so that the speedup is much less than the number of processors. In other words, if the number of processors

exceeds the parallelism, adding even more processors makes the

speedup less perfect.

As an example, consider the computation P-FIB(4) in Figure 26.2,

and assume that each strand takes unit time. Since the work is T1 = 17 and the span is T∞ = 8, the parallelism is T1/T∞ = 17/8 = 2.125.

Consequently, achieving much more than double the performance is

impossible, no matter how many processors execute the computation.

For larger input sizes, however, we’ll see that P-FIB ( n) exhibits

substantial parallelism.
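The claim that the parallelism caps the speedup can be checked mechanically with the numbers from this example. A small arithmetic sketch, using the work and span laws as lower bounds on TP:

```python
T1, Tinf = 17, 8            # work and span of P-FIB(4), unit-time strands
parallelism = T1 / Tinf     # 17/8 = 2.125

for P in (1, 2, 4, 8, 100):
    TP_lower = max(T1 / P, Tinf)   # work law and span law lower bounds on TP
    speedup_upper = T1 / TP_lower  # best possible speedup on P processors
    # the speedup can never exceed the parallelism, however large P is
    assert speedup_upper <= parallelism + 1e-9
```

Already at P = 4, the span law (not the work law) is the binding constraint, and adding processors beyond that gains nothing.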

We define the (parallel) slackness of a task-parallel computation

executed on an ideal parallel computer with P processors to be the ratio (T1/T∞)/P = T1/(P T∞), which is the factor by which the parallelism of the computation exceeds the number of processors in the machine. Restating the bounds on speedup, if the slackness is less than 1, perfect linear speedup is impossible, because T1/(P T∞) < 1 and the span law imply that T1/TP ≤ T1/T∞ < P. Indeed, as the slackness decreases from 1 and approaches 0, the speedup of the computation diverges further

and further from perfect linear speedup. If the slackness is less than 1,

additional parallelism in an algorithm can have a great impact on its

execution efficiency. If the slackness is greater than 1, however, the work

per processor is the limiting constraint. We’ll see that as the slackness

increases from 1, a good scheduler can achieve closer and closer to

perfect linear speedup. But once the slackness is much greater than 1,

the advantage of additional parallelism shows diminishing returns.

Scheduling

Good performance depends on more than just minimizing the work and

span. The strands must also be scheduled efficiently onto the processors

of the parallel machine. Our fork-join parallel-programming model

provides no way for a programmer to specify which strands to execute

on which processors. Instead, we rely on the runtime system’s scheduler

to map the dynamically unfolding computation to individual processors.

In practice, the scheduler maps the strands to static threads, and the

operating system schedules the threads on the processors themselves.

But this extra level of indirection is unnecessary for our understanding

of scheduling. We can just imagine that the scheduler maps strands to

processors directly.

A task-parallel scheduler must schedule the computation without

knowing in advance when procedures will be spawned or when they will

finish—that is, it must operate online. Moreover, a good scheduler

operates in a distributed fashion, where the threads implementing the

scheduler cooperate to load-balance the computation. Provably good

online, distributed schedulers exist, but analyzing them is complicated.

Instead, to keep our analysis simple, we’ll consider an online centralized

scheduler that knows the global state of the computation at any

moment.


In particular, we’ll analyze greedy schedulers, which assign as many

strands to processors as possible in each time step, never leaving a

processor idle if there is work that can be done. We’ll classify each step

of a greedy scheduler as follows:

Complete step: At least P strands are ready to execute, meaning that all strands on which they depend have finished execution. A

greedy scheduler assigns any P of the ready strands to the

processors, completely utilizing all the processor resources.

Incomplete step: Fewer than P strands are ready to execute. A greedy scheduler assigns each ready strand to its own processor,

leaving some processors idle for the step, but executing all the

ready strands.

The work law tells us that the fastest running time TP that we can

hope for on P processors must be at least T 1/ P. The span law tells us that the fastest possible running time must be at least T∞. The following

theorem shows that greedy scheduling is provably good in that it

achieves the sum of these two lower bounds as an upper bound.

Theorem 26.1

On an ideal parallel computer with P processors, a greedy scheduler

executes a task-parallel computation with work T1 and span T∞ in time

TP ≤ T1/P + T∞.    (26.4)

Proof Without loss of generality, assume that each strand takes unit

time. (If necessary, replace each longer strand by a chain of unit-time

strands.) We’ll consider complete and incomplete steps separately.

In each complete step, the P processors together perform a total of P

work. Thus, if the number of complete steps is k, the total work

executing all the complete steps is kP. Since the greedy scheduler

doesn’t execute any strand more than once and only T1 work needs to be performed, it follows that kP ≤ T1, from which we can conclude that the number k of complete steps is at most T1/P.


Now, let’s consider an incomplete step. Let G be the trace for the

entire computation, let G′ be the subtrace of G that has yet to be executed at the start of the incomplete step, and let G″ be the subtrace

remaining to be executed after the incomplete step. Consider the set R

of strands that are ready at the beginning of the incomplete step, where

| R| < P. By definition, if a strand is ready, all its predecessors in trace G

have executed. Thus the predecessors of strands in R do not belong to

G′. A longest path in G′ must necessarily start at a strand in R, since every other strand in G′ has a predecessor and thus could not start a longest path. Because the greedy scheduler executes all ready strands

during the incomplete step, the strands of G″ are exactly those in G′ minus the strands in R. Consequently, the length of a longest path in G″ must be 1 less than the length of a longest path in G′. In other words,

every incomplete step decreases the span of the trace remaining to be

executed by 1. Hence, the number of incomplete steps can be at most

T∞. Since each step is either complete or incomplete, the theorem follows.
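Theorem 26.1 can also be checked empirically by simulating a greedy scheduler on a unit-time trace. In this sketch, the dag encoding (strand → list of successors) is our own, and each loop iteration of the simulation is one scheduler step, complete or incomplete:

```python
def greedy_schedule(dag, P):
    """Simulate a greedy scheduler on a unit-time trace and return the
    number of steps taken, i.e., the running time TP."""
    indeg = {u: 0 for u in dag}
    for u in dag:
        for v in dag[u]:
            indeg[v] += 1
    ready = [u for u in dag if indeg[u] == 0]
    steps = 0
    while ready:
        steps += 1
        run, ready = ready[:P], ready[P:]   # complete step iff len(run) == P
        for u in run:                       # executing a strand may make
            for v in dag[u]:                # its successors ready
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    return steps
```

For the diamond trace a → {b, c} → d with P = 2, the simulation takes 3 steps, within the bound T1/P + T∞ = 4/2 + 3 = 5.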

The following corollary shows that a greedy scheduler always

performs well.

Corollary 26.2

The running time TP of any task-parallel computation scheduled by a

greedy scheduler on a P-processor ideal parallel computer is within a factor of 2 of optimal.

Proof Let T*P be the running time produced by an optimal scheduler on a machine with P processors, and let T1 and T∞ be the work and span of the computation, respectively. Since the work and span laws—inequalities (26.2) and (26.3)—give T*P ≥ max {T1/P, T∞}, Theorem 26.1 implies that

TP ≤ T1/P + T∞ ≤ 2 · max {T1/P, T∞} ≤ 2T*P.

The next corollary shows that, in fact, a greedy scheduler achieves

near-perfect linear speedup on any task-parallel computation as the

slackness grows.

Corollary 26.3

Let TP be the running time of a task-parallel computation produced by

a greedy scheduler on an ideal parallel computer with P processors, and

let T1 and T∞ be the work and span of the computation, respectively. Then, if P ≪ T1/T∞, or equivalently, if the parallel slackness is much greater than 1, we have TP ≈ T1/P, a speedup of approximately P.

Proof If we suppose that P ≪ T1/T∞, then it follows that T∞ ≪ T1/P, and hence Theorem 26.1 gives TP ≤ T1/P + T∞ ≈ T1/P. Since the work law (26.2) dictates that TP ≥ T1/P, we conclude that TP ≈ T1/P, which is a speedup of T1/TP ≈ P.

The ≪ symbol denotes “much less,” but how much is “much less”?

As a rule of thumb, a slackness of at least 10—that is, 10 times more

parallelism than processors—generally suffices to achieve good speedup.

Then, the span term in the greedy bound, inequality (26.4), is less than

10% of the work-per-processor term, which is good enough for most

engineering situations. For example, if a computation runs on only 10 or

100 processors, it doesn’t make sense to value parallelism of, say, 1,000,000 over parallelism of 10,000, even with the factor of 100

difference. As Problem 26-2 shows, sometimes reducing extreme

parallelism yields algorithms that are better with respect to other

concerns and which still scale up well on reasonable numbers of

processors.
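The 10% figure follows directly from the greedy bound. A quick arithmetic check, with illustrative numbers chosen so that the slackness is exactly 10:

```python
P = 100
Tinf = 50.0
T1 = 10 * P * Tinf            # choose work so slackness T1/(P*Tinf) = 10
greedy_bound = T1 / P + Tinf  # inequality (26.4)

assert Tinf == 0.1 * (T1 / P)              # span term is 10% of work term
speedup = T1 / greedy_bound
assert abs(speedup / P - 1 / 1.1) < 1e-9   # about 91% of perfect linear speedup
```

So with slackness 10, even the pessimistic greedy bound guarantees a speedup within about 9% of perfect.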

Analyzing parallel algorithms

We now have all the tools we need to analyze parallel algorithms using

work/span analysis, allowing us to bound an algorithm’s running time


on any number of processors. Analyzing the work is relatively

straightforward, since it amounts to nothing more than analyzing the

running time of an ordinary serial algorithm, namely, the serial

projection of the parallel algorithm. You should already be familiar

with analyzing work, since that is what most of this textbook is about!

Analyzing the span is the new thing that parallelism engenders, but it’s

generally no harder once you get the hang of it. Let’s investigate the

basic ideas using the P-FIB program.

Analyzing the work T1(n) of P-FIB(n) poses no hurdles, because we’ve already done it. The serial projection of P-FIB is effectively the original FIB procedure, and hence we have T1(n) = T(n) = Θ(ϕ^n) from equation (26.1).

Figure 26.3 illustrates how to analyze the span. If two traces are joined in series, their spans add to form the span of their composition,

whereas if they are joined in parallel, the span of their composition is

the maximum of the spans of the two traces. As it turns out, the trace of

any fork-join parallel computation can be built up from single strands

by series-parallel composition.

Figure 26.3 Series-parallel composition of parallel traces. (a) When two traces are joined in series, the work of the composition is the sum of their work, and the span of the composition is the sum of their spans. (b) When two traces are joined in parallel, the work of the composition remains the sum of their work, but the span of the composition is only the maximum of their spans.

Armed with an understanding of series-parallel composition, we can

analyze the span of P-FIB ( n). The spawned call to P-FIB ( n − 1) in line


3 runs in parallel with the call to P-FIB ( n − 2) in line 4. Hence, we can

express the span of P-FIB ( n) as the recurrence

T∞( n) = max { T∞( n − 1), T∞( n − 2)} + Θ(1)

= T∞( n − 1) + Θ(1),

which has solution T∞( n) = Θ( n). (The second equality above follows from the first because P-FIB ( n − 1) uses P-FIB ( n − 2) in its computation, so that the span of P-FIB ( n − 1) must be at least as large

as the span of P-FIB ( n − 2).)
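Iterating this recurrence with unit cost per level (a modeling simplification; the constant c is our own stand-in for the Θ(1) term) confirms the linear solution:

```python
def span(n, c=1):
    """T∞(n) for P-FIB with cost c per recursion level: the span of the
    parallel composition is the max of the two subproblem spans, plus
    the constant work of the spawning instance."""
    if n < 2:
        return c
    return max(span(n - 1), span(n - 2)) + c

# the max is always achieved by the n-1 branch, so the span grows linearly
assert [span(n) for n in range(2, 7)] == [2, 3, 4, 5, 6]
```

The work recurrence, by contrast, sums the two branches rather than taking their maximum, which is exactly what makes it exponential.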

The parallelism of P-FIB(n) is T1(n)/T∞(n) = Θ(ϕ^n/n), which grows dramatically as n gets large. Thus, Corollary 26.3 tells us that on even

the largest parallel computers, a modest value for n suffices to achieve

near perfect linear speedup for P-FIB ( n), because this procedure

exhibits considerable parallel slackness.

Parallel loops

Many algorithms contain loops for which all the iterations can operate

in parallel. Although the spawn and sync keywords can be used to

parallelize such loops, it is more convenient to specify directly that the

iterations of such loops can run in parallel. Our pseudocode provides

this functionality via the parallel keyword, which precedes the for

keyword in a for loop statement.

As an example, consider the problem of multiplying a square n × n

matrix A = (aij) by an n-vector x = (xj). The resulting n-vector y = (yi) is given by the equation

yi = ∑_{j=1}^{n} aij xj

for i = 1, 2, … , n. The P-MAT-VEC procedure performs matrix-vector

multiplication (actually, y = y + Ax) by computing all the entries of y in parallel. The parallel for keywords in line 1 of P-MAT-VEC indicate

that the n iterations of the loop body, which includes a serial for loop,

may be run in parallel. The initialization y = 0, if desired, should be

performed before calling the procedure (and can be done with a parallel for loop).

P-MAT-VEC(A, x, y, n)
1  parallel for i = 1 to n      // parallel loop
2      for j = 1 to n           // serial loop
3          yi = yi + aij xj
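In a real task-parallel language the parallel for would be compiled into spawns, but its effect can be sketched in Python with a thread pool standing in for the parallel loop (this mapping is our illustration, not part of the pseudocode; y is updated in place, 0-indexed):

```python
from concurrent.futures import ThreadPoolExecutor

def p_mat_vec(A, x, y, n):
    """Compute y = y + A x. Each iteration i writes only y[i], so the
    outer iterations are mutually noninterfering and may run in parallel."""
    def body(i):                 # one iteration of the parallel for loop
        for j in range(n):       # the serial inner loop
            y[i] += A[i][j] * x[j]
    with ThreadPoolExecutor() as pool:
        list(pool.map(body, range(n)))   # "parallel for i = 1 to n"
```

With A = [[1, 2], [3, 4]], x = [1, 1], and y = [0, 0], the call p_mat_vec(A, x, y, 2) leaves y = [3, 7].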

Compilers for fork-join parallel programs can implement parallel for

loops in terms of spawn and sync by using recursive spawning. For

example, for the parallel for loop in lines 1–3, a compiler can generate

the auxiliary subroutine P-MAT-VEC-RECURSIVE and call P-MAT-

VEC-RECURSIVE ( A, x, y, n, 1, n) in the place where the loop would be in the compiled code. As Figure 26.4 illustrates, this procedure recursively spawns the first half of the iterations of the loop to execute

in parallel (line 5) with the second half of the iterations (line 6) and then

executes a sync (line 7), thereby creating a binary tree of parallel

execution. Each leaf represents a base case, which is the serial for loop

of lines 2–3.

P-MAT-VEC-RECURSIVE(A, x, y, n, i, i′)
1  if i == i′                   // just one iteration to do?
2      for j = 1 to n           // mimic P-MAT-VEC serial loop
3          yi = yi + aij xj
4  else mid = ⌊(i + i′)/2⌋      // parallel divide-and-conquer
5      spawn P-MAT-VEC-RECURSIVE(A, x, y, n, i, mid)
6      P-MAT-VEC-RECURSIVE(A, x, y, n, mid + 1, i′)
7      sync
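The recursive-spawning transformation can likewise be sketched with explicit threads, one per spawn, with thread creation standing in for spawn and join for sync (an illustrative translation, 0-indexed, not the compiler's actual output):

```python
import threading

def p_mat_vec_recursive(A, x, y, n, i, i2):
    """Recursive halving of the iteration range [i, i2] (0-indexed)."""
    if i == i2:                            # just one iteration to do?
        for j in range(n):                 # mimic the serial inner loop
            y[i] += A[i][j] * x[j]
    else:
        mid = (i + i2) // 2
        t = threading.Thread(target=p_mat_vec_recursive,
                             args=(A, x, y, n, i, mid))
        t.start()                          # "spawn" the first half
        p_mat_vec_recursive(A, x, y, n, mid + 1, i2)  # call the second half
        t.join()                           # "sync"
```

Because the two halves touch disjoint entries of y, the spawned thread and the parent are mutually noninterfering, so the result equals that of the serial projection.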

To calculate the work T1(n) of P-MAT-VEC on an n × n matrix, simply compute the running time of its serial projection, which comes from replacing the parallel for loop in line 1 with an ordinary for loop. The running time of the resulting serial pseudocode is Θ(n²), which means that T1(n) = Θ(n²). This analysis seems to ignore the overhead for recursive spawning in implementing the parallel loops, however.

Indeed, the overhead of recursive spawning does increase the work of a

parallel loop compared with that of its serial projection, but not

asymptotically. To see why, observe that since the tree of recursive

procedure instances is a full binary tree, the number of internal nodes is

one less than the number of leaves (see Exercise B.5-3 on page 1175).

Each internal node performs constant work to divide the iteration

range, and each leaf corresponds to a base case, which takes at least

constant time (Θ( n) time in this case). Thus, by amortizing the overhead

of recursive spawning over the work of the iterations in the leaves, we

see that the overall work increases by at most a constant factor.

Figure 26.4 A trace for the computation of P-MAT-VEC-RECURSIVE ( A, x, y, 8, 1, 8). The two numbers within each rounded rectangle give the values of the last two parameters ( i and i′ in the procedure header) in the invocation (spawn, in blue, or call, in tan) of the procedure. The blue circles represent strands corresponding to the part of the procedure up to the spawn of P-MAT-VEC-RECURSIVE in line 5. The orange circles represent strands corresponding to the

part of the procedure that calls P-MAT-VEC-RECURSIVE in line 6 up to the sync in line 7, where it suspends until the spawned subroutine in line 5 returns. The white circles represent strands corresponding to the (negligible) part of the procedure after the sync up to the point where it returns.

To reduce the overhead of recursive spawning, task-parallel

platforms sometimes coarsen the leaves of the recursion by executing

several iterations in a single leaf, either automatically or under

programmer control. This optimization comes at the expense of

reducing the parallelism. If the computation has sufficient parallel

slackness, however, near-perfect linear speedup won’t be sacrificed.

Although recursive spawning doesn’t affect the work of a parallel

loop asymptotically, we must take it into account when analyzing the

span. Consider a parallel loop with n iterations in which the i th iteration has span iter∞( i). Since the depth of recursion is logarithmic in the number of iterations, the parallel loop’s span is

T∞(n) = Θ(lg n) + max {iter∞(i) : 1 ≤ i ≤ n}.

For example, let’s compute the span of the doubly nested loops in

lines 1–3 of P-MAT-VEC. The span for the parallel for loop control is

Θ(lg n). For each iteration of the outer parallel loop, the inner serial for

loop contains n iterations of line 3. Since each iteration takes constant

time, the total span for the inner serial for loop is Θ( n), no matter which

iteration of the outer parallel for loop it’s in. Thus, taking the maximum

over all iterations of the outer loop and adding in the Θ(lg n) for loop

control yields an overall span of T∞(n) = Θ(n) + Θ(lg n) = Θ(n) for the procedure. Since the work is Θ(n²), the parallelism is Θ(n²)/Θ(n) = Θ(n).

(Exercise 26.1-7 asks you to provide an implementation with even more

parallelism.)

Race conditions

A parallel algorithm is deterministic if it always does the same thing on

the same input, no matter how the instructions are scheduled on the

multicore computer. It is nondeterministic if its behavior might vary

from run to run when the input is the same. A parallel algorithm that is

intended to be deterministic may nevertheless act nondeterministically,

however, if it contains a difficult-to-diagnose bug called a “determinacy

race.”

Famous race bugs include the Therac-25 radiation therapy machine,

which killed three people and injured several others, and the Northeast

Blackout of 2003, which left over 50 million people in the United States

without power. These pernicious bugs are notoriously hard to find. You


can run tests in the lab for days without a failure, only to discover that

your software sporadically crashes in the field, sometimes with dire

consequences.

A determinacy race occurs when two logically parallel instructions

access the same memory location and at least one of the instructions

modifies the value stored in the location. The toy procedure RACE-

EXAMPLE on the following page illustrates a determinacy race. After

initializing x to 0 in line 1, RACE-EXAMPLE creates two parallel

strands, each of which increments x in line 3. Although it might seem

that a call of RACE-EXAMPLE should always print the value 2 (its

serial projection certainly does), it could instead print the value 1. Let’s

see how this anomaly might occur.

When a processor increments x, the operation is not indivisible, but

is composed of a sequence of instructions:

Figure 26.5 Illustration of the determinacy race in RACE-EXAMPLE. (a) A trace showing the dependencies among individual instructions. The processor registers are r 1 and r 2. Instructions unrelated to the race, such as the implementation of loop control, are omitted. (b) An execution sequence that elicits the bug, showing the values of x in memory and registers r 1 and r 2 for each step in the execution sequence.

RACE-EXAMPLE()
1  x = 0
2  parallel for i = 1 to 2
3      x = x + 1        // determinacy race
4  print x

1. Load x from memory into one of the processor’s registers.
2. Increment the value in the register.
3. Store the value in the register back into x in memory.
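The lost update can be reproduced deterministically by replaying interleavings of these three instructions for two "processors." This simulation is our own sketch of the execution in Figure 26.5, with the interleaving supplied explicitly as a schedule:

```python
def run(schedule):
    """Replay an interleaving of the two increments in RACE-EXAMPLE.
    schedule lists which processor (1 or 2) executes its next instruction;
    each processor runs load, increment, store, in that order."""
    x = 0                        # shared memory
    regs = {1: 0, 2: 0}          # registers r1 and r2
    pc = {1: 0, 2: 0}            # each processor's next instruction
    for p in schedule:
        if pc[p] == 0:
            regs[p] = x          # load x into the register
        elif pc[p] == 1:
            regs[p] += 1         # increment the register
        elif pc[p] == 2:
            x = regs[p]          # store the register back into x
        pc[p] += 1
    return x

assert run([1, 1, 1, 2, 2, 2]) == 2   # trivial interleaving: correct
assert run([1, 2, 1, 2, 1, 2]) == 1   # nontrivial: one update is lost
```

Every schedule in which one processor finishes all three instructions before the other starts yields 2; any schedule in which both loads precede both stores loses an update and yields 1.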

Figure 26.5(a) illustrates a trace representing the execution of RACE-EXAMPLE, with the strands broken down to individual instructions.

Recall that since an ideal parallel computer supports sequential

consistency, you can view the parallel execution of a parallel algorithm

as an interleaving of instructions that respects the dependencies in the

trace. Part (b) of the figure shows the values in an execution of the

computation that elicits the anomaly. The value x is kept in memory, and r 1 and r 2 are processor registers. In step 1, one of the processors sets x to 0. In steps 2 and 3, processor 1 loads x from memory into its register r 1 and increments it, producing the value 1 in r 1. At that point, processor 2 comes into the picture, executing instructions 4–6. Processor

2 loads x from memory into register r 2; increments it, producing the value 1 in r 2; and then stores this value into x, setting x to 1. Now, processor 1 resumes with step 7, storing the value 1 in r 1 into x, which leaves the value of x unchanged. Therefore, step 8 prints the value 1, rather than the value 2 that the serial projection would print.

Let’s recap what happened. By sequential consistency, the effect of

the parallel execution is as if the executed instructions of the two

processors are interleaved. If processor 1 executes all its instructions

before processor 2, a trivial interleaving, the value 2 is printed.

Conversely, if processor 2 executes all its instructions before processor 1,

the value 2 is still printed. When the instructions of the two processors

interleave nontrivially, however, it is possible, as in this example

execution, that one of the updates to x is lost, resulting in the value 1

being printed.

Of course, many executions do not elicit the bug. That’s the problem

with determinacy races. Generally, most instruction orderings produce

correct results, such as any in which the instructions on the left branch

execute before the instructions on the right branch, or vice versa. But some orderings generate improper results when the instructions

interleave. Consequently, races can be extremely hard to test for. Your

program may fail, but you may be unable to reliably reproduce the

failure in subsequent tests, confounding your attempts to locate the bug

in your code and fix it. Task-parallel programming environments often

provide race-detection productivity tools to help you isolate race bugs.

Many parallel programs in the real world are intentionally

nondeterministic. They contain determinacy races, but they mitigate the

dangers of nondeterminism through the use of mutual-exclusion locks

and other methods of synchronization. For our purposes, however, we’ll

insist on an absence of determinacy races in the algorithms we develop.

Nondeterministic programs are indeed interesting, but nondeterministic

programming is a more advanced topic and unnecessary for a wide

swath of interesting parallel algorithms.

To ensure that algorithms are deterministic, any two strands that

operate in parallel should be mutually noninterfering: they only read, and do not modify, any memory locations accessed by both of them.

Consequently, in a parallel for construct, such as the outer loop of P-

MAT-VEC, we want all the iterations of the body, including any code

an iteration executes in subroutines, to be mutually noninterfering. And

between a spawn and its corresponding sync, we want the code executed

by the spawned child and the code executed by the parent to be

mutually noninterfering, once again including invoked subroutines.

As an example of how easy it is to write code with unintentional