7 return δ

The running time of COMPUTE-TRANSITION-FUNCTION is

O( m³ |∑|), because the outer loops contribute a factor of m |∑|, the inner while loop can run at most m + 1 times, and the test on line 4 for whether P[: k] is a suffix of P[: q] a can require comparing up to m characters.

Much faster procedures exist. By utilizing some cleverly computed

information about the pattern P (see Exercise 32.4-8), the time required

to compute δ from P improves to O( m |∑|). This improved procedure for computing δ provides a way to find all occurrences of a length- m pattern in a length- n text over an alphabet ∑ with O( m |∑|) preprocessing time and Θ( n) matching time.
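As a concrete (if unoptimized) illustration, here is a direct Python transliteration of the straightforward approach described above. The function name and the representation of δ as a list of per-state dictionaries are my own choices, not the book's:

```python
def compute_transition_function(P, sigma):
    """Naive delta computation, O(m^3 |sigma|) in the worst case.

    delta[q][a] is the length of the longest prefix of P that is
    a suffix of P[:q] + a, i.e., the automaton's next state.
    """
    m = len(P)
    delta = [{} for _ in range(m + 1)]
    for q in range(m + 1):
        for a in sigma:
            k = min(m, q + 1)
            # Find the largest k such that P[:k] is a suffix of P[:q] + a.
            while not (P[:q] + a).endswith(P[:k]):
                k -= 1
            delta[q][a] = k
    return delta
```

For the pattern ababaca, this yields, for example, δ(5, c) = 6, δ(5, b) = 4, and δ(5, a) = 1, matching the worked example later in the chapter.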

Exercises

32.3-1

Draw a state-transition diagram for the string-matching automaton for

the pattern P = aabab over the alphabet ∑ = {a, b} and illustrate its

operation on the text string T = aaababaabaababaab.

32.3-2

Draw a state-transition diagram for the string-matching automaton for

the pattern P = ababbabbababbababbabb over the alphabet ∑ = {a,

b}.

32.3-3

A pattern P is nonoverlappable if P[: k] ⊐ P[: q] implies k = 0 or k = q.

Describe the state-transition diagram of the string-matching automaton

for a nonoverlappable pattern.

32.3-4

Let x and y be prefixes of the pattern P. Prove that x ⊐ y implies σ( x) ≤ σ( y).

32.3-5

Given two patterns P and P′, describe how to construct a finite automaton that determines all occurrences of either pattern. Try to

minimize the number of states in your automaton.

32.3-6

Given a pattern P containing gap characters (see Exercise 32.1-4), show

how to build a finite automaton that can find an occurrence of P in a

text T in O( n) matching time, where n = | T|.

★ 32.4 The Knuth-Morris-Pratt algorithm

Knuth, Morris, and Pratt developed a linear-time string matching

algorithm that avoids computing the transition function δ altogether.

Instead, the KMP algorithm uses an auxiliary function π, which it

precomputes from the pattern in Θ( m) time and stores in an array

π[1: m]. The array π allows the algorithm to compute the transition function δ efficiently (in an amortized sense) “on the fly” as needed.

Loosely speaking, for any state q = 0, 1, …, m and any character a ∈ ∑, the value π[ q] contains the information needed to compute δ( q, a) but that does not depend on a. Since the array π has only m entries, whereas δ has Θ( m |∑|) entries, the KMP algorithm saves a factor of |∑| in the preprocessing time by computing π rather than δ. Like the procedure FINITE-AUTOMATON-MATCHER, once preprocessing has

completed, the KMP algorithm uses Θ( n) matching time.

The prefix function for a pattern

The prefix function π for a pattern encapsulates knowledge about how

the pattern matches against shifts of itself. The KMP algorithm takes

advantage of this information to avoid testing useless shifts in the naive

pattern-matching algorithm and to avoid precomputing the full

transition function δ for a string-matching automaton.

Consider the operation of the naive string matcher. Figure 32.9(a)

shows a particular shift s of a template containing the pattern P =

ababaca against a text T. For this example, q = 5 of the characters


have matched successfully, but the 6th pattern character fails to match

the corresponding text character. The information that q characters

have matched successfully determines the corresponding text characters.

Because these q text characters match, certain shifts must be invalid. In

the example of the figure, the shift s + 1 is necessarily invalid, since the

first pattern character (a) would be aligned with a text character that

does not match the first pattern character, but does match the second

pattern character (b). The shift s′ = s + 2 shown in part (b) of the figure, however, aligns the first three pattern characters with three text

characters that necessarily match.

More generally, suppose that you know that P[: q] ⊐ T[: s + q] or, equivalently, that P[1: q] = T[ s + 1: s + q]. You want to shift P so that some shorter prefix P[: k] of P matches a suffix of T[: s + q], if possible.

You might have more than one choice for how much to shift, however.

In Figure 32.9(b), shifting P by 2 positions works, so that P[:3] ⊐ T[: s +

q], but so does shifting P by 4 positions, so that P[:1] ⊐ T[: s + q] in

Figure 32.9(c). If more than one shift amount works, you should choose the smallest shift amount so that you do not miss any potential matches.

Put more precisely, you want to answer this question:

Given that pattern characters P[1: q] match text characters T[ s +

1: s + q] (that is, P[: q] ⊐ T[: s + q]), what is the least shift s′ > s such that for some k < q,

P[1: k] = T[ s′ + 1: s′ + k]  (32.6)

(that is, P[: k] ⊐ T[: s′ + k]), where s′ + k = s + q?

Here’s another way to look at this question. If you know P[: q] ⊐ T[: s

+ q], then how do you find the longest proper prefix P[: k] of P[: q] that is also a suffix of T[: s + q]? These questions are equivalent because given s and q, requiring s′ + k = s + q means that finding the smallest shift s′ (2

in Figure 32.9(b)) is tantamount to finding the longest prefix length k (3

in Figure 32.9(b)). If you add the difference q − k in the lengths of these prefixes of P to the shift s, you get the new shift s′, so that s′ = s + ( q − k). In the best case, k = 0, so that s′ = s + q, immediately ruling out

shifts s + 1, s + 2, …, s + q − 1. In any case, at the new shift s′, it is redundant to compare the first k characters of P with the corresponding characters of T, since equation (32.6) guarantees that they match.

As Figure 32.9(d) demonstrates, you can precompute the necessary

information by comparing the pattern against itself. Since T[ s′ + 1: s′ +

k] is part of the matched portion of the text, it is a suffix of the string

P[: q]. Therefore, think of equation (32.6) as asking for the greatest k < q such that P[: k] ⊐ P[: q]. Then, the new shift s′ = s + ( q − k) is the next potentially valid shift. It will be convenient to store, for each value of q,

the number k of matching characters at the new shift s′, rather than storing, say, the amount s′ – s to shift by.

Let’s look at the precomputed information a little more formally. For

a given pattern P[1: m], the prefix function for P is the function π : {1, 2,

…, m} → {0, 1, …, m – 1} such that

π[ q] = max{ k : k < q and P[: k] ⊐ P[: q]}.

That is, π[ q] is the length of the longest prefix of P that is a proper suffix of P[: q]. Here is the complete prefix function π for the pattern ababaca:

q      1  2  3  4  5  6  7
P[ q]  a  b  a  b  a  c  a
π[ q]  0  0  1  2  3  0  1


Figure 32.9 The prefix function π. (a) The pattern P = ababaca aligns with a text T so that the first q = 5 characters match. Matching characters, in blue, are connected by blue lines. (b) Knowing these particular 5 matched characters ( P[:5]) suffices to deduce that a shift of s + 1 is invalid, but that a shift of s′ = s + 2 is consistent with everything known about the text and therefore is potentially valid. The prefix P[: k], where k = 3, aligns with the text seen so far. (c) A shift of s + 4 is also potentially valid, but it leaves only the prefix P[:1] aligned with the text seen so far. (d) To precompute useful information for such deductions, compare the pattern with itself. Here, the longest prefix of P that is also a proper suffix of P[:5] is P[:3]. The array π

represents this precomputed information, so that π[5] = 3. Given that q characters have matched successfully at shift s, the next potentially valid shift is at s′ = s + ( qπ[ q]) as shown in part (b).

The procedure KMP-MATCHER on the following page gives the

Knuth-Morris-Pratt matching algorithm. The procedure follows from

FINITE-AUTOMATON-MATCHER for the most part. To compute

π, KMP-MATCHER calls the auxiliary procedure COMPUTE-

PREFIX-FUNCTION. These two procedures have much in common,

because both match a string against the pattern P: KMP-MATCHER

matches the text T against P, and COMPUTE-PREFIX-FUNCTION

matches P against itself.

Next, let’s analyze the running times of these procedures. Then we’ll

prove them correct, which will be more complicated.

Running-time analysis

The running time of COMPUTE-PREFIX-FUNCTION is Θ( m),

which we show by using the aggregate method of amortized analysis

(see Section 16.1). The only tricky part is showing that the while loop of lines 5–6 executes O( m) times altogether. Starting with some

observations about k, we’ll show that it makes at most m–1 iterations.

First, line 3 starts k at 0, and the only way that k increases is by the increment operation in line 8, which executes at most once per iteration

of the for loop of lines 4–9. Thus, the total increase in k is at most m–1.

Second, since k < q upon entering the for loop and each iteration of the loop increments q, we always have k < q. Therefore, the assignments in lines 2 and 9 ensure that π[ q] < q for all q = 1, 2, …, m, which means that each iteration of the while loop decreases k. Third, k never becomes negative. Putting these facts together, we see that the total decrease in k

from the while loop is bounded from above by the total increase in k

over all iterations of the for loop, which is m – 1. Thus, the while loop

iterates at most m – 1 times in all, and COMPUTE-PREFIX-

FUNCTION runs in Θ( m) time.

KMP-MATCHER( T, P, n, m)
 1  π = COMPUTE-PREFIX-FUNCTION( P, m)
 2  q = 0                              // number of characters matched
 3  for i = 1 to n                     // scan the text from left to right
 4      while q > 0 and P[ q + 1] ≠ T[ i]
 5          q = π[ q]                  // next character does not match
 6      if P[ q + 1] == T[ i]
 7          q = q + 1                  // next character matches
 8      if q == m                      // is all of P matched?
 9          print “Pattern occurs with shift” i – m
10          q = π[ q]                  // look for the next match

COMPUTE-PREFIX-FUNCTION( P, m)
 1  let π[1: m] be a new array
 2  π[1] = 0
 3  k = 0
 4  for q = 2 to m
 5      while k > 0 and P[ k + 1] ≠ P[ q]
 6          k = π[ k]
 7      if P[ k + 1] == P[ q]
 8          k = k + 1
 9      π[ q] = k
10  return π

Exercise 32.4-4 asks you to show, by a similar aggregate analysis,

that the matching time of KMP-MATCHER is Θ( n).
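Translated into 0-indexed Python, the two procedures might look as follows. This is a sketch: the book's pseudocode is 1-indexed, so π[ q] lives at pi[q − 1] here, and the generator-based interface is my own choice:

```python
def compute_prefix_function(P):
    """Theta(m) prefix function; pi[q-1] holds the book's pi[q]."""
    m = len(P)
    pi = [0] * m
    k = 0                              # number of characters matched
    for q in range(1, m):              # 0-indexed position in P
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]
        if P[k] == P[q]:
            k += 1
        pi[q] = k
    return pi

def kmp_matcher(T, P):
    """Yield each 0-indexed shift at which P occurs in T, in Theta(n) time."""
    pi = compute_prefix_function(P)
    q = 0                              # number of characters matched
    for i, c in enumerate(T):
        while q > 0 and P[q] != c:     # next character does not match
            q = pi[q - 1]
        if P[q] == c:                  # next character matches
            q += 1
        if q == len(P):                # all of P matched
            yield i - len(P) + 1
            q = pi[q - 1]              # look for the next match
```

Running it on the text and pattern of Exercise 32.3-1 finds the pattern aabab at shifts 1 and 9 (0-indexed) in aaababaabaababaab.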

Figure 32.10 An illustration of Lemma 32.5 for the pattern P = ababaca and q = 5. (a) The π

function for the given pattern. Since π[5] = 3, π[3] = 1, and π[1]= 0, iterating π gives π*[5] = {3, 1, 0}. (b) Sliding the template containing the pattern P to the right and noting when some prefix P[: k] of P matches up with some proper suffix of P[:5]. Matches occur when k = 3, 1, and 0. In the figure, the first row gives P, and the vertical red line is drawn just after P[:5]. Successive rows show all the shifts of P that cause some prefix P[: k] of P to match some suffix of P[:5].

Successfully matched characters are shown in blue. Blue lines connect aligned matching characters. Thus, { k : k < 5 and P[: k] ⊐ P[:5]} = {3, 1, 0}. Lemma 32.5 claims that π*[ q] = { k : k

< q and P[: k] ⊐ P[: q]} for all q.

Compared with FINITE-AUTOMATON-MATCHER, by using π

rather than δ, the KMP algorithm reduces the time for preprocessing

the pattern from O( m |∑|) to Θ( m), while keeping the actual matching time bounded by Θ( n).

Correctness of the prefix-function computation

We’ll see a little later that the prefix function π helps to simulate the transition function δ in a string-matching automaton. But first, we need


to prove that the procedure COMPUTE-PREFIX-FUNCTION does

indeed compute the prefix function correctly. Doing so requires finding

all prefixes P[: k] that are proper suffixes of a given prefix P[: q]. The value of π[ q] gives us the length of the longest such prefix, but the following lemma, illustrated in Figure 32.10, shows that iterating the prefix function π generates all the prefixes P[: k] that are proper suffixes of P[: q]. Let

π*[ q] = { π[ q], π^(2)[ q], π^(3)[ q], …, π^( t)[ q]},

where π^( i)[ q] is defined in terms of functional iteration, so that π^(0)[ q] = q and π^( i)[ q] = π[ π^( i−1)[ q]] for i ≥ 1 (so that π[ q] = π^(1)[ q]), and where the sequence in π*[ q] stops upon reaching π^( t)[ q] = 0 for some t ≥ 1.
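This functional iteration is easy to carry out in code. The following sketch (names mine) uses a 1-indexed π list, with the prefix-function values of ababaca hard-coded from the table above:

```python
def pi_star(pi, q):
    """Iterate the prefix function: pi*[q] = [pi[q], pi[pi[q]], ..., 0].

    pi is 1-indexed here (pi[0] is unused) to match the text.
    """
    result = [pi[q]]
    while result[-1] > 0:
        result.append(pi[result[-1]])
    return result

pi = [None, 0, 0, 1, 2, 3, 0, 1]   # prefix function of ababaca, 1-indexed
```

For q = 5 this produces [3, 1, 0], agreeing with Figure 32.10's π*[5] = {3, 1, 0}.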

Lemma 32.5 (Prefix-function iteration lemma)

Let P be a pattern of length m with prefix function π. Then, for q = 1, 2,

…, m, we have π*[ q] = { k : k < q and P[: k] ⊐ P[: q]}.

Proof We first prove that π*[ q] ⊆ { k : k < q and P[: k] ⊐ P[: q]} or, equivalently,

i ∈ π*[ q] implies P[: i] ⊐ P[: q].  (32.7)

If i ∈ π*[ q], then i = π^( u)[ q] for some u > 0. We prove equation (32.7) by induction on u. For u = 1, we have i = π[ q], and the claim follows since i < q and P[: π[ q]] ⊐ P[: q] by the definition of π. Now consider some u ≥ 1 such that both π^( u)[ q] and π^( u+1)[ q] belong to π*[ q]. Let i = π^( u)[ q], so that π[ i] = π^( u+1)[ q]. The inductive hypothesis is that P[: i] ⊐ P[: q]. Because the relations < and ⊐ are transitive, we have π[ i] < i < q and P[: π[ i]] ⊐ P[: i] ⊐ P[: q], which establishes equation (32.7) for all i in π*[ q]. Therefore, π*[ q] ⊆ { k : k < q and P[: k] ⊐ P[: q]}.

We now prove that { k : k < q and P[: k] ⊐ P[: q]} ⊆ π*[ q] by contradiction. Suppose to the contrary that the set { k : k < q and P[: k] ⊐ P[: q]} – π*[ q] is nonempty, and let j be the largest number in the set. Because π[ q] is the largest value in { k : k < q and P[: k] ⊐ P[: q]} and π[ q] ∈ π*[ q], it must be the case that j < π[ q]. Having established that π*[ q] contains at least one integer greater than j, let j′ denote the smallest such integer. (We can choose j′ = π[ q] if no other number in π*[ q] is greater than j.) We have P[: j] ⊐ P[: q] because j ∈ { k : k < q and P[: k] ⊐ P[: q]}, and from j′ ∈ π*[ q] and equation (32.7), we have P[: j′] ⊐ P[: q]. Thus, P[: j] ⊐ P[: j′] by Lemma 32.1, and j is the largest value less than j′ with this property. Therefore, we must have π[ j′] = j and, since j′ ∈ π*[ q], we must have j ∈ π*[ q] as well. This contradiction proves the lemma.

The algorithm COMPUTE-PREFIX-FUNCTION computes π[ q],

in order, for q = 1, 2, …, m. Setting π[1] to 0 in line 2 of COMPUTE-PREFIX-FUNCTION is certainly correct, since π[ q] < q for all q. We’ll use the following lemma and its corollary to prove that COMPUTE-PREFIX-FUNCTION computes π[ q] correctly for q > 1.

Lemma 32.6

Let P be a pattern of length m, and let π be the prefix function for P.

For q = 1, 2, …, m, if π[ q] > 0, then π[ q] – 1 ∈ π*[ q – 1].

Proof Let r = π[ q] > 0, so that r < q and P[: r] ⊐ P[: q], and thus, r – 1 < q – 1 and P[: r – 1] ⊐ P[: q – 1] (by dropping the last character from P[: r]

and P[: q], which we can do because r > 0). By Lemma 32.5, therefore, r

– 1 ∈ π*[ q – 1]. Thus, we have π[ q] – 1 = r – 1 ∈ π*[ q – 1].

For q = 2, 3, …, m, define the subset Eq–1 ⊆ π*[ q – 1] by

Eq–1 = { k ∈ π*[ q – 1] : P[ k + 1] = P[ q]}
     = { k : k < q – 1 and P[: k] ⊐ P[: q – 1] and P[ k + 1] = P[ q]}   (by Lemma 32.5)
     = { k : k < q – 1 and P[: k + 1] ⊐ P[: q]}.

The set Eq–1 consists of the values k < q – 1 for which P[: k] ⊐ P[: q – 1]

and for which, because P[ k + 1] = P[ q], we have P[: k + 1] ⊐ P[: q]. Thus,


Eq–1 consists of those values k ∈ π*[ q – 1] such that extending P[: k] to P[: k + 1] produces a proper suffix of P[: q].

Corollary 32.7

Let P be a pattern of length m, and let π be the prefix function for P.

Then, for q = 2, 3, …, m,

π[ q] = 0                       if Eq–1 = ∅,
π[ q] = 1 + max { k ∈ Eq–1}     if Eq–1 ≠ ∅.

Proof If Eq–1 is empty, there is no k ∈ π*[ q – 1] (including k = 0) such that extending P[: k] to P[: k + 1] produces a proper suffix of P[: q]. Therefore, π[ q] = 0.

If, instead, Eq–1 is nonempty, then for each k ∈ Eq–1, we have k + 1 < q and P[: k + 1] ⊐ P[: q]. Therefore, the definition of π[ q] gives

π[ q] ≥ 1 + max { k ∈ Eq–1}.  (32.8)

Note that π[ q] > 0. Let r = π[ q] – 1, so that r + 1 = π[ q] > 0, and therefore P[: r + 1] ⊐ P[: q]. If a nonempty string is a suffix of another, then the two strings must have the same last character. Since r + 1 > 0, the prefix P[: r + 1] is nonempty, and so P[ r + 1] = P[ q]. Furthermore, r ∈ π*[ q – 1] by Lemma 32.6. Therefore, r ∈ Eq–1, and so π[ q] – 1 = r ≤ max { k ∈ Eq–1} or, equivalently,

π[ q] ≤ 1 + max { k ∈ Eq–1}.  (32.9)

Combining equations (32.8) and (32.9) completes the proof.

We now finish the proof that COMPUTE-PREFIX-FUNCTION

computes π correctly. The key is to combine the definition of Eq–1 with the statement of Corollary 32.7, so that π[ q] equals 1 plus the greatest value of k in π*[ q – 1] such that P[ k + 1] = P[ q]. First, in COMPUTE-PREFIX-FUNCTION, k = π[ q – 1] at the start of each iteration of the for loop of lines 4–9. This condition is enforced by lines 2 and 3 when

the loop is first entered, and it remains true in each successive iteration

because of line 9. Lines 5–8 adjust k so that it becomes the correct value

of π[ q]. The while loop of lines 5–6 searches through all values k ∈ π*[ q – 1] in decreasing order to find the value of π[ q]. The loop terminates either because k reaches 0 or P[ k + 1] = P[ q]. Because the “and” operator short-circuits, if the loop terminates because P[ k + 1] = P[ q], then k must have also been positive, and so k is the greatest value in Eq–1. In this case, lines 7–9 set π[ q] to k + 1, according to Corollary 32.7. If, instead, the while loop terminates because k = 0, then there are two possibilities. If P[1] = P[ q], then Eq–1 = {0}, and lines 7–9 set both k and π[ q] to 1. If k = 0 and P[1] ≠ P[ q], however, then Eq–1 = ∅. In this case, line 9 sets π[ q] to 0, again according to Corollary 32.7, which completes the proof of the correctness of COMPUTE-PREFIX-FUNCTION.
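If you want to convince yourself that COMPUTE-PREFIX-FUNCTION agrees with the definition of π, a brute-force cross-check is easy to write. This helper (my own, not from the book) computes π directly from the definition in quadratic time:

```python
def prefix_function_by_definition(P):
    """Compute pi straight from its definition:
    pi[q] = max{k : k < q and P[:k] is a proper suffix of P[:q]}.
    Quadratic-time, but a handy cross-check on the Theta(m) procedure.
    Returns a 0-indexed list whose entry q-1 is the book's pi[q]."""
    m = len(P)
    pi = []
    for q in range(1, m + 1):
        # k = 0 always qualifies, since the empty prefix suffixes everything.
        pi.append(max(k for k in range(q) if P[:q].endswith(P[:k])))
    return pi
```

For ababaca this reproduces the table 0, 0, 1, 2, 3, 0, 1 given earlier.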

Correctness of the Knuth-Morris-Pratt algorithm

You can think of the procedure KMP-MATCHER as a reimplemented

version of the procedure FINITE-AUTOMATON-MATCHER, but

using the prefix function π to compute state transitions. Specifically, we’ll prove that in the i th iteration of the for loops of both KMP-MATCHER and FINITE-AUTOMATON-MATCHER, the state q has

the same value upon testing for equality with m (at line 8 in KMP-

MATCHER and at line 4 in FINITE-AUTOMATON-MATCHER).

Once we have argued that KMP-MATCHER simulates the behavior of

FINITE-AUTOMATON-MATCHER, the correctness of KMP-

MATCHER follows from the correctness of FINITE-AUTOMATON-

MATCHER (though we’ll see a little later why line 10 in KMP-

MATCHER is necessary).

Before formally proving that KMP-MATCHER correctly simulates

FINITE-AUTOMATON-MATCHER, let’s take a moment to

understand how the prefix function π replaces the δ transition function.

Recall that when a string-matching automaton is in state q and it scans

a character a = T[ i], it moves to a new state δ( q, a). If a = P[ q + 1], so that a continues to match the pattern, then the state number is

incremented: δ( q, a) = q + 1. Otherwise, a ≠ P[ q + 1], so that a does not continue to match the pattern, and the state number does not increase: 0 ≤ δ( q, a) ≤ q. In the first case, when a continues to match, KMP-MATCHER moves to state q + 1 without referring to the π function:

the while loop test in line 4 immediately comes up false, the test in line 6

comes up true, and line 7 increments q.

The π function comes into play when the character a does not continue to match the pattern, so that the new state δ( q, a) is either q or to the left of q along the spine of the automaton. The while loop of lines

4–5 in KMP-MATCHER iterates through the states in π*[ q], stopping

either when it arrives in a state, say q′, such that a matches P[ q′ + 1] or q

has gone all the way down to 0. If a matches P[ q′ + 1], then line 7 sets the new state to q′+1, which should equal δ( q, a) for the simulation to work correctly. In other words, the new state δ( q, a) should be either state 0 or a state numbered 1 more than some state in π*[ q].

Let’s look at the example in Figures 32.6 and 32.10, which are for the pattern P = ababaca. Suppose that the automaton is in state q = 5, having matched ababa. The states in π*[5] are, in descending order, 3,

1, and 0. If the next character scanned is c, then you can see that the

automaton moves to state δ(5, c) = 6 in both FINITE-AUTOMATON-

MATCHER (line 3) and KMP-MATCHER (line 7). Now suppose that

the next character scanned is instead b, so that the automaton should

move to state δ(5, b) = 4. The while loop in KMP-MATCHER exits

after executing line 5 once, and the automaton arrives in state q′ = π[5]

= 3. Since P[ q′ + 1] = P[4] = b, the test in line 6 comes up true, and the automaton moves to the new state q′ + 1 = 4 = δ(5, b). Finally, suppose that the next character scanned is instead a, so that the automaton

should move to state δ(5, a) = 1. The first three times that the test in line

4 executes, the test comes up true. The first time finds that P[6] = c ≠ a,

and the automaton moves to state π[5] = 3 (the first state in π*[5]). The second time finds that P[4] = b ≠ a, and the automaton moves to state

π[3] = 1 (the second state in π*[5]). The third time finds that P[2] = b ≠

a, and the automaton moves to state π[1] = 0 (the last state in π*[5]).

The while loop exits once it arrives in state q′ = 0. Now line 6 finds that

P[ q′ + 1] = P[1] = a, and line 7 moves the automaton to the new state q

+ 1 = 1 = δ(5, a).

Thus, the intuition is that KMP-MATCHER iterates through the

states in π*[ q] in decreasing order, stopping at some state q′ and then possibly moving to state q′+1. Although that might seem like a lot of

work just to simulate computing δ( q, a), bear in mind that asymptotically, KMP-MATCHER is no slower than FINITE-AUTOMATON-MATCHER.
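The transition simulation traced above can be captured in a few lines of Python. This is a sketch with my own helper name; π is hard-coded (1-indexed) for P = ababaca, and the guard for q = m mirrors the role of line 10 in KMP-MATCHER:

```python
P = "ababaca"
pi = [None, 0, 0, 1, 2, 3, 0, 1]     # 1-indexed prefix function of P

def delta_via_pi(q, a):
    """Compute delta(q, a) the way KMP-MATCHER does: follow pi until
    the next pattern character matches or state 0 is reached."""
    # P[q] (0-indexed) is P[q+1] in the book's 1-indexed terms.
    while q > 0 and (q == len(P) or P[q] != a):
        q = pi[q]
    if q < len(P) and P[q] == a:
        q += 1
    return q
```

Evaluating the three cases from the example gives delta_via_pi(5, "c") = 6, delta_via_pi(5, "b") = 4, and delta_via_pi(5, "a") = 1, exactly the values of δ(5, c), δ(5, b), and δ(5, a) above.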

We are now ready to formally prove the correctness of the Knuth-

Morris-Pratt algorithm. By Theorem 32.4, we have that q = σ( T[: i]) after each time line 3 of FINITE-AUTOMATON-MATCHER executes.

Therefore, it suffices to show that the same property holds with regard

to the for loop in KMP-MATCHER. The proof proceeds by induction

on the number of loop iterations. Initially, both procedures set q to 0 as

they enter their respective for loops for the first time. Consider iteration

i of the for loop in KMP-MATCHER. By the inductive hypothesis, the

state number q equals σ( T[: i – 1]) at the start of the loop iteration. We need to show that when line 8 is reached, the new value of q is σ( T[: i]).

(Again, we’ll handle line 10 separately.)

Considering q to be the state number at the start of the for loop

iteration, when KMP-MATCHER considers the character T[ i], the

longest prefix of P that is a suffix of T[: i] is either P[: q + 1] (if P[ q + 1] =

T[ i]) or some prefix (not necessarily proper, and possibly empty) of P[: q]. We consider separately the three cases in which σ( T[: i]) = 0, σ( T[: i]) = q + 1, and 0 < σ( T[: i]) ≤ q.

If σ( T[: i]) = 0, then P[:0] = ϵ is the only prefix of P that is a suffix of T[: i]. The while loop of lines 4–5 iterates through each value q′ ∈ π*[ q], but although P[: q′] ⊐ P[: q] ⊐ T[: i – 1] for every q′ ∈ π*[ q] (because < and ⊐ are transitive relations), the loop never finds a q′ such that P[ q′ + 1] = T[ i]. The loop terminates when q reaches 0, and of course line 7 does not execute. Therefore, q = 0 at line 8, so that now q = σ( T[: i]).

If σ( T[: i]) = q+1, then P[ q+1] = T[ i], and the while loop test in line 4 fails the first time through. Line 7 executes, incrementing the

state number to q + 1, which equals σ( T[: i]).

If 0 < σ( T[: i]) ≤ q, then the while loop of lines 4–5 iterates at least once, checking in decreasing order each value in π*[ q] until it

stops at some q′ < q. Thus, P[: q′] is the longest prefix of P[: q] for which P[ q′ + 1] = T[ i], so that when the while loop terminates, q′ +

1 = σ( P[: q] T[ i]). Since q = σ( T[: i – 1]), Lemma 32.3 implies that σ( T[: i – 1] T[ i]) = σ( P[: q] T[ i]). Thus we have q′ + 1 = σ( P[: q] T[ i])

= σ( T[: i – 1] T[ i])

= σ( T[: i])

when the while loop terminates. After line 7 increments q, the new

state number q equals σ( T[: i]).

Line 10 is necessary in KMP-MATCHER, because otherwise, line 4

might try to reference P[ m + 1] after finding an occurrence of P. (The argument that q = σ( T[: i – 1]) upon the next execution of line 4 remains valid by the hint given in Exercise 32.4-8: that δ( m, a) = δ( π[ m], a) or, equivalently, σ( Pa) = σ( P[: π[ m]] a) for any a ∈ ∑.) The remaining argument for the correctness of the Knuth-Morris-Pratt algorithm

follows from the correctness of FINITE-AUTOMATON-MATCHER,

since we have shown that KMP-MATCHER simulates the behavior of

FINITE-AUTOMATON-MATCHER.

Exercises

32.4-1

Compute the prefix function π for the pattern ababbabbabbababbabb.

32.4-2

Give an upper bound on the size of π*[ q] as a function of q. Give an example to show that your bound is tight.

32.4-3

Explain how to determine the occurrences of pattern P in the text T by

examining the π function for the string PT (the string of length m+ n that is the concatenation of P and T).

32.4-4


Use an aggregate analysis to show that the running time of KMP-

MATCHER is Θ( n).

32.4-5

Use a potential function to show that the running time of KMP-

MATCHER is Θ( n).

32.4-6

Show how to improve KMP-MATCHER by replacing the occurrence of π in line 5 (but not line 10) by π′, where π′ is defined recursively for q = 1, 2, …, m – 1 by the equation

π′[ q] = 0             if π[ q] = 0,
π′[ q] = π′[ π[ q]]    if π[ q] ≠ 0 and P[ π[ q] + 1] = P[ q + 1],
π′[ q] = π[ q]         if π[ q] ≠ 0 and P[ π[ q] + 1] ≠ P[ q + 1].

Explain why the modified algorithm is correct, and explain in what

sense this change constitutes an improvement.

32.4-7

Give a linear-time algorithm to determine whether a text T is a cyclic

rotation of another string T′. For example, braze and zebra are cyclic

rotations of each other.

32.4-8

Give an O( m |∑|)-time algorithm for computing the transition function δ

for the string-matching automaton corresponding to a given pattern P.

( Hint: Prove that δ( q, a) = δ( π[ q], a) if q = m or P[ q + 1] ≠ a.)

32.5 Suffix arrays

The algorithms we have seen thus far in this chapter can efficiently find

all occurrences of a pattern in a text. That is, however, all they can do.

This section presents a different approach—suffix arrays—with which

you can find all occurrences of a pattern in a text, but also quite a bit

more. A suffix array won’t find all occurrences of a pattern as quickly as,


say, the Knuth-Morris-Pratt algorithm, but its additional flexibility

makes it well worth studying.

Figure 32.11 The suffix array SA, rank array rank, longest common prefix array LCP, and lexicographically sorted suffixes of the text T = ratatat with length n = 7. The value of rank[ i]

indicates the position of the suffix T[ i:] in the lexicographically sorted order: rank[ SA[ i]] = i for i

= 1, 2, …, n. The rank array is used to compute the LCP array.

A suffix array is simply a compact way to represent the

lexicographically sorted order of all n suffixes of a length- n text. Given a text T[1: n], let T[ i:] denote the suffix T[ i: n]. The suffix array SA[1: n] of T

is defined such that if SA[ i] = j, then T[ j:] is the i th suffix of T in lexicographic order. 3 That is, the i th suffix of T in lexicographic order is T[ SA[ i]:]. Along with the suffix array, another useful array is the longest common prefix array LCP[1: n]. The entry LCP[ i] gives the length of the longest common prefix between the i th and ( i – 1)st suffixes in the sorted order (with LCP[1] defined to be 0, since there is no suffix lexicographically smaller than T[ SA[1]:]). Figure 32.11 shows the suffix array and longest common prefix array for the 7-character text

ratatat.
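For a text as short as ratatat, you can build both arrays by brute force. The following Python sketch (function name mine) reproduces Figure 32.11; it returns 1-based index values in plain Python lists, so the list entry at Python index i − 1 is the book's SA[ i] or LCP[ i]:

```python
def suffix_and_lcp_arrays(T):
    """Build SA and LCP by sorting all suffixes directly: fine for short
    texts, but O(n^2 lg n) in general.  SA holds 1-based start indices."""
    n = len(T)
    SA = sorted(range(1, n + 1), key=lambda j: T[j - 1:])
    LCP = [0] * n                      # LCP[0] = 0 by convention
    for i in range(1, n):
        a, b = T[SA[i - 1] - 1:], T[SA[i] - 1:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        LCP[i] = k                     # lcp of adjacent sorted suffixes
    return SA, LCP
```

On ratatat this yields SA = [6, 4, 2, 1, 7, 5, 3] and LCP = [0, 2, 4, 0, 0, 1, 3], matching the figure.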

Given the suffix array for a text, you can search for a pattern via

binary search on the suffix array. Each occurrence of a pattern in the

text starts some suffix of the text, and because the suffix array is in

lexicographically sorted order, all occurrences of a pattern will appear at

the start of consecutive entries of the suffix array. For example, in

Figure 32.11, the three occurrences of at in ratatat appear in entries 1 through 3 of the suffix array. If you find the length- m pattern in

the length- n suffix array via binary search (taking O( m lg n) time because each comparison takes O( m) time), then you can find all occurrences of the pattern in the text by searching backward and


forward from that spot until you find a suffix that does not start with

the pattern (or you go beyond the bounds of the suffix array). If the

pattern occurs k times, then the time to find all k occurrences is O( m lg n + km).
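The binary search just described might look like this in Python. It is a sketch with my own names, assuming a 1-based SA list such as the one in Figure 32.11; for simplicity it materializes suffix strings, so each comparison is O( m + …) rather than the tight O( m):

```python
def find_all_occurrences(T, P, SA):
    """All 1-based positions where P occurs in T, via binary search on
    the 1-based suffix array SA, then scanning the run of matches."""
    n = len(T)
    suffix = lambda i: T[SA[i] - 1:]   # i-th smallest suffix (0-indexed i)
    # Binary-search for the first suffix that is >= P.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if suffix(mid) < P:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes starting with P occupy consecutive entries from here.
    result = []
    while lo < n and suffix(lo).startswith(P):
        result.append(SA[lo])
        lo += 1
    return sorted(result)
```

For T = ratatat with SA = [6, 4, 2, 1, 7, 5, 3], searching for at returns positions 2, 4, and 6, i.e., the three occurrences found in entries 1 through 3 of the suffix array.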

With the longest common prefix array, you can find a longest

repeated substring, that is, the longest substring that occurs more than

once in the text. If LCP[ i] contains a maximum value in the LCP array, then a longest repeated substring appears in T[ SA[ i]: SA[ i] + LCP[ i] – 1].

In the example of Figure 32.11, the LCP array has one maximum value: LCP[3] = 4. Therefore, since SA[3] = 2, the longest repeated substring is T[2:5] = atat. Exercise 32.5-3 asks you to use the suffix array and longest common prefix array to find the longest common substrings

between two texts. Next, we’ll see how to compute the suffix array for an

n-character text in O( n lg n) time and, given the suffix array and the text, how to compute the longest common prefix array in Θ( n) time.
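Reading off a longest repeated substring from the LCP array takes only a few lines. This sketch (name mine) takes the SA and LCP values of Figure 32.11 as 1-based values stored in ordinary Python lists:

```python
def longest_repeated_substring(T, SA, LCP):
    """A longest substring of T that occurs at least twice, read off a
    maximum entry of the LCP array.  SA holds 1-based start indices."""
    i = max(range(len(LCP)), key=lambda j: LCP[j])
    if LCP[i] == 0:
        return ""                      # no substring of T repeats
    start = SA[i]                      # 1-based start of one occurrence
    return T[start - 1 : start - 1 + LCP[i]]
```

For ratatat, the maximum LCP value 4 occurs alongside SA entry 2, so the function returns T[2:5] = atat, as in the text.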

Computing the suffix array

There are several algorithms to compute the suffix array of a length- n

text. Some run in linear time, but are rather complicated. One such

algorithm is given in Problem 32-2. Here we’ll explore a simpler

algorithm that runs in Θ( n lg n) time.

The idea behind the O( n lg n)-time procedure COMPUTE-

SUFFIX-ARRAY on the following page is to lexicographically sort

substrings of the text with increasing lengths. The procedure makes

several passes over the text, with the substring length doubling each

time. By the ⌈lg n⌉th pass, the procedure is sorting all the suffixes, thereby gaining the information needed to construct the suffix array.

The key to attaining an O( n lg n)-time algorithm will be to have each pass after the first sort in linear time, which will indeed be possible by

using radix sort.

Let’s start with a simple observation. Consider any two strings, s1 and s2. Decompose s1 into a left part s1′ and a right part s1″, so that s1 is s1′ concatenated with s1″. Likewise, let s2 be s2′ concatenated with s2″, where s2′ has the same length as s1′. Now, suppose that s1′ is lexicographically smaller than s2′. Then, regardless of s1″ and s2″, it must be the case that s1 is lexicographically smaller than s2. For example, let s1 = aaz and s2 = aba, and decompose s1 into s1′ = aa and s1″ = z and s2 into s2′ = ab and s2″ = a. Because s1′ = aa is lexicographically smaller than s2′ = ab, it follows that s1 is lexicographically smaller than s2, even though s2″ = a is lexicographically smaller than s1″ = z.

Instead of comparing substrings directly, COMPUTE-SUFFIX-

ARRAY represents substrings of the text with integer ranks. Ranks

have the simple property that one substring is lexicographically smaller

than another if and only if it has a smaller rank. Identical substrings

have equal ranks.

Where do these ranks come from? Initially, the substrings being

considered are just single characters from the text. Assume that, as in

many programming languages, there is a function, ord, that maps a

character to its underlying encoding, which is a positive integer. The ord

function could be the ASCII or Unicode encodings or any other

function that produces a relative ordering of the characters. For

example if all the characters are known to be lowercase letters, then

ord(a) = 1, ord(b) = 2, …, ord(z) = 26 would work. Once the substrings

being considered contain multiple characters, their ranks will be positive

integers less than or equal to n, coming from their relative order after

being sorted. An empty substring always has rank 0, since it is

lexicographically less than any nonempty substring.

COMPUTE-SUFFIX-ARRAY( T, n)
 1  allocate arrays substr-rank[1: n], rank[1: n], and SA[1: n]
 2  for i = 1 to n
 3      substr-rank[ i]. left-rank = ord( T[ i])
 4      if i < n
 5          substr-rank[ i]. right-rank = ord( T[ i + 1])
 6      else substr-rank[ i]. right-rank = 0
 7      substr-rank[ i]. index = i
 8  sort the array substr-rank into monotonically increasing order based on the left-rank attributes, using the right-rank attributes to break ties; if still a tie, the order does not matter
 9  l = 2
10  while l < n
11      MAKE-RANKS( substr-rank, rank, n)
12      for i = 1 to n
13          substr-rank[ i]. left-rank = rank[ i]
14          if i + l ≤ n
15              substr-rank[ i]. right-rank = rank[ i + l]
16          else substr-rank[ i]. right-rank = 0
17          substr-rank[ i]. index = i
18      sort the array substr-rank into monotonically increasing order based on the left-rank attributes, using the right-rank attributes to break ties; if still a tie, the order does not matter
19      l = 2 l
20  for i = 1 to n
21      SA[ i] = substr-rank[ i]. index
22  return SA

MAKE-RANKS(substr-rank, rank, n)
1  r = 1
2  rank[substr-rank[1].index] = r
3  for i = 2 to n
4      if substr-rank[i].left-rank ≠ substr-rank[i – 1].left-rank or substr-rank[i].right-rank ≠ substr-rank[i – 1].right-rank
5          r = r + 1
6      rank[substr-rank[i].index] = r


Figure 32.12 The substr-rank array for indices i = 1, 2, …, 7 after the for loop of lines 2–7 and after the sorting step in line 8 for input string T = ratatat.

The COMPUTE-SUFFIX-ARRAY procedure uses objects

internally to keep track of the relative ordering of the substrings

according to their ranks. When considering substrings of a given length,

the procedure creates and sorts an array substr-rank[1: n] of n objects, each with the following attributes:

left-rank contains the rank of the left part of the substring.

right-rank contains the rank of the right part of the substring.

index contains the index into the text T of where the substring starts.

Before delving into the details of how the procedure works, let’s look

at how it operates on the input text ratatat, with n = 7. Assuming

that the ord function returns the ASCII code for a character, Figure

32.12 shows the substr-rank array after the for loop of lines 2–7 and

then after the sorting step in line 8. The left-rank and right-rank values after lines 2–7 are the ranks of length-1 substrings in positions i and i +

1, for i = 1, 2, …, n. These initial ranks are the ASCII values of the characters. At this point, the left-rank and right-rank values give the ranks of the left and right part of each substring of length 2. Because

the substring starting at index 7 consists of only one character, its right

part is empty and so its right-rank is 0. After the sorting step in line 8,

the substr-rank array gives the relative lexicographic order of all the substrings of length 2, with starting points of these substrings in the

index attribute. For example, the lexicographically smallest length-2

substring is at, which starts at position substr-rank[1]. index, which


equals 2. This substring also occurs at positions substr-rank[2]. index = 4

and substr-rank[3]. index = 6.

The procedure then enters the while loop of lines 10–19. The loop

variable l gives an upper bound on the length of substrings that have been sorted thus far. Entering the while loop, therefore, the substrings of

length at most l = 2 are sorted. The call of MAKE-RANKS in line 11

gives each of these substrings its rank in the sorted order, from 1 up to

the number of unique length-2 substrings, based on the values it finds in

the substr-rank array. With l = 2, MAKE-RANKS sets rank[ i] to be the rank of the length-2 substring T[ i: i + 1]. Figure 32.13 shows these new ranks, which are not necessarily unique. For example, since the length-2

substring at occurs at positions 2, 4, and 6, MAKE-RANKS finds that

substr-rank[1], substr-rank[2], and substr-rank[3] have equal values in left-rank and in right-rank. Since substr-rank[1]. index = 2, substr-rank[2]. index = 4, and substr-rank[3]. index = 6, and since at is the smallest substring in lexicographic order, MAKE-RANKS sets rank[2]

= rank[4] = rank[6] = 1.

Figure 32.13 The rank array after line 11 and the substr-rank array after lines 12–17 and after line 18 in the first iteration of the while loop of lines 10–19, where l = 2.

This iteration of the while loop will sort the substrings of length at

most 4 based on the ranks from sorting the substrings of length at most

2. The for loop of lines 12–17 reconstitutes the substr-rank array, with

substr-rank[ i]. left-rank based on rank[ i] (the rank of the length-2

substring T[ i: i+1]) and substr-rank[ i]. right-rank based on rank[ i + 2] (the rank of the length-2 substring T[ i + 2: i + 3], which is 0 if this substring starts beyond the end of the length- n text). Together, these two ranks give the relative rank of the length-4 substring T[ i: i + 3]. Figure 32.13


shows the effect of lines 12–17. The figure also shows the result of

sorting the substr-rank array in line 18, based on the left-rank attribute, and using the right-rank attribute to break ties. Now substr-rank gives the lexicographically sorted order of all substrings with length at most

4. The next iteration of the while loop, with l = 4, sorts the substrings of

length at most 8 based on the ranks from sorting the substrings of

length at most 4. Figure 32.14 shows the ranks of the length-4

substrings and the substr-rank array before and after sorting. This

iteration is the final one, since with the length n of the text equaling 7,

the procedure has sorted all substrings.

Figure 32.14 The rank array after line 11 and the substr-rank array after lines 12–17 and after line 18 in the second—and final—iteration of the while loop of lines 10–19, where l = 4.

In general, as the loop variable l increases, more and more of the

right parts of the substrings are empty. Therefore, more of the right-rank

values are 0. Because i is at most n within the loop of lines 12–17, the left part of each substring is always nonempty, and so all left-rank

values are always positive.

This example illuminates why the COMPUTE-SUFFIX-ARRAY

procedure works. The initial ranks established in lines 2–7 are simply the

ord values of the characters in the text, and so when line 8 sorts the

substr-rank array, its ordering corresponds to the lexicographic ordering

of the length-2 substrings. Each iteration of the while loop of lines 10–19

takes sorted substrings of length l and produces sorted substrings of length 2 l. Once l reaches or exceeds n, all substrings have been sorted.

Within an iteration of the while loop, the MAKE-RANKS

procedure “re-ranks” the substrings that were sorted, either by line 8

before the first iteration or by line 18 in the previous iteration. MAKE-RANKS takes a substr-rank array, which has been sorted, and fills in an

array rank[1: n] so that rank[ i] is the rank of the i th substring represented in the substr-rank array. Each rank is a positive integer, starting from 1,

and going up to the number of unique substrings of length 2 l.

Substrings with equal values of left-rank and right-rank receive the same rank. Otherwise, a substring that is lexicographically smaller than

another appears earlier in the substr-rank array, and it receives a smaller

rank. Once the substrings of length 2 l are re-ranked, line 18 sorts them

by rank, preparing for the next iteration of the while loop.

Once l reaches or exceeds n and all substrings are sorted, the values

in the index attributes give the starting positions of the sorted

substrings. These indices are precisely the values that constitute the

suffix array.

Let’s analyze the running time of COMPUTE-SUFFIX-ARRAY. Lines 1–7 take Θ(n) time. Line 8 takes O(n lg n) time, using either merge sort (see Section 2.3.1) or heapsort (see Chapter 6). Because the value of l doubles in each iteration of the while loop of lines 10–19, this loop makes ⌈lg n⌉ – 1 iterations. Within each iteration, the call of MAKE-RANKS takes Θ(n) time, as does the for loop of lines 12–17. Line 18, like line 8, takes O(n lg n) time, using either merge sort or heapsort. Finally, the for loop of lines 20–21 takes Θ(n) time. The total time works out to O(n lg² n).

A simple observation allows us to reduce the running time to Θ(n lg n). The values of left-rank and right-rank being sorted in line 18 are always integers in the range 0 to n. Therefore, radix sort can sort the substr-rank array in Θ(n) time by first running counting sort (see Chapter 8) based on right-rank and then running counting sort based on left-rank. Now each iteration of the while loop of lines 10–19 takes only Θ(n) time, giving a total time of Θ(n lg n).
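To sketch the two-pass idea in code (Python here, with hypothetical helper names; the book's procedure sorts objects, whereas this sketch sorts bare (left-rank, right-rank) pairs):

```python
def sort_by_rank_pairs(pairs):
    """Radix sort of (left_rank, right_rank) pairs: a stable counting sort
    on the right rank, then a stable counting sort on the left rank.
    Since ranks are integers in the range 0..n, each pass takes Theta(n) time."""
    def counting_sort(items, key):
        buckets = [[] for _ in range(max(key(x) for x in items) + 1)]
        for x in items:
            buckets[key(x)].append(x)   # appending preserves order: stable
        return [x for b in buckets for x in b]

    by_right = counting_sort(pairs, key=lambda p: p[1])   # break ties first
    return counting_sort(by_right, key=lambda p: p[0])    # primary key last
```

After the second (stable) pass, the pairs appear in lexicographic order, which is exactly the order that line 18 needs.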

Exercise 32.5-2 asks you to make a simple modification to COMPUTE-SUFFIX-ARRAY that allows the while loop of lines 10–19 to iterate fewer than ⌈lg n⌉ – 1 times for certain inputs.
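The whole procedure translates readily into Python. The sketch below is an illustration, not the book's pseudocode: it uses 0-based indexing, folds the re-ranking step into the loop, and relies on the built-in comparison sort, so it runs in O(n lg² n) rather than the Θ(n lg n) achievable with radix sort.

```python
def compute_suffix_array(T):
    """Suffix array of T by prefix doubling (0-based indices)."""
    n = len(T)
    rank = [ord(c) for c in T]                 # ranks of length-1 substrings
    sa = sorted(range(n), key=lambda i: rank[i])
    l = 1
    while l < n:
        # Key for position i: (rank of left part, rank of right part);
        # -1 stands in for the rank of an empty right part.
        key = lambda i: (rank[i], rank[i + l] if i + l < n else -1)
        sa = sorted(range(n), key=key)
        # Re-rank (the role of MAKE-RANKS): equal keys get equal ranks.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        l *= 2
    return sa
```

For T = ratatat this returns [5, 3, 1, 0, 6, 4, 2], the 0-based counterpart of the 1-based suffix array ⟨6, 4, 2, 1, 7, 5, 3⟩ for the running example.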

Computing the LCP array

Recall that LCP[ i] is defined as the length of the longest common prefix of the ( i – 1)st and i th lexicographically smallest suffixes T[ SA[ i – 1]:]

and T[ SA[ i]:]. Because T[ SA[1]:] is the lexicographically smallest suffix, we define LCP[1] to be 0.

In order to compute the LCP array, we need an array rank that is the

inverse of the SA array, just like the final rank array in COMPUTE-SUFFIX-ARRAY: if SA[ i] = j, then rank[ j] = i. That is, we have rank[ SA[ i]] = i for i = 1, 2, …, n. For a suffix T[ i:], the value of rank[ i]

gives the position of this suffix in the lexicographically sorted order.

Figure 32.11 includes the rank array for the ratatat example. For example, the suffix tat is T[5:]. To find this suffix’s position in the sorted order, look up rank[5] = 6.

To compute the LCP array, we will need to determine where in the

lexicographically sorted order a suffix appears, but with its first

character removed. The rank array helps. Consider the i th smallest suffix, which is T[ SA[ i]:]. Dropping its first character gives the suffix T[ SA[ i] + 1:], that is, the suffix starting at position SA[ i] + 1 in the text.

The location of this suffix in the sorted order is given by rank[ SA[ i] + 1].

For example, for the suffix atat, let’s see where to find tat (atat with

its first character removed) in the lexicographically sorted order. The

suffix atat appears in position 2 of the suffix array, and SA[2] = 4.

Thus, rank[ SA[2] + 1] = rank[5] = 6, and sure enough the suffix tat appears in location 6 in the sorted order.

The procedure COMPUTE-LCP on the next page produces the LCP

array. The following lemma helps show that the procedure is correct.

COMPUTE-LCP(T, SA, n)
 1  allocate arrays rank[1:n] and LCP[1:n]
 2  for i = 1 to n
 3      rank[SA[i]] = i                // by definition
 4  LCP[1] = 0                         // also by definition
 5  l = 0                              // initialize length of LCP
 6  for i = 1 to n
 7      if rank[i] > 1
 8          j = SA[rank[i] – 1]        // T[j:] precedes T[i:] lexicographically
 9          m = max {i, j}
10          while m + l ≤ n and T[i + l] == T[j + l]
11              l = l + 1              // next character is in common prefix
12          LCP[rank[i]] = l           // length of LCP of T[j:] and T[i:]
13          if l > 0
14              l = l – 1              // peel off first character of common prefix
15  return LCP

Lemma 32.8

Consider suffixes T[ i – 1:] and T[ i:], which appear at positions rank[ i – 1]

and rank[ i], respectively, in the lexicographically sorted order of suffixes.

If LCP[ rank[ i – 1]] = l > 1, then the suffix T[ i:], which is T[ i – 1:] with its first character removed, has LCP[ rank[ i]] ≥ l – 1.

Proof The suffix T[ i – 1:] appears at position rank[ i – 1] in the lexicographically sorted order. The suffix immediately preceding it in the

sorted order appears at position rank[ i – 1] – 1 and is T[ SA[ rank[ i – 1] –

1]:]. By assumption and the definition of the LCP array, these two

suffixes, T[ SA[ rank[ i–1]–1]:] and T[ i–1:], have a longest common prefix of length l > 1. Removing the first character from each of these suffixes

gives the suffixes T[ SA[ rank[ i – 1] – 1] + 1:] and T[ i:], respectively. These suffixes have a longest common prefix of length l – 1. If T[ SA[ rank[ i – 1]

– 1] + 1:] immediately precedes T[ i:] in the lexicographically sorted order (that is, if rank[ SA[ rank[ i – 1] – 1] + 1] = rank[ i] – 1), then the lemma is proven.

So now assume that T[ SA[ rank[ i – 1] – 1] + 1:] does not immediately precede T[ i:] in the sorted order. Since T[ SA[ rank[ i – 1] – 1]:]

immediately precedes T[ i–1:] and they have the same first l > 1

characters, T[ SA[ rank[ i – 1] – 1] + 1:] must appear in the sorted order somewhere before T[ i:], with one or more other suffixes intervening.

Each of these suffixes must start with the same l – 1 characters as T[ SA[ rank[ i – 1] – 1] + 1:] and T[ i:], for otherwise it would appear either before T[ SA[ rank[ i – 1] – 1] + 1:] or after T[ i:]. Therefore, whichever suffix appears in position rank[ i] – 1, immediately before T[ i:], has at

least its first l – 1 characters in common with T[ i:]. Thus, LCP[ rank[ i]] ≥

l – 1.

The COMPUTE-LCP procedure works as follows. After allocating

the rank and LCP arrays in line 1, lines 2–3 fill in the rank array and line 4 pegs LCP[1] to 0, per the definition of the LCP array.

The for loop of lines 6–14 fills in the rest of the LCP array going by

decreasing-length suffixes. That is, it fills the position of the LCP array

in the order rank[1], rank[2], rank[3], …, rank[ n], with the assignment occurring in line 12. Upon considering a suffix T[ i:], line 8 determines the suffix T[ j:] that immediately precedes T[ i:] in the lexicographically sorted order. At this point, the longest common prefix of T[ j:] and T[ i:]

has length at least l. This property certainly holds upon the first

iteration of the for loop, when l = 0. Assuming that line 12 sets

LCP[ rank[ i]] correctly, line 14 (which decrements l if it is positive) and Lemma 32.8 maintain this property for the next iteration. The longest

common prefix of T[ j:] and T[ i:] might be even longer than the value of l at the start of the iteration, however. Lines 9–11 increment l for each additional character the prefixes have in common so that it achieves the

length of the longest common prefix. The index m is set in line 9 and

used in the test in line 10 to make sure that the test T[ i + l] == T[ j + l]

for extending the longest common prefix does not run off the end of the

text T. When the while loop of lines 10–11 terminates, l is the length of the longest common prefix of T[ j:] and T[ i:].

As a simple aggregate analysis shows, the COMPUTE-LCP

procedure runs in Θ( n) time. Each of the two for loops iterates n times, and so it remains only to bound the total number of iterations by the

while loop of lines 10–11. Each iteration increases l by 1, and the test m + l ≤ n ensures that l is always less than n. Because l has an initial value of 0 and decreases at most n – 1 times in line 14, line 11 increments l

fewer than 2 n times. Thus, COMPUTE-LCP takes Θ( n) time.
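The procedure carries over to Python directly. This 0-based sketch is an illustration (lcp[0] plays the role of LCP[1], and the boundary test is written with two comparisons rather than m + l ≤ n):

```python
def compute_lcp(T, SA):
    """LCP array for text T and its (0-based) suffix array SA.
    lcp[i] is the length of the longest common prefix of the suffixes
    starting at SA[i-1] and SA[i]; lcp[0] = 0 by definition."""
    n = len(T)
    rank = [0] * n
    for i in range(n):
        rank[SA[i]] = i                # inverse of SA
    lcp = [0] * n
    l = 0
    for i in range(n):                 # consider suffixes in text order
        if rank[i] > 0:
            j = SA[rank[i] - 1]        # suffix preceding T[i:] in sorted order
            while i + l < n and j + l < n and T[i + l] == T[j + l]:
                l += 1                 # extend the common prefix
            lcp[rank[i]] = l
            if l > 0:
                l -= 1                 # peel off first character (Lemma 32.8)
    return lcp
```

For the text ratatat and its 0-based suffix array [5, 3, 1, 0, 6, 4, 2], this produces [0, 2, 4, 0, 0, 1, 3], matching the common-prefix lengths of the sorted suffixes at, atat, atatat, ratatat, t, tat, tatat.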

Exercises

32.5-1

Show the substr-rank and rank arrays before each iteration of the while loop of lines 10–19 and after the last iteration of the while loop, the

suffix array SA returned, and the sorted suffixes when COMPUTE-

SUFFIX-ARRAY is run on the text hippityhoppity. Use the

position of each letter in the alphabet as its ord value, so that ord(b) =

2. Then show the LCP array after each iteration of the for loop of lines

6–14 of COMPUTE-LCP given the text hippityhoppity and its

suffix array.

32.5-2

For some inputs, the COMPUTE-SUFFIX-ARRAY procedure can

produce the correct result with fewer than ⌈lg n⌉ – 1 iterations of the

while loop of lines 10–19. Modify COMPUTE-SUFFIX-ARRAY (and,

if necessary, MAKE-RANKS) so that the procedure can stop before

making all ⌈lg n⌉ – 1 iterations in some cases. Describe an input that

allows the procedure to make O(1) iterations. Describe an input that

forces the procedure to make the maximum number of iterations.

32.5-3

Given two texts, T 1 of length n 1 and T 2 of length n 2, show how to use the suffix array and longest common prefix array to find all of the

longest common substrings, that is, the longest substrings that appear in

both T1 and T2. Your algorithm should run in O(n lg n + kl) time, where n = n1 + n2 and there are k such longest substrings, each with length l.

32.5-4

Professor Markram proposes the following method to find the longest

palindromes in a string T[1: n] by using its suffix array and LCP array.

(Recall from Problem 14-2 that a palindrome is a nonempty string that

reads the same forward and backward.)

Let @ be a character that does not appear in T. Construct the

text T′ as the concatenation of T, @, and the reverse of T.

Denote the length of T′ by n′ = 2 n + 1. Create the suffix array SA and LCP array LCP for T′. Since the indices for a


palindrome and its reverse appear in consecutive positions in

the suffix array, find the entries with the maximum LCP value

LCP[ i] such that SA[ i – 1] = n′ – SA[ i] – LCP[ i] + 2. (This constraint prevents a substring—and its reverse—from being

construed as a palindrome unless it really is one.) For each such

index i, one of the longest palindromes is T′[ SA[ i]: SA[ i] +

LCP[ i] – 1].

For example, if the text T is unreferenced, with n = 12, then the

text T′ is unreferenced@decnerefernu, with n′ = 25 and the

following suffix array and LCP array:

The maximum LCP value is achieved at LCP[21] = 5, and SA[20] = 3 =

n′ – SA[21] – LCP[21] + 2. The suffixes of T′ starting at indices SA[20]

and SA[21] are referenced@decnerefernu and refernu, both of

which start with the length-5 palindrome refer.

Alas, this method is not foolproof. Give an input string T that causes

this method to give results that are shorter than the longest palindrome

contained within T, and explain why your input causes the method to

fail.

Problems

32-1 String matching based on repetition factors

Let y^i denote the concatenation of string y with itself i times. For example, (ab)^3 = ababab. We say that a string x ∈ ∑* has repetition factor r if x = y^r for some string y ∈ ∑* and some r > 0. Let ρ(x) denote the largest r such that x has repetition factor r.
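To make the definition concrete, here is a naive Python checker (illustrative only; part (a) asks for something more efficient). It tries each candidate period length |y| = d from smallest to largest, and the smallest period that works yields the largest repetition factor.

```python
def rho(x):
    """Largest r such that x == y * r for some nonempty string y (naive)."""
    m = len(x)
    for d in range(1, m + 1):          # candidate period lengths, smallest first
        if m % d == 0 and x == x[:d] * (m // d):
            return m // d              # smallest period => largest factor
```

For example, rho("ababab") is 3 and rho("abc") is 1.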

a. Give an efficient algorithm that takes as input a pattern P[1: m] and computes the value ρ( P[: i]) for i = 1, 2, …, m. What is the running time of your algorithm?

b. For any pattern P[1: m], let ρ*(P) be defined as max {ρ(P[: i]) : 1 ≤ i ≤ m}. Prove that if the pattern P is chosen randomly from the set of all binary strings of length m, then the expected value of ρ*(P) is O(1).

c. Argue that the procedure REPETITION-MATCHER correctly finds

all occurrences of pattern P[1: m] in text T[1: n] in O( ρ*( P) n + m) time.

(This algorithm is due to Galil and Seiferas. By extending these ideas

greatly, they obtained a linear-time string-matching algorithm that

uses only O(1) storage beyond what is required for P and T.)

REPETITION-MATCHER(T, P, n, m)
 1  k = 1 + ρ*(P)
 2  q = 0
 3  s = 0
 4  while s ≤ n – m
 5      if T[s + q + 1] == P[q + 1]
 6          q = q + 1
 7          if q == m
 8              print “Pattern occurs with shift” s
 9      if q == m or T[s + q + 1] ≠ P[q + 1]
10          s = s + max {1, ⌈q/k⌉}
11          q = 0
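A direct Python transcription of REPETITION-MATCHER may help in working through part (c). This is a sketch with 0-based shifts; the naive ρ* computation inside is illustrative, not the efficient method part (a) asks for.

```python
import math

def repetition_matcher(T, P):
    """Report every shift s (0-based) at which P occurs in T."""
    n, m = len(T), len(P)

    def rho(x):  # naive: largest r with x == y * r for nonempty y
        m2 = len(x)
        for d in range(1, m2 + 1):
            if m2 % d == 0 and x == x[:d] * (m2 // d):
                return m2 // d

    k = 1 + max(rho(P[:i]) for i in range(1, m + 1))   # k = 1 + rho*(P)
    shifts = []
    q = 0                                              # characters matched so far
    s = 0                                              # current shift
    while s <= n - m:
        if T[s + q] == P[q]:
            q += 1
            if q == m:
                shifts.append(s)                       # pattern occurs with shift s
        if q == m or T[s + q] != P[q]:                 # short-circuit avoids T[s + m]
            s += max(1, math.ceil(q / k))              # slide by at least 1
            q = 0
    return shifts
```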

32-2 A linear-time suffix-array algorithm

In this problem, you will develop and analyze a linear-time divide-and-

conquer algorithm to compute the suffix array of a text T[1: n]. As in

Section 32.5, assume that each character in the text is represented by an underlying encoding, which is a positive integer.

The idea behind the linear-time algorithm is to compute the suffix

array for the suffixes starting at 2/3 of the positions in the text, recursing

as needed, use the resulting information to sort the suffixes starting at

the remaining 1/3 of the positions, and then merge the sorted

information in linear time to produce the full suffix array.

For i = 1, 2, …, n, if i mod 3 equals 1 or 2, then i is a sample position, and the suffixes starting at such positions are sample suffixes. Positions

3, 6, 9, … are nonsample positions, and the suffixes starting at nonsample positions are nonsample suffixes.

The algorithm sorts the sample suffixes, sorts the nonsample suffixes

(aided by the result of sorting the sample suffixes), and merges the

sorted sample and nonsample suffixes. Using the example text T =

bippityboppityboo, here is the algorithm in detail, listing substeps

of each of the above steps:

1. The sample suffixes comprise about 2/3 of the suffixes. Sort them by

the following substeps, which work with a heavily modified version of

T and may require recursion. In part (a) of this problem on page 999,

you will show that the orders of the suffixes of T and the suffixes of

the modified version of T are the same.

A. Construct two texts P 1 and P 2 made up of “metacharacters” that

are actually substrings of three consecutive characters from T. We

delimit each such metacharacter with parentheses. Construct

P 1 = ( T[1:3])( T[4:6])( T[7:9]) ⋯ ( T[ n′: n′ + 2]), where n′ is the largest integer congruent to 1, modulo 3, that is less

than or equal to n and T is extended beyond position n with the special character Ø, with encoding 0. With the example text T =

bippityboppityboo, we get that

P 1 = (bip) (pit) (ybo) (ppi) (tyb) (ooØ).

Similarly, construct

P 2 = ( T[2:4])( T[5:7])( T[8:10]) ⋯ ( T[ n″: n″ + 2]), where n″ is the largest integer congruent to 2, modulo 3, that is less

than or equal to n. For our example, we have

P 2 = (ipp) (ity) (bop) (pit) (ybo) (oØØ).


Figure 32.15 Computed values when sorting the sample suffixes of the linear-time suffix-array algorithm for the text T = bippityboppityboo.

If n is a multiple of 3, append the metacharacter (ØØØ) to the end

of P 1. In this way, P 1 is guaranteed to end with a metacharacter

containing Ø. (This property helps in part (a) of this problem.) The

text P 2 may or may not end with a metacharacter containing Ø.

B. Concatenate P 1 and P 2 to form a new text P. Figure 32.15 shows P

for our example, along with the corresponding positions of T.

C. Sort and rank the unique metacharacters of P, with ranks starting

from 1. In the example, P has 10 unique metacharacters: in sorted

order, they are (bip), (bop), (ipp), (ity), (oØØ), (ooØ), (pit),

(ppi), (tyb), (ybo). The metacharacters (pit) and (ybo) each

appear twice.

D. As Figure 32.15 shows, construct a new “text” P′ by renaming each metacharacter in P by its rank. If P contains k unique

metacharacters, then each “character” in P′ is an integer from 1 to

k. The suffix arrays for P and P′ are identical.

E. Compute the suffix array SAP′ of P′. If the characters of P′ (i.e., the ranks of metacharacters in P) are unique, then you can compute

its suffix array directly, since the ordering of the individual

characters gives the suffix array. Otherwise, recurse to compute the

suffix array of P′, treating the ranks in P′ as the input characters in the recursive call. Figure 32.15 shows the suffix array SAP′ for our example. Since the number of metacharacters in P, and hence the


length of P′, is approximately 2 n/3, this recursive subproblem is

smaller than the current problem.

F. From SAP′ and the positions in T corresponding to the sample

positions, compute the list of positions of the sorted sample suffixes

of the original text T. Figure 32.15 shows the list of positions in T

of the sorted sample suffixes in our example.

2. The nonsample suffixes comprise about 1/3 of the suffixes. Using the

sorted sample suffixes, sort the nonsample suffixes by the following

substeps.

Figure 32.16 The ranks r1 through rn+2 for the text T = bippityboppityboo with n = 17.

G. Extending the text T by the two special characters ØØ, so that T

now has n + 2 characters, consider each suffix T[ i:] for i = 1, 2, …, n

+ 2. Assign a rank ri to each suffix T[ i:]. For the two special characters ØØ, set rn+1 = rn+2 = 0. For the sample positions of T, base the rank on the list of sorted sample positions of T. The rank

is currently undefined for the nonsample positions of T. For these

positions, set ri = ☐. Figure 32.16 shows the ranks for T =

bippityboppityboo with n = 17.

H. Sort the nonsample suffixes by comparing tuples ( T[ i], ri+1). In our example, we get T[15:] < T[12:] < T[9:] < T[3:] < T[6:] because (b, 6) < (i, 10) < (o, 9) < (p, 8) < (t, 12).

3. Merge the sorted sets of suffixes. From the sorted set of suffixes,

determine the suffix array of T.

This completes the description of a linear-time algorithm for computing

suffix arrays. The following parts of this problem ask you to show that

certain steps of this algorithm are correct and to analyze the algorithm’s running time.

a. Define a nonempty suffix at position i of the text P created in substep B as all metacharacters from position i of P up to and including the

first metacharacter of P in which Ø appears or the end of P. In the

example shown in Figure 32.15, the nonempty suffixes of P starting at positions 1, 4, and 11 of P are (bip) (pit) (ybo) (ppi) (tyb) (ooØ),

(ppi) (tyb) (ooØ), and (ybo) (oØØ), respectively. Prove that the order

of suffixes of P is the same as the order of its nonempty suffixes.

Conclude that the order of suffixes of P gives the order of the sample

suffixes of T. ( Hint: If P contains duplicate metacharacters, consider separately the cases in which two suffixes both start in P 1, both start

in P 2, and one starts in P 1 and the other starts in P 2. Use the property that Ø appears in the last metacharacter of P 1.)

b. Show how to perform substep C in Θ( n) time, bearing in mind that in a recursive call, the characters in T are actually ranks in P′ in the

caller.

c. Argue that the tuples in substep H are unique. Then show how to

perform this substep in Θ( n) time.

d. Consider two suffixes T[ i:] and T[ j:], where T[ i:] is a sample suffix and T[ j:] is a nonsample suffix. Show how to determine in Θ(1) time

whether T[ i:] is lexicographically smaller than T[ j:]. ( Hint: Consider separately the cases in which i mod 3 = 1 and i mod 3 = 2. Compare

tuples whose elements are characters in T and ranks as shown in

Figure 32.16. The number of elements per tuple may depend on

whether i mod 3 equals 1 or 2.) Conclude that step 3 can be performed

in Θ( n) time.

e. Justify the recurrence T( n) ≤ T (2 n/3 + 2) + Θ( n) for the running time of the full algorithm, and show that its solution is O( n). Conclude that the algorithm runs in Θ( n) time.

32-3 Burrows-Wheeler transform

The Burrows-Wheeler transform, or BWT, for a text T is defined as follows. First, append a new character that compares as

lexicographically less than every character of T, and denote this

character by $ and the resulting string by T′. Letting n be the length of T′, create n rows of characters, where each row is one of the n cyclic rotations of T′. Next, sort the rows lexicographically. The BWT is then

the string of n characters in the rightmost column, read top to bottom.

For example, let T = rutabaga, so that T′ = rutabaga$. The

cyclic rotations are

rutabaga$

utabaga$r

tabaga$ru

abaga$rut

baga$ruta

aga$rutab

ga$rutaba

a$rutabag

$rutabaga

Sorting the rows and numbering the sorted rows gives

1 $rutabaga

2 a$rutabag

3 abaga$rut

4 aga$rutab

5 baga$ruta

6 ga$rutaba

7 rutabaga$

8 tabaga$ru

9 utabaga$r

The BWT is the rightmost column, agtbaa$ur. (The row numbering

will be helpful in understanding how to compute the inverse BWT.)
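The construction just described is short enough to write out directly. This Python sketch (illustrative; it builds all n rotations explicitly, so it uses quadratic space, which is fine for small examples) computes the BWT exactly as defined:

```python
def bwt(T):
    """Burrows-Wheeler transform of T via sorted cyclic rotations.
    The sentinel '$' must sort before every character of T, which holds
    for lowercase letters under ASCII ordering."""
    S = T + "$"
    n = len(S)
    rotations = sorted(S[i:] + S[:i] for i in range(n))  # the n cyclic rotations
    return "".join(row[-1] for row in rotations)         # rightmost column

```

For example, bwt("rutabaga") returns "agtbaa$ur", the rightmost column shown above.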

The BWT has applications in bioinformatics, and it can also be a

step in text compression. That is because it tends to place identical

characters together, as in the BWT of rutabaga, which places two of


the instances of a together. When identical characters are placed

together, or even nearby, additional means of compressing become

available. Following the BWT, combinations of move-to-front encoding,

run-length encoding, and Huffman coding (see Section 15.3) can provide significant text compression. Compression ratios with the BWT

tend to improve as the text length increases.

a. Given the suffix array for T′, show how to compute the BWT in Θ( n) time.

In order to decompress, the BWT must be invertible. Assuming that

the alphabet size is constant, the inverse BWT can be computed in Θ( n)

time from the BWT. Let’s look at the BWT of rutabaga, denoting it

by BWT[1: n]. Each character in the BWT has a unique lexicographic rank from 1 to n. Denote the rank of BWT[ i] by rank[ i]. If a character appears multiple times in the BWT, each instance of the character has a

rank 1 greater than the previous instance of the character. Here are

BWT and rank for rutabaga:

  i        1  2  3  4  5  6  7  8  9
  BWT[i]   a  g  t  b  a  a  $  u  r
  rank[i]  2  6  8  5  3  4  1  9  7

For example, rank[1] = 2 because BWT[1] = a and the only character

that precedes the first a lexicographically is $ (which we defined to

precede all other characters, so that $ has rank 1). Next, we have rank[2]

= 6 because BWT[2] = g and five characters in the BWT precede g

lexicographically: $, the three instances of a, and b. Jumping ahead to

rank[5] = 3, that is because BWT[5] = a, and because this a is the second instance of a in the BWT, its rank value is 1 greater than the rank value for the previous instance of a, in position 1.

There is enough information in BWT and rank to reconstruct T

from back to front. Suppose that you know the rank r of a character c

in T′. Then c is the first character in row r of the sorted cyclic rotations.

The last character in row r must be the character that precedes c in T′.

But you know which character is the last character in row r, because it is

BWT[ r]. To reconstruct T′ from back to front, start with $, which you

can find in BWT. Then work backward using BWT and rank to reconstruct T′.

Let’s see how this strategy works for rutabaga. The last character

of T′, $, appears in position 7 of BWT. Since rank[7] = 1, row 1 of the sorted cyclic rotations of T′ begins with $. The character that precedes $

in T′ is the last character in row 1, which is BWT[1]: a. Now we know

that the last two characters of T′ are a$. Looking up rank[1], it equals 2, so that row 2 of the sorted cyclic rotations of T′ begins with a. The last

character in row 2 precedes a in T′, and that character is BWT[2] = g.

Now we know that the last three characters of T′ are ga$. Continuing

on, we have rank[2] = 6, so that row 6 of the sorted cyclic rotations begins with g. The character preceding g in T′ is BWT[6] = a, and so

the last four characters of T′ are aga$. Because rank[6] = 4, a begins row 4 of the sorted cyclic rotations of T′. The character preceding a in

T′ is the last character in row 4, BWT[4] = b, and the last five characters of T′ are baga$. And so on, until all n characters of T′ have been identified, from back to front.
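The back-to-front walkthrough above can be sketched in Python. This illustration computes the rank array with a stable sort, which takes O(n lg n) time; parts (b) and (c) below ask for Θ(n) pseudocode, so treat this only as a model of the reconstruction logic:

```python
def inverse_bwt(bwt_str):
    """Reconstruct T' (including the final '$') from its BWT."""
    n = len(bwt_str)
    # rank[i]: lexicographic rank (1 to n) of BWT[i]; a stable sort gives each
    # repeated character a rank 1 greater than its previous instance.
    order = sorted(range(n), key=lambda i: bwt_str[i])
    rank = [0] * n
    for pos, i in enumerate(order):
        rank[i] = pos + 1
    # Work backward from '$': row r of the sorted rotations starts with the
    # current character, and BWT[r] is the character that precedes it in T'.
    r = rank[bwt_str.index("$")]
    out = []
    for _ in range(n - 1):
        out.append(bwt_str[r - 1])     # character preceding the current one
        r = rank[r - 1]
    return "".join(reversed(out)) + "$"
```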

b. Given the array BWT[1: n], write pseudocode to compute the array rank[1: n] in Θ(n) time, assuming that the alphabet size is constant.

c. Given the arrays BWT[1: n] and rank[1: n], write pseudocode to compute T′ in Θ( n) time.

Chapter notes

The relation of string matching to the theory of finite automata is

discussed by Aho, Hopcroft, and Ullman [5]. The Knuth-Morris-Pratt algorithm [267] was invented independently by Knuth and Pratt and by Morris, but they published their work jointly. Matiyasevich [317] earlier discovered a similar algorithm, which applied only to an alphabet with

two characters and was specified for a Turing machine with a two-

dimensional tape. Reingold, Urban, and Gries [377] give an alternative treatment of the Knuth-Morris-Pratt algorithm. The Rabin-Karp

algorithm was proposed by Karp and Rabin [250]. Galil and Seiferas

[173] give an interesting deterministic linear-time string-matching

algorithm that uses only O(1) space beyond that required to store the pattern and text.

The suffix-array algorithm in Section 32.5 is by Manber and Myers

[312], who first proposed the notion of suffix arrays. The linear-time algorithm to compute the longest common prefix array presented here is

by Kasai et al. [252]. Problem 32-2 is based on the DC3 algorithm by Kärkkäinen, Sanders, and Burkhardt [245]. For a survey of suffix-array algorithms, see the article by Puglisi, Smyth, and Turpin [370]. To learn more about the Burrows-Wheeler transform from Problem 32-3, see the

articles by Burrows and Wheeler [78] and Manzini [314].

1 For suffix arrays, the preprocessing time of O( n lg n) comes from the algorithm presented in

Section 32.5. It can be reduced to Θ( n) by using the algorithm in Problem 32-2. The factor k in the matching time denotes the number of occurrences of the pattern in the text.

2 We write Θ( n − m + 1) instead of Θ( n − m) because s takes on n − m + 1 different values. The “+1” is significant in an asymptotic sense because when m = n, computing the lone ts value takes Θ(1) time, not Θ(0) time.

3 Informally, lexicographic order is “alphabetical order” in the underlying character set. A more precise definition of lexicographic order appears in Problem 12-2 on page 327.

4 Why keep saying “length at most”? Because for a given value of l, a substring of length l starting at position i is T[ i: i + l – 1]. If i + l − 1 > n, then the substring cuts off at the end of the text.

33 Machine-Learning Algorithms

Machine learning may be viewed as a subfield of artificial intelligence.

Broadly speaking, artificial intelligence aims to enable computers to

carry out complex perception and information-processing tasks with

human-like performance. The field of AI is vast and uses many different

algorithmic methods.

Machine learning is rich and fascinating, with strong ties to statistics

and optimization. Technology today produces enormous amounts of

data, providing myriad opportunities for machine-learning algorithms

to formulate and test hypotheses about patterns within the data. These

hypotheses can then be used to make predictions about the

characteristics or classifications in new data. Because machine learning

is particularly good with challenging tasks involving uncertainty, where

observed data follows unknown rules, it has markedly transformed

fields such as medicine, advertising, and speech recognition.

This chapter presents three important machine-learning algorithms:

k-means clustering, multiplicative weights, and gradient descent. You

can view each of these tasks as a learning problem, whereby an

algorithm uses the data collected so far to produce a hypothesis that

describes the regularities learned and/or makes predictions about new

data. The boundaries of machine learning are imprecise and evolving—

some might say that the k-means clustering algorithm should be called

“data science” and not “machine learning,” and gradient descent,

though an immensely important algorithm for machine learning, also

has a multitude of applications outside of machine learning (most

notably for optimization problems).

Machine learning typically starts with a training phase followed by a prediction phase in which predictions are made about new data. For

online learning, the training and prediction phases are intermingled. The

training phase takes as input training data, where each input data point

has an associated output or label; the label might be a category name or

some real-valued attribute. It then produces as an output one or more

hypotheses about how the labels depend on the attributes of the input

data points. Hypotheses can take many forms, typically some type of

formula or algorithm. The learning algorithm used is often a form of

gradient descent. The prediction phase then uses the hypothesis on new

data in order to make predictions regarding the labels of new data

points.

The type of learning just described is known as supervised learning,

since it starts with a set of inputs that are each labeled. As an example,

consider a machine-learning algorithm to recognize spam emails. The

training data comprises a collection of emails, each of which is labeled

either “spam” or “not spam.” The machine-learning algorithm frames a

hypothesis, possibly a rule of the form “if an email has one of a set of

words, then it is likely to be spam.” Or it might learn rules that assign a

spam score to each word and then evaluates a document by the sum of

the spam scores of its constituent words, so that a document with a total

score above a certain threshold value is classified as spam. The machine-

learning algorithm can then predict whether a new email is spam or not.

A second form of machine learning is unsupervised learning, where

the training data is unlabeled, as in the clustering problem of Section

33.1. Here the machine-learning algorithm produces hypotheses

regarding the centers of groups of input data points.

A third form of machine learning (not covered further here) is

reinforcement learning, where the machine-learning algorithm takes

actions in an environment, receives feedback for those actions from the

environment, and then updates its model of the environment based on

the feedback. The learner is in an environment that has some state, and

the actions of the learner have an effect on that state. Reinforcement

learning is a natural choice for situations such as game playing or

operating a self-driving car.

Sometimes the goal in a supervised machine-learning application is

not making accurate predictions of labels for new examples, but rather

performing causal inference: finding an explanatory model that

describes how the various features of an input data point affect its

associated label. Finding a model that fits a given set of training data

well can be tricky. It may involve sophisticated optimization methods

that need to balance between producing a hypothesis that fits the data

well and producing a hypothesis that is simple.

This chapter focuses on three problem domains: finding hypotheses

that group the input data points well (using a clustering algorithm),

learning which predictors (experts) to rely upon for making predictions

in an online learning problem (using the multiplicative-weights

algorithm), and fitting a model to data (using gradient descent).

Section 33.1 considers the clustering problem: how to divide a given set of n training data points into a given number k of groups, or

“clusters,” based on a measure of how similar (or more accurately, how

dissimilar) points are to each other. The approach is iterative, beginning

with an arbitrary initial clustering and incorporating successive

improvements until no further improvements occur. Clustering is often

used as an initial step when working on a machine-learning problem to

discover what structure exists in the data.

Section 33.2 shows how to make online predictions quite accurately when you have a set of predictors, often called “experts,” to rely on,

many of which might be poor predictors, but some of which are good

predictors. At first, you do not know which predictors are poor and

which are good. The goal is to make predictions on new examples that

are nearly as good as the predictions made by the best predictor. We

study an effective multiplicative-weights prediction method that

associates a positive real weight with each predictor and multiplicatively

decreases the weights associated with predictors when they make poor

predictions. The model in this section is online (see Chapter 27): at each step, we do not know anything about the future examples. In addition,

we are able to make predictions even in the presence of adversarial

experts, who are collaborating against us, a situation that actually

happens in game-playing settings.

Finally, Section 33.3 introduces gradient descent, a powerful optimization technique used to find parameter settings in machine-learning models. Gradient descent also has many applications outside of

machine learning. Intuitively, gradient descent finds the value that

produces a local minimum for a function by “walking downhill.” In a

learning application, a “downhill step” is a step that adjusts hypothesis

parameters so that the hypothesis does better on the given set of labeled

examples.

This chapter makes extensive use of vectors. In contrast to the rest of

the book, vector names in this chapter appear in boldface, such as x, to

more clearly delineate which quantities are vectors. Components of

vectors do not appear in boldface, so if vector x has d dimensions, we

might write x = ( x 1, x 2, …, xd).

33.1 Clustering

Suppose that you have a large number of data points (examples), and

you wish to group them into classes based on how similar they are to

each other. For example, each data point might represent a celestial star,

giving its temperature, size, and spectral characteristics. Or, each data

point might represent a fragment of recorded speech. Grouping these

speech fragments appropriately might reveal the set of accents of the

fragments. Once a grouping of the training data points is found, new

data can be placed into an appropriate group, facilitating star-type

recognition or speech recognition.

These situations, along with many others, fall under the umbrella of

clustering. The input to a clustering problem is a set of n examples (objects) and an integer k, with the goal of dividing the examples into at

most k disjoint clusters such that the examples in each cluster are

similar to each other. The clustering problem has several variations. For

example, the integer k might not be given, but instead arises out of the

clustering procedure. In this section we presume that k is given.

Feature vectors and similarity

Let’s formally define the clustering problem. The input is a set of

n examples. Each example has a set of attributes in common with all other examples, though the attribute values may vary among examples.

For example, the clustering problem shown in Figure 33.1 clusters n =

49 examples—48 state capitals plus the District of Columbia—into k =

4 clusters. Each example has two attributes: the latitude and longitude

of the capital. In a given clustering problem, each example has d

attributes, with an example x specified by a d-dimensional feature vector

x = ( x 1, x 2, …, xd).

Here, xa for a = 1, 2, …, d is a real number giving the value of attribute a for example x. We call x the point in ℝ d representing the example. For the example in Figure 33.1, each capital x has its latitude in x 1 and its longitude in x 2.

In order to cluster similar points together, we need to define

similarity. Instead, let’s define the opposite: the dissimilarity Δ(x, y) of points x and y is the squared Euclidean distance between them:

Δ(x, y) = ( x 1 − y 1)2 + ( x 2 − y 2)2 + ⋯ + ( xd − yd)2.    (33.1)

Of course, for Δ(x, y) to be well defined, all attribute values must be

present. If any are missing, then you might just ignore that example, or

you could fill in a missing attribute value with the median value for that

attribute.
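As a concrete sketch in plain Python (function names are illustrative, not from the text), the dissimilarity of equation (33.1), together with the median-imputation idea for missing attribute values, might look like:

```python
from statistics import median

def dissimilarity(x, y):
    """Squared Euclidean distance between two d-dimensional points,
    the measure of equation (33.1)."""
    return sum((xa - ya) ** 2 for xa, ya in zip(x, y))

def fill_missing(examples):
    """Replace each missing attribute value (represented here as None)
    with the median of the values present for that attribute."""
    d = len(examples[0])
    meds = [median(x[a] for x in examples if x[a] is not None)
            for a in range(d)]
    return [tuple(meds[a] if x[a] is None else x[a] for a in range(d))
            for x in examples]
```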

The attribute values are often “messy” in other ways, so that some

“data cleaning” is necessary before the clustering algorithm is run. For

example, the scale of attribute values can vary widely across attributes.

In the example of Figure 33.1, the scales of the two attributes vary by a factor of 2, since latitude ranges from −90 to +90 degrees but longitude

ranges from −180 to +180 degrees. You can imagine other scenarios

where the differences in scales are even greater. If the examples contain

information about students, one attribute might be grade-point average

but another might be family income. Therefore, the attribute values are

usually scaled or normalized, so that no single attribute can dominate the others when computing dissimilarities. One way to do so is by

scaling attribute values with a linear transform so that the minimum

value becomes 0 and the maximum value becomes 1. If the attribute

values are binary values, then no scaling may be needed. Another

option is scaling so that the values for each attribute have mean 0 and

unit variance. Sometimes it makes sense to choose the same scaling rule

for several related attributes (for example, if they are lengths measured

to the same scale).
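The two scaling rules just mentioned can be sketched as follows (a minimal illustration for a single attribute; it assumes the attribute’s values are not all equal, since otherwise both rules divide by zero):

```python
def min_max_scale(values):
    """Linearly rescale one attribute so its minimum maps to 0
    and its maximum maps to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale one attribute to mean 0 and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]
```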

Figure 33.1 The iterations of Lloyd’s procedure when clustering the capitals of the lower 48

states and the District of Columbia into k = 4 clusters. Each capital has two attributes: latitude and longitude. Each iteration reduces the value f, measuring the sum of squares of distances of all capitals to their cluster centers, until the value of f does not change. (a) The initial four clusters, with the capitals of Arkansas, Kansas, Louisiana, and Tennessee chosen as centers. (b)–

(k) Iterations of Lloyd’s procedure. (l) The 11th iteration results in the same value of f as the 10th iteration in part (k), and so the procedure terminates.

Also, the choice of dissimilarity measure is somewhat arbitrary. The

use of the sum of squared differences as in equation (33.1) is not

required, but it is a conventional choice and mathematically convenient.

For the example of Figure 33.1, you might use the actual distance between capitals rather than equation (33.1).

Clusterings

With the notion of similarity (actually, dissimilarity) defined, let’s see how to define clusters of similar points. Let S denote the given set of n

points in ℝ d. In some applications the points are not necessarily

distinct, so that S is a multiset rather than a set.

Because the goal is to create k clusters, we define a k-clustering of S

as a decomposition of S into a sequence 〈 S(1), S(2), …, S( k)〉 of k disjoint subsets, or clusters, so that

S = S(1) ⋃ S(2) ⋃ ⋯ ⋃ S( k).

A cluster may be empty, for example if k > 1 but all of the points in S

have the same attribute values.

There are many ways to define a k-clustering of S and many ways to

evaluate the quality of a given k-clustering. We consider here only k-

clusterings of S that are defined by a sequence C of k centers C = 〈c(1), c(2), …, c( k)〉,

where each center is a point in ℝ d, and the nearest-center rule says that a point x may belong to cluster S(ℓ) if the center of no other cluster is

closer to x than the center c(ℓ) of S(ℓ):

x ∈ S(ℓ) only if Δ(x, c(ℓ)) = min {Δ(x, c( j)): 1 ≤ j ≤ k}.

A center can be anywhere, and not necessarily a point in S.

Ties are possible and must be broken so that each point lies in

exactly one cluster. In general, ties may be broken arbitrarily, although

we’ll need the property that we never change which cluster a point x is

assigned to unless the distance from x to its new cluster center is strictly

smaller than the distance from x to its old cluster center. That is, if the

current cluster has a center that is one of the closest cluster centers to x,

then don’t change which cluster x is assigned to.

The k-means problem is then the following: given a set S of n points and a positive integer k, find a sequence C = 〈c(1), c(2), …, c( k)〉 of k center points minimizing the sum f( S, C) of the squared distance from each point to its nearest center, where

f( S, C) = Σ_{x ∈ S} min {Δ(x, c( j)): 1 ≤ j ≤ k}
         = Σ_{ℓ=1}^{k} Σ_{x ∈ S(ℓ)} Δ(x, c(ℓ)).    (33.2)

In the second line, the k-clustering 〈 S(1), S(2), …, S( k)〉 is defined by the centers C and the nearest-center rule. See Exercise 33.1-1 for an

alternative formulation based on pairwise interpoint distances.

Is there a polynomial-time algorithm for the k-means problem?

Probably not, because it is NP-hard [310]. As we’ll see in Chapter 34, NP-hard problems have no known polynomial-time algorithm, but

nobody has ever proven that polynomial-time algorithms for NP-hard

problems cannot exist. Although we know of no polynomial-time

algorithm that finds the global minimum over all clusterings (according

to equation (33.2)), we can find a local minimum.

Lloyd [304] proposed a simple procedure that finds a sequence C of k centers that yields a local minimum of f( S, C). A local minimum in the k-means problem satisfies two simple properties: each cluster has an

optimal center (defined below), and each point is assigned to the cluster

(or one of the clusters) with the closest center. Lloyd’s procedure finds a

good clustering—possibly optimal—that satisfies these two properties.

These properties are necessary, but not sufficient, for optimality.

Optimal center for a given cluster

In an optimal solution to the k-means problem, each center point must

be the centroid, or mean, of the points in its cluster. The centroid is a d-

dimensional point, where the value in each dimension is the mean of the

values of all the points in the cluster in that dimension (that is, the mean


of the corresponding attribute values in the cluster). That is, if c(ℓ) is the centroid for cluster S(ℓ), then for attributes a = 1, 2, …, d, we have

ca(ℓ) = (1/|S(ℓ)|) Σ_{x ∈ S(ℓ)} xa.

Over all attributes, we write

c(ℓ) = (1/|S(ℓ)|) Σ_{x ∈ S(ℓ)} x.    (33.3)

Theorem 33.1

Given a nonempty cluster S(ℓ), its centroid (or mean) is the unique choice for the cluster center c(ℓ) ∈ ℝ d that minimizes

Σ_{x ∈ S(ℓ)} Δ(x, c(ℓ)).

Proof We wish to minimize, by choosing c(ℓ) ∈ ℝ d, the sum

Σ_{x ∈ S(ℓ)} Δ(x, c(ℓ)) = Σ_{x ∈ S(ℓ)} Σ_{a=1}^{d} ( xa − ca(ℓ))2.

For each attribute a, the term summed is a convex quadratic function in ca(ℓ). To minimize this function, take its derivative with respect to ca(ℓ) and set it to 0:

Σ_{x ∈ S(ℓ)} 2( ca(ℓ) − xa) = 0,

or, equivalently,

ca(ℓ) = (1/|S(ℓ)|) Σ_{x ∈ S(ℓ)} xa.

Since the minimum is obtained uniquely when each coordinate ca(ℓ) of c(ℓ) is the average of the corresponding coordinate for x ∈ S(ℓ), the overall

minimum is obtained when c(ℓ) is the centroid of the points x, as in

equation (33.3).

Optimal clusters for given centers

The following theorem shows that the nearest-center rule—assigning

each point x to one of the clusters whose center is nearest to x—yields

an optimal solution to the k-means problem.

Theorem 33.2

Given a set S of n points and a sequence 〈c(1), c(2), …, c( k)〉 of k centers, a clustering 〈 S(1), S(2), …, S( k)〉 minimizes

Σ_{ℓ=1}^{k} Σ_{x ∈ S(ℓ)} Δ(x, c(ℓ))    (33.4)

if and only if it assigns each point x ∈ S to a cluster S(ℓ) that minimizes Δ(x, c(ℓ)).

Proof The proof is straightforward: each point x ∈ S contributes exactly once to the sum (33.4), and choosing to put x in a cluster whose

center is nearest minimizes the contribution from x.

Lloyd’s procedure

Lloyd’s procedure just iterates two operations—assigning points to

clusters based on the nearest-center rule, followed by recomputing the

centers of clusters to be their centroids—until the results converge. Here

is Lloyd’s procedure:

Input: A set S of points in ℝ d, and a positive integer k.

Output: A k-clustering 〈 S(1), S(2), …, S( k)〉 of S with a sequence of centers 〈c(1), c(2), …, c( k)〉.

1. Initialize centers: Generate an initial sequence 〈c(1), c(2), …, c( k)〉 of k centers by picking k points independently from S at random. (If the points are not necessarily distinct, see Exercise

33.1-3.) Assign all points to cluster S(1) to begin.

2. Assign points to clusters: Use the nearest-center rule to define the

clustering 〈 S(1), S(2), …, S( k)〉. That is, assign each point x ∈ S

to a cluster S(ℓ) having a nearest center (breaking ties arbitrarily,

but not changing the assignment for a point x unless the new

cluster center is strictly closer to x than the old one).

3. Stop if no change: If step 2 did not change the assignments of

any points to clusters, then stop and return the clustering 〈 S(1),

S(2), …, S( k)〉 and the associated centers 〈c(1), c(2), …, c( k)〉.

Otherwise, go to step 4.

4. Recompute centers as centroids: For ℓ = 1, 2, …, k, compute the

center c(ℓ) of cluster S(ℓ) as the centroid of the points in S(ℓ). (If

S(ℓ) is empty, let c(ℓ) be the zero vector.) Then go to step 2.
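The four steps above can be sketched in Python. This is an illustrative implementation, not the book’s pseudocode: for simplicity it picks the k initial centers by sampling without replacement, and all names are ours.

```python
import random

def dissimilarity(x, y):
    """Squared Euclidean distance, as in equation (33.1)."""
    return sum((xa - ya) ** 2 for xa, ya in zip(x, y))

def centroid(cluster):
    """Componentwise mean of a nonempty list of points."""
    n, d = len(cluster), len(cluster[0])
    return tuple(sum(x[a] for x in cluster) / n for a in range(d))

def lloyd(S, k, seed=None):
    """Lloyd's procedure: alternate nearest-center assignment and
    centroid recomputation until no assignment changes."""
    rng = random.Random(seed)
    centers = rng.sample(S, k)        # step 1: initial centers from S
    assignment = [0] * len(S)         # all points start in cluster S(1)
    while True:
        changed = False
        for i, x in enumerate(S):     # step 2: nearest-center rule
            dists = [dissimilarity(x, c) for c in centers]
            best = min(range(k), key=lambda l: dists[l])
            # Reassign only if the new center is strictly closer.
            if dists[best] < dists[assignment[i]]:
                assignment[i] = best
                changed = True
        if not changed:               # step 3: stop if no change
            break
        for l in range(k):            # step 4: centers become centroids
            members = [x for i, x in enumerate(S) if assignment[i] == l]
            centers[l] = (centroid(members) if members
                          else tuple(0.0 for _ in S[0]))
    clusters = [[x for i, x in enumerate(S) if assignment[i] == l]
                for l in range(k)]
    return clusters, centers
```

Because reassignment requires a strict improvement, each full iteration except the last strictly decreases f( S, C), matching the termination argument in the text.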

It is possible for some of the clusters returned to be empty, particularly

if many of the input points are identical.

Lloyd’s procedure always terminates. By Theorem 33.1, recomputing

the centers of each cluster as the cluster centroid cannot increase f( S, C). Lloyd’s procedure ensures that a point is reassigned to a different cluster only when such an operation strictly decreases f( S, C). Thus each iteration of Lloyd’s procedure, except the last iteration, must strictly

decrease f( S, C). Since there are only a finite number of possible k-clusterings of S (at most k^n), the procedure must terminate.

Furthermore, once one iteration of Lloyd’s procedure yields no decrease

in f, further iterations would not change anything, and the procedure

can stop at this locally optimum assignment of points to clusters.

If Lloyd’s procedure really required k^n iterations, it would be

impractical. In practice, it sometimes suffices to terminate the procedure

when the percentage decrease in f( S, C) in the latest iteration falls below

a predetermined threshold. Because Lloyd’s procedure is guaranteed to find only a locally optimal clustering, one approach to finding a good

clustering is to run Lloyd’s procedure many times with different

randomly chosen initial centers, taking the best result.

The running time of Lloyd’s procedure is proportional to the number

T of iterations. In one iteration, assigning points to clusters based on the nearest-center rule requires O( dkn) time, and recomputing new centers for each cluster requires O( dn) time (because each point is in one cluster). The overall running time of the k-means procedure is thus

O( Tdkn).

Lloyd’s algorithm illustrates an approach common to many

machine-learning algorithms:

First, define a hypothesis space in terms of an appropriate sequence

θ of parameters, so that each θ is associated with a specific hypothesis hθ. (For the k-means problem, θ is a dk-dimensional vector, equivalent to C, containing the d-dimensional center of each of the k clusters, and hθ is the hypothesis that each data point x should be grouped with a cluster having a center closest to

x.)

Second, define a measure f( E, θ) describing how poorly hypothesis hθ fits the given training data E. Smaller values of f( E, θ) are better, and a (locally) optimal solution (locally) minimizes f( E, θ).

(For the k-means problem, f( E, θ) is just f( S, C).) Third, given a set of training data E, use a suitable optimization

procedure to find a value of θ* that minimizes f( E, θ*), at least locally. (For the k-means problem, this value of θ* is the sequence

C of k center points returned by Lloyd’s algorithm.)

Return θ* as the answer.

In this framework, we see that optimization becomes a powerful tool for

machine learning. Using optimization in this way is flexible. For

example, regularization terms can be incorporated in the function to be

minimized, in order to penalize hypotheses that are “too complicated”

and that “overfit” the training data. (Regularization is a complex topic that isn’t pursued further here.)

Examples

Figure 33.1 demonstrates Lloyd’s procedure on a set of n = 49 cities: 48

U.S. state capitals and the District of Columbia. Each city has d = 2

dimensions: latitude and longitude. The initial clustering in part (a) of

the figure has the initial cluster centers arbitrarily chosen as the capitals

of Arkansas, Kansas, Louisiana, and Tennessee. As the procedure

iterates, the value of the function f decreases, until the 11th iteration in

part (l), where it remains the same as in the 10th iteration in part (k).

Lloyd’s procedure then terminates with the clusters shown in part (l).

As Figure 33.2 shows, Lloyd’s procedure can also apply to “vector quantization.” Here, the goal is to reduce the number of distinct colors

required to represent a photograph, thereby allowing the photograph to

be greatly compressed (albeit in a lossy manner). In part (a) of the

figure, an original photograph 700 pixels wide and 500 pixels high uses

24 bits (three bytes) per pixel to encode a triple of red, green, and blue

(RGB) primary color intensities. Parts (b)–(e) of the figure show the

results of using Lloyd’s procedure to compress the picture from an initial

space of 2^24 possible values per pixel to a space of only k = 4, k = 16, k

= 64, or k = 256 possible values per pixel; these k values are the cluster centers. The photograph can then be represented with only 2, 4, 6, or 8

bits per pixel, respectively, instead of the 24 bits per pixel needed by the

initial photograph. An auxiliary table, the “palette,” accompanies the

compressed image; it holds the k 24-bit cluster centers and is used to map each pixel value to its 24-bit cluster center when the photo is

decompressed.
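In code, the compression step amounts to replacing each pixel by the index of its nearest palette entry. A minimal sketch (names are illustrative; a real compressor would obtain the palette by running Lloyd’s procedure on the pixels):

```python
def nearest_palette_index(pixel, palette):
    """Index of the palette color (cluster center) with the smallest
    squared Euclidean distance to this RGB pixel."""
    def d2(x, y):
        return sum((xa - ya) ** 2 for xa, ya in zip(x, y))
    return min(range(len(palette)), key=lambda l: d2(pixel, palette[l]))

def quantize(pixels, palette):
    """Replace each 24-bit RGB pixel by its palette index; with k
    palette entries, each index fits in ceil(lg k) bits, and the
    palette maps indices back to RGB triples on decompression."""
    return [nearest_palette_index(p, palette) for p in pixels]
```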

Exercises

33.1-1

Show that the objective function f( S, C) of equation (33.2) may be alternatively written as

f( S, C) = Σ_{ℓ=1}^{k} (1/(2|S(ℓ)|)) Σ_{x ∈ S(ℓ)} Σ_{y ∈ S(ℓ)} Δ(x, y)

when each center c(ℓ) is the centroid of its cluster S(ℓ).

33.1-2

Give an example in the plane with n = 4 points and k = 2 clusters where an iteration of Lloyd’s procedure does not improve f( S, C), yet the k-

clustering is not optimal.

33.1-3

When the input to Lloyd’s procedure contains many repeated points, a

different initialization procedure might be used. Describe a way to pick

a number of centers at random that maximizes the number of distinct

centers picked. ( Hint: See Exercise 5.3-5.)

33.1-4

Show how to find an optimal k-clustering in polynomial time when

there is just one attribute ( d = 1).

Figure 33.2 Using Lloyd’s procedure for vector quantization to compress a photo by using fewer colors. (a) The original photo has 350,000 pixels (700 × 500), each a 24-bit RGB (red/green/blue) triple of 8-bit values; these pixels (colors) are the “points” to be clustered. Points repeat, so there are only 79,083 distinct colors (fewer than 2^24). After compression, only k distinct colors are used, so each pixel is represented by only ⌈lg k⌉ bits instead of 24. A “palette” maps these values back to 24-bit RGB values (the cluster centers). (b)–(e) The same photo with k = 4, 16, 64, and 256 colors. (Photo from standuppaddle, pixabay.com.)

33.2 Multiplicative-weights algorithms

This section considers problems that require you to make a series of

decisions. After each decision you receive feedback as to whether your

decision was correct. We will study a class of algorithms that are called

multiplicative-weights algorithms. This class of algorithms has a wide

variety of applications, including game playing in economics,

approximately solving linear-programming and multicommodity-flow

problems, and various applications in online machine learning. We

emphasize the online nature of the problem here: you have to make a

sequence of decisions, but some of the information needed to make the

i th decision appears only after you have already made the ( i – 1)st decision. In this section, we look at one particular problem, known as

“learning from experts,” and develop an example of a multiplicative-

weights algorithm, called the weighted-majority algorithm.

Suppose that a series of events will occur, and you want to make

predictions about these events. For example, over a series of days, you


want to predict whether it is going to rain. Or perhaps you want to

predict whether the price of a stock will increase or decrease. One way

to approach this problem is to assemble a group of “experts” and use