7 return δ
The running time of COMPUTE-TRANSITION-FUNCTION is
O( m³ |∑|), because the outer loops contribute a factor of m |∑|, the inner while loop can run at most m + 1 times, and the test for whether P[: k] is a suffix of P[: q] a on line 4 can require comparing up to m characters.
Much faster procedures exist. By utilizing some cleverly computed
information about the pattern P (see Exercise 32.4-8), the time required
to compute δ from P improves to O( m |∑|). This improved procedure for computing δ provides a way to find all occurrences of a length- m pattern in a length- n text over an alphabet ∑ with O( m |∑|) preprocessing time and Θ( n) matching time.
Exercises
32.3-1
Draw a state-transition diagram for the string-matching automaton for
the pattern P = aabab over the alphabet ∑ = {a, b} and illustrate its
operation on the text string T = aaababaabaababaab.
32.3-2
Draw a state-transition diagram for the string-matching automaton for
the pattern P = ababbabbababbababbabb over the alphabet ∑ = {a,
b}.
32.3-3
A pattern P is nonoverlappable if P[: k] ⊐ P[: q] implies k = 0 or k = q.
Describe the state-transition diagram of the string-matching automaton
for a nonoverlappable pattern.
32.3-4
Let x and y be prefixes of the pattern P. Prove that x ⊐ y implies σ( x) ≤
σ( y).
32.3-5
Given two patterns P and P′, describe how to construct a finite automaton that determines all occurrences of either pattern. Try to
minimize the number of states in your automaton.
32.3-6
Given a pattern P containing gap characters (see Exercise 32.1-4), show
how to build a finite automaton that can find an occurrence of P in a
text T in O( n) matching time, where n = | T|.
★ 32.4 The Knuth-Morris-Pratt algorithm
Knuth, Morris, and Pratt developed a linear-time string matching
algorithm that avoids computing the transition function δ altogether.
Instead, the KMP algorithm uses an auxiliary function π, which it
precomputes from the pattern in Θ( m) time and stores in an array
π[1: m]. The array π allows the algorithm to compute the transition function δ efficiently (in an amortized sense) “on the fly” as needed.
Loosely speaking, for any state q = 0, 1, …, m and any character a ∈ ∑, the value π[ q] contains the information needed to compute δ( q, a) but that does not depend on a. Since the array π has only m entries, whereas δ has Θ( m |∑|) entries, the KMP algorithm saves a factor of |∑| in the preprocessing time by computing π rather than δ. Like the procedure FINITE-AUTOMATON-MATCHER, once preprocessing has
completed, the KMP algorithm uses Θ( n) matching time.
The prefix function for a pattern
The prefix function π for a pattern encapsulates knowledge about how
the pattern matches against shifts of itself. The KMP algorithm takes
advantage of this information to avoid testing useless shifts in the naive
pattern-matching algorithm and to avoid precomputing the full
transition function δ for a string-matching automaton.
Consider the operation of the naive string matcher. Figure 32.9(a)
shows a particular shift s of a template containing the pattern P =
ababaca against a text T. For this example, q = 5 of the characters
have matched successfully, but the 6th pattern character fails to match
the corresponding text character. The information that q characters
have matched successfully determines the corresponding text characters.
Because these q text characters match, certain shifts must be invalid. In
the example of the figure, the shift s + 1 is necessarily invalid, since the
first pattern character (a) would be aligned with a text character that
does not match the first pattern character, but does match the second
pattern character (b). The shift s′ = s + 2 shown in part (b) of the figure, however, aligns the first three pattern characters with three text
characters that necessarily match.
More generally, suppose that you know that P[: q] ⊐ T[: s + q] or, equivalently, that P[1: q] = T[ s + 1: s + q]. You want to shift P so that some shorter prefix P[: k] of P matches a suffix of T[: s + q], if possible.
You might have more than one choice for how much to shift, however.
In Figure 32.9(b), shifting P by 2 positions works, so that P[:3] ⊐ T[: s +
q], but so does shifting P by 4 positions, so that P[:1] ⊐ T[: s + q] in
Figure 32.9(c). If more than one shift amount works, you should choose the smallest shift amount so that you do not miss any potential matches.
Put more precisely, you want to answer this question:
Given that pattern characters P[1: q] match text characters T[ s +
1: s + q] (that is, P[: q] ⊐ T[: s + q]), what is the least shift s′ > s such that for some k < q,
P[1: k] = T[ s′ + 1: s′ + k]    (32.6)
(that is, P[: k] ⊐ T[: s′ + k]), where s′ + k = s + q?
Here’s another way to look at this question. If you know P[: q] ⊐ T[: s
+ q], then how do you find the longest proper prefix P[: k] of P[: q] that is also a suffix of T[: s + q]? These questions are equivalent because given s and q, requiring s′ + k = s + q means that finding the smallest shift s′ (2
in Figure 32.9(b)) is tantamount to finding the longest prefix length k (3
in Figure 32.9(b)). If you add the difference q – k in the lengths of these prefixes of P to the shift s, you get the new shift s′, so that s′ = s + ( q –
k). In the best case, k = 0, so that s′ = s + q, immediately ruling out
shifts s + 1, s + 2, …, s + q − 1. In any case, at the new shift s′, it is redundant to compare the first k characters of P with the corresponding characters of T, since equation (32.6) guarantees that they match.
As Figure 32.9(d) demonstrates, you can precompute the necessary
information by comparing the pattern against itself. Since T[ s′ + 1: s′ +
k] is part of the matched portion of the text, it is a suffix of the string
P[: q]. Therefore, think of equation (32.6) as asking for the greatest k < q such that P[: k] ⊐ P[: q]. Then, the new shift s′ = s + ( q – k) is the next potentially valid shift. It will be convenient to store, for each value of q,
the number k of matching characters at the new shift s′, rather than storing, say, the amount s′ – s to shift by.
Let’s look at the precomputed information a little more formally. For
a given pattern P[1: m], the prefix function for P is the function π : {1, 2,
…, m} → {0, 1, …, m – 1} such that
π[ q] = max{ k : k < q and P[: k] ⊐ P[: q]}.
That is, π[ q] is the length of the longest prefix of P that is a proper suffix of P[: q]. Here is the complete prefix function π for the pattern ababaca:

  q      1  2  3  4  5  6  7
  P[ q]  a  b  a  b  a  c  a
  π[ q]  0  0  1  2  3  0  1
Figure 32.9 The prefix function π. (a) The pattern P = ababaca aligns with a text T so that the first q = 5 characters match. Matching characters, in blue, are connected by blue lines. (b) Knowing these particular 5 matched characters ( P[:5]) suffices to deduce that a shift of s + 1 is invalid, but that a shift of s′ = s + 2 is consistent with everything known about the text and therefore is potentially valid. The prefix P[: k], where k = 3, aligns with the text seen so far. (c) A shift of s + 4 is also potentially valid, but it leaves only the prefix P[:1] aligned with the text seen so far. (d) To precompute useful information for such deductions, compare the pattern with itself. Here, the longest prefix of P that is also a proper suffix of P[:5] is P[:3]. The array π
represents this precomputed information, so that π[5] = 3. Given that q characters have matched successfully at shift s, the next potentially valid shift is at s′ = s + ( q – π[ q]) as shown in part (b).
The procedure KMP-MATCHER on the following page gives the
Knuth-Morris-Pratt matching algorithm. The procedure follows from
FINITE-AUTOMATON-MATCHER for the most part. To compute
π, KMP-MATCHER calls the auxiliary procedure COMPUTE-
PREFIX-FUNCTION. These two procedures have much in common,
because both match a string against the pattern P: KMP-MATCHER
matches the text T against P, and COMPUTE-PREFIX-FUNCTION
matches P against itself.
Next, let’s analyze the running times of these procedures. Then we’ll
prove them correct, which will be more complicated.
Running-time analysis
The running time of COMPUTE-PREFIX-FUNCTION is Θ( m),
which we show by using the aggregate method of amortized analysis
(see Section 16.1). The only tricky part is showing that the while loop of lines 5–6 executes O( m) times altogether. Starting with some
observations about k, we’ll show that it makes at most m–1 iterations.
First, line 3 starts k at 0, and the only way that k increases is by the increment operation in line 8, which executes at most once per iteration
of the for loop of lines 4–9. Thus, the total increase in k is at most m–1.
Second, since k < q upon entering the for loop and each iteration of the loop increments q, we always have k < q. Therefore, the assignments in lines 2 and 9 ensure that π[ q] < q for all q = 1, 2, …, m, which means that each iteration of the while loop decreases k. Third, k never becomes negative. Putting these facts together, we see that the total decrease in k
from the while loop is bounded from above by the total increase in k
over all iterations of the for loop, which is m – 1. Thus, the while loop
iterates at most m – 1 times in all, and COMPUTE-PREFIX-
FUNCTION runs in Θ( m) time.
KMP-MATCHER( T, P, n, m)
 1  π = COMPUTE-PREFIX-FUNCTION( P, m)
 2  q = 0                              // number of characters matched
 3  for i = 1 to n                     // scan the text from left to right
 4      while q > 0 and P[ q + 1] ≠ T[ i]
 5          q = π[ q]                  // next character does not match
 6      if P[ q + 1] == T[ i]
 7          q = q + 1                  // next character matches
 8      if q == m                      // is all of P matched?
 9          print “Pattern occurs with shift” i – m
10          q = π[ q]                  // look for the next match
COMPUTE-PREFIX-FUNCTION( P, m)
 1  let π[1: m] be a new array
 2  π[1] = 0
 3  k = 0
 4  for q = 2 to m
 5      while k > 0 and P[ k + 1] ≠ P[ q]
 6          k = π[ k]
 7      if P[ k + 1] == P[ q]
 8          k = k + 1
 9      π[ q] = k
10  return π
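The two procedures above translate almost line for line into a working program. The following Python sketch is an illustration, not the book's pseudocode itself: it uses 0-indexed strings and arrays (so pi[q − 1] plays the role of π[ q]), and the function names are my own.

```python
def compute_prefix_function(p):
    """Prefix function of pattern p; pi[q-1] here plays the role of the book's pi[q]."""
    m = len(p)
    pi = [0] * m                          # pi[0] = 0, as in line 2
    k = 0
    for q in range(1, m):                 # q = 2 to m in 1-indexed terms
        while k > 0 and p[k] != p[q]:     # lines 5-6: fall back through shorter prefixes
            k = pi[k - 1]
        if p[k] == p[q]:                  # lines 7-8: extend the match by one character
            k += 1
        pi[q] = k                         # line 9
    return pi

def kmp_matcher(t, p):
    """Return all 0-indexed shifts at which pattern p occurs in text t."""
    pi = compute_prefix_function(p)
    shifts = []
    q = 0                                 # number of characters matched
    for i, c in enumerate(t):             # scan the text from left to right
        while q > 0 and p[q] != c:        # next character does not match
            q = pi[q - 1]
        if p[q] == c:                     # next character matches
            q += 1
        if q == len(p):                   # is all of p matched?
            shifts.append(i - len(p) + 1)
            q = pi[q - 1]                 # look for the next match
    return shifts
```

For the running example P = ababaca, compute_prefix_function returns [0, 0, 1, 2, 3, 0, 1], which agrees with the prefix function for ababaca given earlier.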
Exercise 32.4-4 asks you to show, by a similar aggregate analysis,
that the matching time of KMP-MATCHER is Θ( n).
Figure 32.10 An illustration of Lemma 32.5 for the pattern P = ababaca and q = 5. (a) The π
function for the given pattern. Since π[5] = 3, π[3] = 1, and π[1]= 0, iterating π gives π*[5] = {3, 1, 0}. (b) Sliding the template containing the pattern P to the right and noting when some prefix P[: k] of P matches up with some proper suffix of P[:5]. Matches occur when k = 3, 1, and 0. In the figure, the first row gives P, and the vertical red line is drawn just after P[:5]. Successive rows show all the shifts of P that cause some prefix P[: k] of P to match some suffix of P[:5].
Successfully matched characters are shown in blue. Blue lines connect aligned matching characters. Thus, { k : k < 5 and P[: k] ⊐ P[:5]} = {3, 1, 0}. Lemma 32.5 claims that π*[ q] = { k : k
< q and P[: k] ⊐ P[: q]} for all q.
Compared with FINITE-AUTOMATON-MATCHER, by using π
rather than δ, the KMP algorithm reduces the time for preprocessing
the pattern from O( m |∑|) to Θ( m), while keeping the actual matching time bounded by Θ( n).
Correctness of the prefix-function computation
We’ll see a little later that the prefix function π helps to simulate the transition function δ in a string-matching automaton. But first, we need
to prove that the procedure COMPUTE-PREFIX-FUNCTION does
indeed compute the prefix function correctly. Doing so requires finding
all prefixes P[: k] that are proper suffixes of a given prefix P[: q]. The value of π[ q] gives us the length of the longest such prefix, but the following lemma, illustrated in Figure 32.10, shows that iterating the prefix function π generates all the prefixes P[: k] that are proper suffixes of P[: q]. Let
π*[ q] = { π[ q], π^(2)[ q], π^(3)[ q], …, π^( t)[ q]},
where π^( i)[ q] is defined in terms of functional iteration, so that π^(0)[ q] = q and π^( i)[ q] = π[ π^( i−1)[ q]] for i ≥ 1 (so that π[ q] = π^(1)[ q]), and where the sequence in π*[ q] stops upon reaching π^( t)[ q] = 0 for some t ≥ 1.
Lemma 32.5 (Prefix-function iteration lemma)
Let P be a pattern of length m with prefix function π. Then, for q = 1, 2,
…, m, we have π*[ q] = { k : k < q and P[: k] ⊐ P[: q]}.
Proof We first prove that π*[ q] ⊆ { k : k < q and P[: k] ⊐ P[: q]} or, equivalently, that
i ∈ π*[ q] implies i < q and P[: i] ⊐ P[: q].    (32.7)
If i ∈ π*[ q], then i = π^( u)[ q] for some u > 0. We prove equation (32.7) by induction on u. For u = 1, we have i = π[ q], and the claim follows since i < q and P[: π[ q]] ⊐ P[: q] by the definition of π. Now consider some u ≥ 1 such that both π^( u)[ q] and π^( u+1)[ q] belong to π*[ q]. Let i = π^( u)[ q], so that π[ i] = π^( u+1)[ q]. The inductive hypothesis is that i < q and P[: i] ⊐ P[: q]. Because the relations < and ⊐ are transitive, we have π[ i] < i < q and P[: π[ i]] ⊐ P[: i] ⊐ P[: q], which establishes equation (32.7) for all i in π*[ q]. Therefore, π*[ q] ⊆ { k : k < q and P[: k] ⊐ P[: q]}.
We now prove that { k : k < q and P[: k] ⊐ P[: q]} ⊆ π*[ q] by contradiction. Suppose to the contrary that the set { k : k < q and P[: k]
⊐ P[: q]} – π*[ q] is nonempty, and let j be the largest number in the set.
Because π[ q] is the largest value in { k : k < q and P[: k] ⊐ P[: q]} and π[ q] ∈ π*[ q], it must be the case that j < π[ q]. Having established that π*[ q]
contains at least one integer greater than j, let j′ denote the smallest such integer. (We can choose j′ = π[ q] if no other number in π*[ q] is greater than j.) We have P[: j] ⊐ P[: q] because j ∈ { k : k < q and P[: k] ⊐ P[: q]}, and from j′ ∈ π*[ q] and equation (32.7), we have P[: j′] ⊐ P[: q]. Thus, P[: j] ⊐ P[: j′] by Lemma 32.1, and j is the largest value less than j′ with this property. Therefore, we must have π[ j′] = j and, since j′ ∈ π*[ q], we must have j ∈ π*[ q] as well. This contradiction proves the lemma.
▪
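Lemma 32.5 is easy to check empirically. The following Python sketch (an illustration with 0-indexed arrays and my own helper names, not part of the text) iterates π from a state q and compares the result against a brute-force version of the set on the right-hand side:

```python
def compute_prefix_function(p):
    """Standard prefix-function computation (0-indexed: pi[q-1] is the book's pi[q])."""
    pi, k = [0] * len(p), 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi

def pi_star(pi, q):
    """Iterate pi starting from state q (1-indexed) until reaching 0, as in pi*[q]."""
    result, k = [], pi[q - 1]
    while k > 0:
        result.append(k)
        k = pi[k - 1]
    result.append(0)
    return result

def suffix_prefixes(p, q):
    """Brute force: all k < q such that P[:k] is a suffix of P[:q] (k = 0 always works)."""
    return [k for k in range(q) if p[:q].endswith(p[:k])]

p = "ababaca"
pi = compute_prefix_function(p)
for q in range(1, len(p) + 1):
    assert sorted(pi_star(pi, q)) == suffix_prefixes(p, q)
```

For q = 5 this reproduces Figure 32.10: pi_star(pi, 5) is [3, 1, 0].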
The algorithm COMPUTE-PREFIX-FUNCTION computes π[ q],
in order, for q = 1, 2, …, m. Setting π[1] to 0 in line 2 of COMPUTE-PREFIX-FUNCTION is certainly correct, since π[ q] < q for all q. We’ll use the following lemma and its corollary to prove that COMPUTE-PREFIX-FUNCTION computes π[ q] correctly for q > 1.
Lemma 32.6
Let P be a pattern of length m, and let π be the prefix function for P.
For q = 1, 2, …, m, if π[ q] > 0, then π[ q] – 1 ∈ π*[ q – 1].
Proof Let r = π[ q] > 0, so that r < q and P[: r] ⊐ P[: q], and thus, r – 1 < q – 1 and P[: r – 1] ⊐ P[: q – 1] (by dropping the last character from P[: r]
and P[: q], which we can do because r > 0). By Lemma 32.5, therefore, r
– 1 ∈ π*[ q – 1]. Thus, we have π[ q] – 1 = r – 1 ∈ π*[ q – 1].
▪
For q = 2, 3, …, m, define the subset E_{q–1} ⊆ π*[ q – 1] by
E_{q–1} = { k ∈ π*[ q – 1] : P[ k + 1] = P[ q]}
        = { k : k < q – 1 and P[: k] ⊐ P[: q – 1] and P[ k + 1] = P[ q]}   (by Lemma 32.5)
        = { k : k < q – 1 and P[: k + 1] ⊐ P[: q]}.
The set E_{q–1} consists of the values k < q – 1 for which P[: k] ⊐ P[: q – 1] and for which, because P[ k + 1] = P[ q], we have P[: k + 1] ⊐ P[: q]. Thus, E_{q–1} consists of those values k ∈ π*[ q – 1] such that extending P[: k] to P[: k + 1] produces a proper suffix of P[: q].
Corollary 32.7
Let P be a pattern of length m, and let π be the prefix function for P.
Then, for q = 2, 3, …, m,
π[ q] = 0                         if E_{q–1} = ∅,
π[ q] = 1 + max { k ∈ E_{q–1}}    otherwise.
Proof If E_{q–1} is empty, there is no k ∈ π*[ q – 1] (including k = 0) such that extending P[: k] to P[: k + 1] produces a proper suffix of P[: q]. Therefore, π[ q] = 0.
If, instead, E_{q–1} is nonempty, then for each k ∈ E_{q–1}, we have k + 1 < q and P[: k + 1] ⊐ P[: q]. Therefore, the definition of π[ q] gives
π[ q] ≥ 1 + max { k ∈ E_{q–1}}.    (32.8)
Note that π[ q] > 0. Let r = π[ q] – 1, so that r + 1 = π[ q] > 0, and therefore P[: r + 1] ⊐ P[: q]. If a nonempty string is a suffix of another, then the two strings must have the same last character. Since r + 1 > 0, the prefix P[: r + 1] is nonempty, and so P[ r + 1] = P[ q]. Furthermore, r ∈ π*[ q – 1] by Lemma 32.6. Therefore, r ∈ E_{q–1}, and so π[ q] – 1 = r ≤ max E_{q–1} or, equivalently,
π[ q] ≤ 1 + max { k ∈ E_{q–1}}.    (32.9)
Combining equations (32.8) and (32.9) completes the proof.
▪
We now finish the proof that COMPUTE-PREFIX-FUNCTION
computes π correctly. The key is to combine the definition of Eq–1 with the statement of Corollary 32.7, so that π[ q] equals 1 plus the greatest value of k in π*[ q – 1] such that P[ k + 1] = P[ q]. First, in COMPUTE-PREFIX-FUNCTION, k = π[ q – 1] at the start of each iteration of the for loop of lines 4–9. This condition is enforced by lines 2 and 3 when
the loop is first entered, and it remains true in each successive iteration
because of line 9. Lines 5–8 adjust k so that it becomes the correct value
of π[ q]. The while loop of lines 5–6 searches through all values k ∈ π*[ q – 1] in decreasing order to find the value of π[ q]. The loop terminates either because k reaches 0 or because P[ k + 1] = P[ q]. Because the “and” operator short-circuits, if the loop terminates because P[ k + 1] = P[ q], then k must have also been positive, and so k is the greatest value in E_{q–1}. In this case, lines 7–9 set π[ q] to k + 1, according to Corollary 32.7. If, instead, the while loop terminates because k = 0, then there are two possibilities. If P[1] = P[ q], then E_{q–1} = {0}, and lines 7–9 set both k and π[ q] to 1. If k = 0 and P[1] ≠ P[ q], however, then E_{q–1} = ∅. In this case, line 9 sets π[ q] to 0, again according to Corollary 32.7, which completes the proof of the correctness of COMPUTE-PREFIX-FUNCTION.
Correctness of the Knuth-Morris-Pratt algorithm
You can think of the procedure KMP-MATCHER as a reimplemented
version of the procedure FINITE-AUTOMATON-MATCHER, but
using the prefix function π to compute state transitions. Specifically, we’ll prove that in the i th iteration of the for loops of both KMP-MATCHER and FINITE-AUTOMATON-MATCHER, the state q has
the same value upon testing for equality with m (at line 8 in KMP-
MATCHER and at line 4 in FINITE-AUTOMATON-MATCHER).
Once we have argued that KMP-MATCHER simulates the behavior of
FINITE-AUTOMATON-MATCHER, the correctness of KMP-
MATCHER follows from the correctness of FINITE-AUTOMATON-
MATCHER (though we’ll see a little later why line 10 in KMP-
MATCHER is necessary).
Before formally proving that KMP-MATCHER correctly simulates
FINITE-AUTOMATON-MATCHER, let’s take a moment to
understand how the prefix function π replaces the δ transition function.
Recall that when a string-matching automaton is in state q and it scans
a character a = T[ i], it moves to a new state δ( q, a). If a = P[ q + 1], so that a continues to match the pattern, then the state number is
incremented: δ( q, a) = q + 1. Otherwise, a ≠ P[ q + 1], so that a does not continue to match the pattern, and the state number does not increase: 0
≤ δ( q, a) ≤ q. In the first case, when a continues to match, KMP-MATCHER moves to state q + 1 without referring to the π function:
the while loop test in line 4 immediately comes up false, the test in line 6
comes up true, and line 7 increments q.
The π function comes into play when the character a does not continue to match the pattern, so that the new state δ( q, a) is either q or to the left of q along the spine of the automaton. The while loop of lines
4–5 in KMP-MATCHER iterates through the states in π*[ q], stopping
either when it arrives in a state, say q′, such that a matches P[ q′ + 1] or q′
has gone all the way down to 0. If a matches P[ q′ + 1], then line 7 sets the new state to q′+1, which should equal δ( q, a) for the simulation to work correctly. In other words, the new state δ( q, a) should be either state 0 or a state numbered 1 more than some state in π*[ q].
Let’s look at the example in Figures 32.6 and 32.10, which are for the pattern P = ababaca. Suppose that the automaton is in state q = 5, having matched ababa. The states in π*[5] are, in descending order, 3,
1, and 0. If the next character scanned is c, then you can see that the
automaton moves to state δ(5, c) = 6 in both FINITE-AUTOMATON-
MATCHER (line 3) and KMP-MATCHER (line 7). Now suppose that
the next character scanned is instead b, so that the automaton should
move to state δ(5, b) = 4. The while loop in KMP-MATCHER exits
after executing line 5 once, and the automaton arrives in state q′ = π[5]
= 3. Since P[ q′ + 1] = P[4] = b, the test in line 6 comes up true, and the automaton moves to the new state q′ + 1 = 4 = δ(5, b). Finally, suppose that the next character scanned is instead a, so that the automaton
should move to state δ(5, a) = 1. The first three times that the test in line
4 executes, the test comes up true. The first time finds that P[6] = c ≠ a,
and the automaton moves to state π[5] = 3 (the first state in π*[5]). The second time finds that P[4] = b ≠ a, and the automaton moves to state
π[3] = 1 (the second state in π*[5]). The third time finds that P[2] = b ≠
a, and the automaton moves to state π[1] = 0 (the last state in π*[5]).
The while loop exits once it arrives in state q′ = 0. Now line 6 finds that
P[ q′ + 1] = P[1] = a, and line 7 moves the automaton to the new state q′
+ 1 = 1 = δ(5, a).
Thus, the intuition is that KMP-MATCHER iterates through the
states in π*[ q] in decreasing order, stopping at some state q′ and then possibly moving to state q′+1. Although that might seem like a lot of
work just to simulate computing δ( q, a), bear in mind that asymptotically, KMP-MATCHER is no slower than FINITE-AUTOMATON-MATCHER.
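The walkthrough above can be captured in a few lines. The Python sketch below is my own illustration (the helper name delta_via_pi is hypothetical, arrays are 0-indexed, and π for ababaca is hard-coded from Figure 32.10); it simulates δ( q, a) by iterating π exactly as lines 4–7 of KMP-MATCHER do:

```python
def delta_via_pi(p, pi, q, a):
    """Simulate the automaton transition delta(q, a) using only the pi array.

    q is the current state (0..m); pi[k-1] holds the book's pi[k].
    The q == len(p) check mirrors line 10 of KMP-MATCHER, which resets the
    state after a full match so that P[m + 1] is never referenced.
    """
    while q > 0 and (q == len(p) or p[q] != a):
        q = pi[q - 1]                 # fall back through the states in pi*[q]
    if q < len(p) and p[q] == a:
        q += 1                        # the next pattern character matches a
    return q

p = "ababaca"
pi = [0, 0, 1, 2, 3, 0, 1]            # prefix function for ababaca (Figure 32.10)
```

With the automaton in state 5 (having matched ababa), the three cases from the text come out as expected: delta_via_pi(p, pi, 5, "c") is 6, delta_via_pi(p, pi, 5, "b") is 4, and delta_via_pi(p, pi, 5, "a") is 1.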
We are now ready to formally prove the correctness of the Knuth-
Morris-Pratt algorithm. By Theorem 32.4, we have that q = σ( T[: i]) after each time line 3 of FINITE-AUTOMATON-MATCHER executes.
Therefore, it suffices to show that the same property holds with regard
to the for loop in KMP-MATCHER. The proof proceeds by induction
on the number of loop iterations. Initially, both procedures set q to 0 as
they enter their respective for loops for the first time. Consider iteration
i of the for loop in KMP-MATCHER. By the inductive hypothesis, the
state number q equals σ( T[: i – 1]) at the start of the loop iteration. We need to show that when line 8 is reached, the new value of q is σ( T[: i]).
(Again, we’ll handle line 10 separately.)
Considering q to be the state number at the start of the for loop
iteration, when KMP-MATCHER considers the character T[ i], the
longest prefix of P that is a suffix of T[: i] is either P[: q + 1] (if P[ q + 1] =
T[ i]) or some prefix (not necessarily proper, and possibly empty) of P[: q]. We consider separately the three cases in which σ( T[: i]) = 0, σ( T[: i]) = q + 1, and 0 < σ( T[: i]) ≤ q.
If σ( T[: i]) = 0, then P[:0] = ϵ is the only prefix of P that is a suffix of T[: i]. The while loop of lines 4–5 iterates through each value q′
in π*[ q], but although P[: q′] ⊐ P[: q] ⊐ T[: i – 1] for every q′ ∈ π*[ q]
(because < and ⊐ are transitive relations), the loop never finds a q′
such that P[ q′ + 1] = T[ i]. The loop terminates when q reaches 0, and of course line 7 does not execute. Therefore, q = 0 at line 8, so
that now q = σ( T[: i]).
If σ( T[: i]) = q+1, then P[ q+1] = T[ i], and the while loop test in line 4 fails the first time through. Line 7 executes, incrementing the
state number to q + 1, which equals σ( T[: i]).
If 0 < σ( T[: i]) ≤ q, then the while loop of lines 4–5 iterates at
stops at some q′ < q. Thus, P[: q′] is the longest prefix of P[: q] for which P[ q′ + 1] = T[ i], so that when the while loop terminates, q′ +
1 = σ( P[: q] T[ i]). Since q = σ( T[: i – 1]), Lemma 32.3 implies that σ( T[: i – 1] T[ i]) = σ( P[: q] T[ i]). Thus we have q′ + 1 = σ( P[: q] T[ i])
= σ( T[: i – 1] T[ i])
= σ( T[: i])
when the while loop terminates. After line 7 increments q, the new
state number q equals σ( T[: i]).
Line 10 is necessary in KMP-MATCHER, because otherwise, line 4
might try to reference P[ m + 1] after finding an occurrence of P. (The argument that q = σ( T[: i – 1]) upon the next execution of line 4 remains valid by the hint given in Exercise 32.4-8: that δ( m, a) = δ( π[ m], a) or, equivalently, σ( Pa) = σ( P[: π[ m]] a) for any a ∈ ∑.) The remaining argument for the correctness of the Knuth-Morris-Pratt algorithm
follows from the correctness of FINITE-AUTOMATON-MATCHER,
since we have shown that KMP-MATCHER simulates the behavior of
FINITE-AUTOMATON-MATCHER.
Exercises
32.4-1
Compute the prefix function π for the pattern ababbabbabbababbabb.
32.4-2
Give an upper bound on the size of π*[ q] as a function of q. Give an example to show that your bound is tight.
32.4-3
Explain how to determine the occurrences of pattern P in the text T by
examining the π function for the string PT (the string of length m+ n that is the concatenation of P and T).
32.4-4
Use an aggregate analysis to show that the running time of KMP-
MATCHER is Θ( n).
32.4-5
Use a potential function to show that the running time of KMP-
MATCHER is Θ( n).
32.4-6
Show how to improve KMP-MATCHER by replacing the occurrence
of π in line 5 (but not line 10) by π′, where π′ is defined recursively for q = 1, 2, …, m – 1 by the equation
π′[ q] = 0             if π[ q] = 0,
π′[ q] = π′[ π[ q]]    if π[ q] ≠ 0 and P[ π[ q] + 1] = P[ q + 1],
π′[ q] = π[ q]         if π[ q] ≠ 0 and P[ π[ q] + 1] ≠ P[ q + 1].
Explain why the modified algorithm is correct, and explain in what
sense this change constitutes an improvement.
32.4-7
Give a linear-time algorithm to determine whether a text T is a cyclic
rotation of another string T′. For example, braze and zebra are cyclic
rotations of each other.
★ 32.4-8
Give an O( m |∑|)-time algorithm for computing the transition function δ
for the string-matching automaton corresponding to a given pattern P.
( Hint: Prove that δ( q, a) = δ( π[ q], a) if q = m or P[ q + 1] ≠ a.)
32.5 Suffix arrays
The algorithms we have seen thus far in this chapter can efficiently find
all occurrences of a pattern in a text. That is, however, all they can do.
This section presents a different approach—suffix arrays—with which
you can find all occurrences of a pattern in a text, but also quite a bit
more. A suffix array won’t find all occurrences of a pattern as quickly as,
say, the Knuth-Morris-Pratt algorithm, but its additional flexibility
makes it well worth studying.
Figure 32.11 The suffix array SA, rank array rank, longest common prefix array LCP, and lexicographically sorted suffixes of the text T = ratatat with length n = 7. The value of rank[ i]
indicates the position of the suffix T[ i:] in the lexicographically sorted order: rank[ SA[ i]] = i for i
= 1, 2, …, n. The rank array is used to compute the LCP array.
A suffix array is simply a compact way to represent the
lexicographically sorted order of all n suffixes of a length- n text. Given a text T[1: n], let T[ i:] denote the suffix T[ i: n]. The suffix array SA[1: n] of T
is defined such that if SA[ i] = j, then T[ j:] is the i th suffix of T in lexicographic order. 3 That is, the i th suffix of T in lexicographic order is T[ SA[ i]:]. Along with the suffix array, another useful array is the longest common prefix array LCP[1: n]. The entry LCP[ i] gives the length of the longest common prefix between the i th and ( i – 1)st suffixes in the sorted order (with LCP[1] defined to be 0, since there is no suffix preceding T[ SA[1]:] in the sorted order). Figure 32.11 shows the suffix array and longest common prefix array for the 7-character text
ratatat.
Given the suffix array for a text, you can search for a pattern via
binary search on the suffix array. Each occurrence of a pattern in the
text starts some suffix of the text, and because the suffix array is in
lexicographically sorted order, all occurrences of a pattern will appear at
the start of consecutive entries of the suffix array. For example, in
Figure 32.11, the three occurrences of at in ratatat appear in entries 1 through 3 of the suffix array. If you find the length- m pattern in
the length- n suffix array via binary search (taking O( m lg n) time because each comparison takes O( m) time), then you can find all occurrences of the pattern in the text by searching backward and
forward from that spot until you find a suffix that does not start with
the pattern (or you go beyond the bounds of the suffix array). If the
pattern occurs k times, then the time to find all k occurrences is O( m lg n + km).
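The binary search just described can be sketched as follows in Python. This is my own illustration of the idea (0-indexed positions, with the suffix array of ratatat from Figure 32.11 hard-coded in 0-indexed form), not library code:

```python
def find_occurrences(t, sa, p):
    """All 0-indexed positions of p in t, via binary search on the suffix array sa.

    Each comparison slices at most len(p) characters, so the two binary
    searches cost O(m lg n); collecting the k consecutive hits adds O(km).
    """
    m = len(p)

    def prefix(i):                 # first m characters of the i-th smallest suffix
        return t[sa[i]:sa[i] + m]

    lo, hi = 0, len(sa)            # find the leftmost suffix whose m-prefix is >= p
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)                   # find the leftmost suffix whose m-prefix is > p
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])    # the consecutive entries sa[first:lo] all match

t = "ratatat"
sa = [5, 3, 1, 0, 6, 4, 2]         # suffix array of ratatat, 0-indexed
```

Here find_occurrences(t, sa, "at") returns [1, 3, 5], the 0-indexed counterparts of positions 2, 4, and 6 in the text's 1-indexed convention.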
With the longest common prefix array, you can find a longest
repeated substring, that is, the longest substring that occurs more than
once in the text. If LCP[ i] contains a maximum value in the LCP array, then a longest repeated substring appears in T[ SA[ i]: SA[ i] + LCP[ i] – 1].
In the example of Figure 32.11, the LCP array has one maximum value: LCP[3] = 4. Therefore, since SA[3] = 2, the longest repeated substring is T[2:5] = atat. Exercise 32.5-3 asks you to use the suffix array and longest common prefix array to find the longest common substrings
between two texts. Next, we’ll see how to compute the suffix array for an
n-character text in O( n lg n) time and, given the suffix array and the text, how to compute the longest common prefix array in Θ( n) time.
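As a quick illustration of this use of the LCP array, the following Python sketch (my own, 0-indexed) builds the suffix array and LCP array by brute force, which is fine for a tiny example even though it is far slower than the constructions described next, and extracts a longest repeated substring:

```python
def longest_repeated_substring(t):
    """Longest substring occurring at least twice in t (empty string if none).

    Builds SA and LCP naively: the sort alone costs more than the efficient
    constructions, so this is for illustration only.
    """
    n = len(t)
    if n == 0:
        return ""
    sa = sorted(range(n), key=lambda j: t[j:])      # naive suffix array

    def common_prefix_len(a, b):
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        return k

    # lcp[i] = longest common prefix of the i-th and (i-1)-st sorted suffixes
    lcp = [0] + [common_prefix_len(t[sa[i - 1]:], t[sa[i]:]) for i in range(1, n)]
    i = max(range(n), key=lambda j: lcp[j])         # index of a maximum LCP entry
    return t[sa[i]:sa[i] + lcp[i]]
```

For the running example, longest_repeated_substring("ratatat") returns atat, matching the text.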
Computing the suffix array
There are several algorithms to compute the suffix array of a length- n
text. Some run in linear time, but are rather complicated. One such
algorithm is given in Problem 32-2. Here we’ll explore a simpler
algorithm that runs in Θ( n lg n) time.
The idea behind the O( n lg n)-time procedure COMPUTE-SUFFIX-ARRAY on the following page is to lexicographically sort substrings of the text with increasing lengths. The procedure makes several passes over the text, with the substring length doubling each time. By the ⌈lg n⌉th pass, the procedure is sorting all the suffixes, thereby gaining the information needed to construct the suffix array. The key to attaining an O( n lg n)-time algorithm will be to have each pass after the first sort in linear time, which will indeed be possible by
using radix sort.
Let’s start with a simple observation. Consider any two strings, s1 and s2. Decompose s1 into s1^L and s1^R, so that s1 is s1^L concatenated with s1^R. Likewise, let s2 be s2^L concatenated with s2^R, where |s1^L| = |s2^L|. Now, suppose that s1^L is lexicographically smaller than s2^L. Then, regardless of s1^R and s2^R, it must be the case that s1 is lexicographically smaller than s2. For example, let s1 = aaz and s2 = aba, and decompose s1 into s1^L = aa and s1^R = z and s2 into s2^L = ab and s2^R = a. Because s1^L = aa is lexicographically smaller than s2^L = ab, it follows that s1 is lexicographically smaller than s2, even though s2^R = a is lexicographically smaller than s1^R = z.
Instead of comparing substrings directly, COMPUTE-SUFFIX-
ARRAY represents substrings of the text with integer ranks. Ranks
have the simple property that one substring is lexicographically smaller
then another if and only if it has a smaller rank. Identical substrings
have equal ranks.
Where do these ranks come from? Initially, the substrings being
considered are just single characters from the text. Assume that, as in
many programming languages, there is a function, ord, that maps a
character to its underlying encoding, which is a positive integer. The ord
function could be the ASCII or Unicode encodings or any other
function that produces a relative ordering of the characters. For
example if all the characters are known to be lowercase letters, then
ord(a) = 1, ord(b) = 2, …, ord(z) = 26 would work. Once the substrings
being considered contain multiple characters, their ranks will be positive
integers less than or equal to n, coming from their relative order after
being sorted. An empty substring always has rank 0, since it is
lexicographically less than any nonempty substring.
COMPUTE-SUFFIX-ARRAY( T, n)
 1  allocate arrays substr-rank[1: n], rank[1: n], and SA[1: n]
 2  for i = 1 to n
 3      substr-rank[ i]. left-rank = ord( T[ i])
 4      if i < n
 5          substr-rank[ i]. right-rank = ord( T[ i + 1])
 6      else substr-rank[ i]. right-rank = 0
 7      substr-rank[ i]. index = i
 8  sort the array substr-rank into monotonically increasing order based on the left-rank attributes, using the right-rank attributes to break ties; if still a tie, the order does not matter
 9  l = 2
10  while l < n
11      MAKE-RANKS( substr-rank, rank, n)
12      for i = 1 to n
13          substr-rank[ i]. left-rank = rank[ i]
14          if i + l ≤ n
15              substr-rank[ i]. right-rank = rank[ i + l]
16          else substr-rank[ i]. right-rank = 0
17          substr-rank[ i]. index = i
18      sort the array substr-rank into monotonically increasing order based on the left-rank attributes, using the right-rank attributes to break ties; if still a tie, the order does not matter
19      l = 2 l
20  for i = 1 to n
21      SA[ i] = substr-rank[ i]. index
22  return SA
MAKE-RANKS(substr-rank, rank, n)
1  r = 1
2  rank[substr-rank[1].index] = r
3  for i = 2 to n
4      if substr-rank[i].left-rank ≠ substr-rank[i – 1].left-rank or
           substr-rank[i].right-rank ≠ substr-rank[i – 1].right-rank
5          r = r + 1
6      rank[substr-rank[i].index] = r
Figure 32.12 The substr-rank array for indices i = 1, 2, …, 7 after the for loop of lines 2–7 and after the sorting step in line 8 for input string T = ratatat.
The COMPUTE-SUFFIX-ARRAY procedure uses objects
internally to keep track of the relative ordering of the substrings
according to their ranks. When considering substrings of a given length,
the procedure creates and sorts an array substr-rank[1: n] of n objects, each with the following attributes:
left-rank contains the rank of the left part of the substring.
right-rank contains the rank of the right part of the substring.
index contains the index into the text T of where the substring starts.
Before delving into the details of how the procedure works, let’s look
at how it operates on the input text ratatat, with n = 7. Assuming
that the ord function returns the ASCII code for a character, Figure
32.12 shows the substr-rank array after the for loop of lines 2–7 and
then after the sorting step in line 8. The left-rank and right-rank values after lines 2–7 are the ranks of length-1 substrings in positions i and i +
1, for i = 1, 2, …, n. These initial ranks are the ASCII values of the characters. At this point, the left-rank and right-rank values give the ranks of the left and right part of each substring of length 2. Because
the substring starting at index 7 consists of only one character, its right
part is empty and so its right-rank is 0. After the sorting step in line 8,
the substr-rank array gives the relative lexicographic order of all the substrings of length 2, with starting points of these substrings in the
index attribute. For example, the lexicographically smallest length-2
substring is at, which starts at position substr-rank[1]. index, which
equals 2. This substring also occurs at positions substr-rank[2]. index = 4
and substr-rank[3]. index = 6.
The procedure then enters the while loop of lines 10–19. The loop
variable l gives an upper bound on the length of substrings that have been sorted thus far. Entering the while loop, therefore, the substrings of
length at most l = 2 are sorted. The call of MAKE-RANKS in line 11
gives each of these substrings its rank in the sorted order, from 1 up to
the number of unique length-2 substrings, based on the values it finds in
the substr-rank array. With l = 2, MAKE-RANKS sets rank[ i] to be the rank of the length-2 substring T[ i: i + 1]. Figure 32.13 shows these new ranks, which are not necessarily unique. For example, since the length-2
substring at occurs at positions 2, 4, and 6, MAKE-RANKS finds that
substr-rank[1], substr-rank[2], and substr-rank[3] have equal values in left-rank and in right-rank. Since substr-rank[1]. index = 2, substr-rank[2]. index = 4, and substr-rank[3]. index = 6, and since at is the smallest substring in lexicographic order, MAKE-RANKS sets rank[2]
= rank[4] = rank[6] = 1.
Figure 32.13 The rank array after line 11 and the substr-rank array after lines 12–17 and after line 18 in the first iteration of the while loop of lines 10–19, where l = 2.
This iteration of the while loop will sort the substrings of length at
most 4 based on the ranks from sorting the substrings of length at most
2. The for loop of lines 12–17 reconstitutes the substr-rank array, with
substr-rank[ i]. left-rank based on rank[ i] (the rank of the length-2
substring T[ i: i+1]) and substr-rank[ i]. right-rank based on rank[ i + 2] (the rank of the length-2 substring T[ i + 2: i + 3], which is 0 if this substring starts beyond the end of the length- n text). Together, these two ranks give the relative rank of the length-4 substring T[ i: i + 3]. Figure 32.13
shows the effect of lines 12–17. The figure also shows the result of
sorting the substr-rank array in line 18, based on the left-rank attribute, and using the right-rank attribute to break ties. Now substr-rank gives the lexicographically sorted order of all substrings with length at most
4. The next iteration of the while loop, with l = 4, sorts the substrings of
length at most 8 based on the ranks from sorting the substrings of
length at most 4. Figure 32.14 shows the ranks of the length-4
substrings and the substr-rank array before and after sorting. This
iteration is the final one, since with the length n of the text equaling 7,
the procedure has sorted all substrings.
Figure 32.14 The rank array after line 11 and the substr-rank array after lines 12–17 and after line 18 in the second—and final—iteration of the while loop of lines 10–19, where l = 4.
In general, as the loop variable l increases, more and more of the
right parts of the substrings are empty. Therefore, more of the right-rank
values are 0. Because i is at most n within the loop of lines 12–17, the left part of each substring is always nonempty, and so all left-rank
values are always positive.
This example illuminates why the COMPUTE-SUFFIX-ARRAY
procedure works. The initial ranks established in lines 2–7 are simply the
ord values of the characters in the text, and so when line 8 sorts the
substr-rank array, its ordering corresponds to the lexicographic ordering
of the length-2 substrings. Each iteration of the while loop of lines 10–19
takes sorted substrings of length l and produces sorted substrings of length 2 l. Once l reaches or exceeds n, all substrings have been sorted.
Within an iteration of the while loop, the MAKE-RANKS
procedure “re-ranks” the substrings that were sorted, either by line 8
before the first iteration or by line 18 in the previous iteration. MAKE-RANKS takes a substr-rank array, which has been sorted, and fills in an
array rank[1: n] so that rank[ i] is the rank of the i th substring represented in the substr-rank array. Each rank is a positive integer, starting from 1,
and going up to the number of unique substrings of length 2 l.
Substrings with equal values of left-rank and right-rank receive the same rank. Otherwise, a substring that is lexicographically smaller than
another appears earlier in the substr-rank array, and it receives a smaller
rank. Once the substrings of length 2 l are re-ranked, line 18 sorts them
by rank, preparing for the next iteration of the while loop.
Once l reaches or exceeds n and all substrings are sorted, the values
in the index attributes give the starting positions of the sorted
substrings. These indices are precisely the values that constitute the
suffix array.
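To make the procedure concrete, here is one possible transcription into Python. The tuple representation for the substr-rank objects and the use of Python's built-in sort in place of lines 8 and 18 are implementation choices, not part of the pseudocode; tuples compare on left rank first and right rank second, which matches the tie-breaking rule. Positions are 1-indexed, as in the pseudocode.

```python
def make_ranks(substr_rank, n):
    """Assign rank 1, 2, ... to the sorted substrings; entries with equal
    (left_rank, right_rank) pairs receive the same rank."""
    rank = [0] * (n + 1)          # rank[0] is unused; rank is indexed by text position
    r = 1
    rank[substr_rank[0][2]] = r
    for i in range(1, n):
        if substr_rank[i][:2] != substr_rank[i - 1][:2]:
            r += 1
        rank[substr_rank[i][2]] = r
    return rank

def compute_suffix_array(T):
    """Prefix-doubling construction of the suffix array of T, returning
    the 1-indexed starting positions of the sorted suffixes."""
    n = len(T)
    # Lines 2-8: initial ranks are character codes; the right part is the
    # next character, or 0 (empty) at the end of the text.
    substr_rank = sorted((ord(T[i - 1]),
                          ord(T[i]) if i < n else 0,
                          i)
                         for i in range(1, n + 1))
    l = 2
    while l < n:                  # lines 10-19: double the sorted length
        rank = make_ranks(substr_rank, n)
        substr_rank = sorted((rank[i],
                              rank[i + l] if i + l <= n else 0,
                              i)
                             for i in range(1, n + 1))
        l *= 2
    return [index for (_, _, index) in substr_rank]
```

For the text ratatat this returns [6, 4, 2, 1, 7, 5, 3], consistent with the worked example (for instance, SA[2] = 4 and the suffix tat appears in position 6 of the sorted order).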
Let’s analyze the running time of COMPUTE-SUFFIX-ARRAY.
Lines 1–7 take Θ(n) time. Line 8 takes O(n lg n) time, using either merge sort (see Section 2.3.1) or heapsort (see Chapter 6). Because the value of l doubles in each iteration of the while loop of lines 10–19, this loop makes ⌈lg n⌉ – 1 iterations. Within each iteration, the call of MAKE-RANKS takes Θ(n) time, as does the for loop of lines 12–17. Line 18,
like line 8, takes O(n lg n) time, using either merge sort or heapsort.
Finally, the for loop of lines 20–21 takes Θ(n) time. The total time works
out to O(n lg² n).
A simple observation allows us to reduce the running time to Θ(n lg n). The values of left-rank and right-rank being sorted in line 18 are always integers in the range 0 to n. Therefore, radix sort can sort the substr-rank array in Θ(n) time by first running counting sort (see
Chapter 8) based on right-rank and then running counting sort based on left-rank. Now each iteration of the while loop of lines 10–19 takes
only Θ(n) time, giving a total time of Θ(n lg n).
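The two-pass idea can be sketched as follows, under the assumption that each substr-rank entry is a (left-rank, right-rank, index) tuple with both ranks in the range 0 to n (the representation here is an illustrative choice, not part of the pseudocode):

```python
def radix_sort_pairs(substr_rank, n):
    """Sort (left_rank, right_rank, index) tuples, with both ranks
    integers in the range 0 to n, by two stable counting-sort passes:
    first on right_rank, then on left_rank. Each pass takes Theta(n)."""
    def counting_sort(items, key):
        count = [0] * (n + 1)
        for item in items:                # count occurrences of each key
            count[key(item)] += 1
        total = 0
        for r in range(n + 1):            # exclusive prefix sums: first slot per key
            count[r], total = total, total + count[r]
        out = [None] * len(items)
        for item in items:                # scanning in input order keeps it stable
            out[count[key(item)]] = item
            count[key(item)] += 1
        return out
    by_right = counting_sort(substr_rank, key=lambda t: t[1])
    return counting_sort(by_right, key=lambda t: t[0])
```

Because counting sort is stable, sorting by right-rank first and then by left-rank leaves the tuples ordered by left-rank with ties broken by right-rank, exactly as line 18 requires.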
Exercise 32.5-2 asks you to make a simple modification to
COMPUTE-SUFFIX-ARRAY that allows the while loop of lines 10–
19 to iterate fewer than ⌈lg n⌉ – 1 times for certain inputs.
Computing the LCP array
Recall that LCP[ i] is defined as the length of the longest common prefix of the ( i – 1)st and i th lexicographically smallest suffixes T[ SA[ i – 1]:]
and T[ SA[ i]:]. Because T[ SA[1]:] is the lexicographically smallest suffix, we define LCP[1] to be 0.
In order to compute the LCP array, we need an array rank that is the
inverse of the SA array, just like the final rank array in COMPUTE-SUFFIX-ARRAY: if SA[ i] = j, then rank[ j] = i. That is, we have rank[ SA[ i]] = i for i = 1, 2, …, n. For a suffix T[ i:], the value of rank[ i]
gives the position of this suffix in the lexicographically sorted order.
Figure 32.11 includes the rank array for the ratatat example. For example, the suffix tat is T[5:]. To find this suffix’s position in the sorted order, look up rank[5] = 6.
To compute the LCP array, we will need to determine where in the
lexicographically sorted order a suffix appears, but with its first
character removed. The rank array helps. Consider the i th smallest suffix, which is T[ SA[ i]:]. Dropping its first character gives the suffix T[ SA[ i] + 1:], that is, the suffix starting at position SA[ i] + 1 in the text.
The location of this suffix in the sorted order is given by rank[ SA[ i] + 1].
For example, for the suffix atat, let’s see where to find tat (atat with
its first character removed) in the lexicographically sorted order. The
suffix atat appears in position 2 of the suffix array, and SA[2] = 4.
Thus, rank[ SA[2] + 1] = rank[5] = 6, and sure enough the suffix tat appears in location 6 in the sorted order.
The procedure COMPUTE-LCP on the next page produces the LCP
array. The following lemma helps show that the procedure is correct.
COMPUTE-LCP(T, SA, n)
 1  allocate arrays rank[1:n] and LCP[1:n]
 2  for i = 1 to n
 3      rank[SA[i]] = i          // by definition
 4  LCP[1] = 0                   // also by definition
 5  l = 0                        // initialize length of LCP
 6  for i = 1 to n
 7      if rank[i] > 1
 8          j = SA[rank[i] – 1]  // T[j:] precedes T[i:] lexicographically
 9          m = max {i, j}
10          while m + l ≤ n and T[i + l] == T[j + l]
11              l = l + 1        // next character is in common prefix
12          LCP[rank[i]] = l     // length of LCP of T[j:] and T[i:]
13          if l > 0
14              l = l – 1        // peel off first character of common prefix
15  return LCP
Lemma 32.8
Consider suffixes T[ i – 1:] and T[ i:], which appear at positions rank[ i – 1]
and rank[ i], respectively, in the lexicographically sorted order of suffixes.
If LCP[ rank[ i – 1]] = l > 1, then the suffix T[ i:], which is T[ i – 1:] with its first character removed, has LCP[ rank[ i]] ≥ l – 1.
Proof The suffix T[ i – 1:] appears at position rank[ i – 1] in the lexicographically sorted order. The suffix immediately preceding it in the
sorted order appears at position rank[ i – 1] – 1 and is T[ SA[ rank[ i – 1] –
1]:]. By assumption and the definition of the LCP array, these two
suffixes, T[ SA[ rank[ i–1]–1]:] and T[ i–1:], have a longest common prefix of length l > 1. Removing the first character from each of these suffixes
gives the suffixes T[ SA[ rank[ i – 1] – 1] + 1:] and T[ i:], respectively. These suffixes have a longest common prefix of length l – 1. If T[ SA[ rank[ i – 1]
– 1] + 1:] immediately precedes T[ i:] in the lexicographically sorted order (that is, if rank[ SA[ rank[ i – 1] – 1] + 1] = rank[ i] – 1), then the lemma is proven.
So now assume that T[ SA[ rank[ i – 1] – 1] + 1:] does not immediately precede T[ i:] in the sorted order. Since T[ SA[ rank[ i – 1] – 1]:]
immediately precedes T[ i–1:] and they have the same first l > 1
characters, T[ SA[ rank[ i – 1] – 1] + 1:] must appear in the sorted order somewhere before T[ i:], with one or more other suffixes intervening.
Each of these suffixes must start with the same l – 1 characters as T[ SA[ rank[ i – 1] – 1] + 1:] and T[ i:], for otherwise it would appear either before T[ SA[ rank[ i – 1] – 1] + 1:] or after T[ i:]. Therefore, whichever suffix appears in position rank[ i] – 1, immediately before T[ i:], has at
least its first l – 1 characters in common with T[ i:]. Thus, LCP[ rank[ i]] ≥
l – 1.
▪
The COMPUTE-LCP procedure works as follows. After allocating
the rank and LCP arrays in line 1, lines 2–3 fill in the rank array and line 4 pegs LCP[1] to 0, per the definition of the LCP array.
The for loop of lines 6–14 fills in the rest of the LCP array, processing
suffixes in order of decreasing length. That is, it fills the positions of the LCP array
in the order rank[1], rank[2], rank[3], …, rank[ n], with the assignment occurring in line 12. Upon considering a suffix T[ i:], line 8 determines the suffix T[ j:] that immediately precedes T[ i:] in the lexicographically sorted order. At this point, the longest common prefix of T[ j:] and T[ i:]
has length at least l. This property certainly holds upon the first
iteration of the for loop, when l = 0. Assuming that line 12 sets
LCP[ rank[ i]] correctly, line 14 (which decrements l if it is positive) and Lemma 32.8 maintain this property for the next iteration. The longest
common prefix of T[ j:] and T[ i:] might be even longer than the value of l at the start of the iteration, however. Lines 9–11 increment l for each additional character the prefixes have in common so that it achieves the
length of the longest common prefix. The index m is set in line 9 and
used in the test in line 10 to make sure that the test T[ i + l] == T[ j + l]
for extending the longest common prefix does not run off the end of the
text T. When the while loop of lines 10–11 terminates, l is the length of the longest common prefix of T[ j:] and T[ i:].
As a simple aggregate analysis shows, the COMPUTE-LCP
procedure runs in Θ( n) time. Each of the two for loops iterates n times, and so it remains only to bound the total number of iterations by the
while loop of lines 10–11. Each iteration increases l by 1, and the test m
+ l ≤ n ensures that l is always less than n. Because l has an initial value of 0 and decreases at most n – 1 times in line 14, line 11 increments l
fewer than 2 n times. Thus, COMPUTE-LCP takes Θ( n) time.
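As one concrete rendering, COMPUTE-LCP might be transcribed into Python as follows, with the text 0-indexed internally and SA holding 1-indexed suffix starting positions (the same convention as the pseudocode):

```python
def compute_lcp(T, SA):
    """Compute the LCP array from text T and its suffix array SA
    (a list of 1-indexed positions), following COMPUTE-LCP line by line."""
    n = len(T)
    rank = [0] * (n + 1)
    for i in range(1, n + 1):
        rank[SA[i - 1]] = i            # rank is the inverse of SA
    LCP = [0] * (n + 1)                # LCP[1] = 0 by definition
    l = 0
    for i in range(1, n + 1):          # suffixes in order of decreasing length
        if rank[i] > 1:
            j = SA[rank[i] - 2]        # suffix preceding T[i:] in sorted order
            m = max(i, j)
            while m + l <= n and T[i + l - 1] == T[j + l - 1]:
                l += 1                 # extend the common prefix
            LCP[rank[i]] = l
            if l > 0:
                l -= 1                 # peel off the first character
    return LCP[1:]                     # LCP[1..n]
```

For the ratatat example, with SA = [6, 4, 2, 1, 7, 5, 3], this returns the LCP array [0, 2, 4, 0, 0, 1, 3]: for instance, the adjacent sorted suffixes atat and atatat share a common prefix of length 4.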
Exercises
32.5-1
Show the substr-rank and rank arrays before each iteration of the while loop of lines 10–19 and after the last iteration of the while loop, the
suffix array SA returned, and the sorted suffixes when COMPUTE-
SUFFIX-ARRAY is run on the text hippityhoppity. Use the
position of each letter in the alphabet as its ord value, so that ord(b) =
2. Then show the LCP array after each iteration of the for loop of lines
6–14 of COMPUTE-LCP given the text hippityhoppity and its
suffix array.
32.5-2
For some inputs, the COMPUTE-SUFFIX-ARRAY procedure can
produce the correct result with fewer than ⌈lg n⌉ – 1 iterations of the
while loop of lines 10–19. Modify COMPUTE-SUFFIX-ARRAY (and,
if necessary, MAKE-RANKS) so that the procedure can stop before
making all ⌈lg n⌉ – 1 iterations in some cases. Describe an input that
allows the procedure to make O(1) iterations. Describe an input that
forces the procedure to make the maximum number of iterations.
32.5-3
Given two texts, T1 of length n1 and T2 of length n2, show how to use the suffix array and longest common prefix array to find all of the longest common substrings, that is, the longest substrings that appear in both T1 and T2. Your algorithm should run in O(n lg n + kl) time, where n = n1 + n2 and there are k such longest substrings, each with length l.
32.5-4
Professor Markram proposes the following method to find the longest
palindromes in a string T[1: n] by using its suffix array and LCP array.
(Recall from Problem 14-2 that a palindrome is a nonempty string that
reads the same forward and backward.)
Let @ be a character that does not appear in T. Construct the
text T′ as the concatenation of T, @, and the reverse of T.
Denote the length of T′ by n′ = 2 n + 1. Create the suffix array SA and LCP array LCP for T′. Since the indices for a
palindrome and its reverse appear in consecutive positions in
the suffix array, find the entries with the maximum LCP value
LCP[ i] such that SA[ i – 1] = n′ – SA[ i] – LCP[ i] + 2. (This constraint prevents a substring—and its reverse—from being
construed as a palindrome unless it really is one.) For each such
index i, one of the longest palindromes is T′[ SA[ i]: SA[ i] +
LCP[ i] – 1].
For example, if the text T is unreferenced, with n = 12, then the
text T′ is unreferenced@decnerefernu, with n′ = 25 and the
following suffix array and LCP array:
The maximum LCP value is achieved at LCP[21] = 5, and SA[20] = 3 =
n′ – SA[21] – LCP[21] + 2. The suffixes of T′ starting at indices SA[20]
and SA[21] are referenced@decnerefernu and refernu, both of
which start with the length-5 palindrome refer.
Alas, this method is not foolproof. Give an input string T that causes
this method to give results that are shorter than the longest palindrome
contained within T, and explain why your input causes the method to
fail.
Problems
32-1 String matching based on repetition factors
Let y^i denote the concatenation of string y with itself i times. For example, (ab)^3 = ababab. We say that a string x ∈ ∑* has repetition
factor r if x = y^r for some string y ∈ ∑* and some r > 0. Let ρ(x) denote the largest r such that x has repetition factor r.
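A direct (quadratic, not the efficient algorithm part (a) asks for) way to compute ρ(x) is to try each candidate period length in increasing order:

```python
def repetition_factor(x):
    """Return rho(x): the largest r such that x = y^r for some string y.
    The smallest period length d that tiles x exactly yields the largest
    repetition factor r = len(x) / d."""
    n = len(x)
    for d in range(1, n + 1):
        if n % d == 0 and x[:d] * (n // d) == x:
            return n // d              # d = n always works, so this returns
```

For example, ρ(ababab) = 3 and ρ(abc) = 1.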
a. Give an efficient algorithm that takes as input a pattern P[1: m] and computes the value ρ( P[: i]) for i = 1, 2, …, m. What is the running time of your algorithm?
b. For any pattern P[1: m], let ρ*( P) be defined as max { ρ( P[: i]) : 1 ≤ i ≤
m}. Prove that if the pattern P is chosen randomly from the set of all
binary strings of length m, then the expected value of ρ*( P) is O(1).
c. Argue that the procedure REPETITION-MATCHER correctly finds
all occurrences of pattern P[1: m] in text T[1: n] in O( ρ*( P) n + m) time.
(This algorithm is due to Galil and Seiferas. By extending these ideas
greatly, they obtained a linear-time string-matching algorithm that
uses only O(1) storage beyond what is required for P and T.)
REPETITION-MATCHER(T, P, n, m)
 1  k = 1 + ρ*(P)
 2  q = 0
 3  s = 0
 4  while s ≤ n – m
 5      if T[s + q + 1] == P[q + 1]
 6          q = q + 1
 7          if q == m
 8              print “Pattern occurs with shift” s
 9      if q == m or T[s + q + 1] ≠ P[q + 1]
10          s = s + max {1, ⌈q/k⌉}
11          q = 0
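Under the assumption that ρ*(P) has already been computed, REPETITION-MATCHER translates to Python as follows (0-indexed, collecting the shifts in a list instead of printing them):

```python
import math

def repetition_matcher(T, P, rho_star):
    """Find all shifts at which pattern P occurs in text T, following
    REPETITION-MATCHER; rho_star is the precomputed value rho*(P)."""
    n, m = len(T), len(P)
    k = 1 + rho_star
    q = 0                              # number of pattern characters matched
    s = 0                              # current shift
    shifts = []
    while s <= n - m:
        if T[s + q] == P[q]:
            q += 1
            if q == m:
                shifts.append(s)       # pattern occurs with shift s
        if q == m or T[s + q] != P[q]: # short-circuit keeps the index in range
            s += max(1, math.ceil(q / k))
            q = 0
    return shifts
```

For instance, with T = abcabcabc and P = abc (so that ρ*(P) = 1 and k = 2), the procedure reports shifts 0, 3, and 6.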
32-2 A linear-time suffix-array algorithm
In this problem, you will develop and analyze a linear-time divide-and-
conquer algorithm to compute the suffix array of a text T[1: n]. As in
Section 32.5, assume that each character in the text is represented by an underlying encoding, which is a positive integer.
The idea behind the linear-time algorithm is to compute the suffix
array for the suffixes starting at 2/3 of the positions in the text, recursing
as needed, use the resulting information to sort the suffixes starting at
the remaining 1/3 of the positions, and then merge the sorted
information in linear time to produce the full suffix array.
For i = 1, 2, …, n, if i mod 3 equals 1 or 2, then i is a sample position, and the suffixes starting at such positions are sample suffixes. Positions
3, 6, 9, … are nonsample positions, and the suffixes starting at nonsample positions are nonsample suffixes.
The algorithm sorts the sample suffixes, sorts the nonsample suffixes
(aided by the result of sorting the sample suffixes), and merges the
sorted sample and nonsample suffixes. Using the example text T =
bippityboppityboo, here is the algorithm in detail, listing substeps
of each of the above steps:
1. The sample suffixes comprise about 2/3 of the suffixes. Sort them by
the following substeps, which work with a heavily modified version of
T and may require recursion. In part (a) of this problem on page 999,
you will show that the orders of the suffixes of T and the suffixes of
the modified version of T are the same.
A. Construct two texts P 1 and P 2 made up of “metacharacters” that
are actually substrings of three consecutive characters from T. We
delimit each such metacharacter with parentheses. Construct
P 1 = ( T[1:3])( T[4:6])( T[7:9]) ⋯ ( T[ n′: n′ + 2]), where n′ is the largest integer congruent to 1, modulo 3, that is less
than or equal to n and T is extended beyond position n with the special character Ø, with encoding 0. With the example text T =
bippityboppityboo, we get that
P 1 = (bip) (pit) (ybo) (ppi) (tyb) (ooØ).
Similarly, construct
P 2 = ( T[2:4])( T[5:7])( T[8:10]) ⋯ ( T[ n″: n″ + 2]), where n″ is the largest integer congruent to 2, modulo 3, that is less
than or equal to n. For our example, we have
P 2 = (ipp) (ity) (bop) (pit) (ybo) (oØØ).
Figure 32.15 Computed values when sorting the sample suffixes of the linear-time suffix-array algorithm for the text T = bippityboppityboo.
If n is a multiple of 3, append the metacharacter (ØØØ) to the end
of P 1. In this way, P 1 is guaranteed to end with a metacharacter
containing Ø. (This property helps in part (a) of this problem.) The
text P 2 may or may not end with a metacharacter containing Ø.
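The construction in substep A can be sketched as follows, using the character '\0' (encoding 0) to stand in for the pad character Ø; the helper name metachars is an illustrative choice:

```python
def metachars(T, start):
    """Split T into metacharacters of three consecutive characters each,
    beginning at 1-indexed position `start` (1 for P1, 2 for P2), padding
    past the end of T with '\0'."""
    padded = T + "\0\0"
    parts = [padded[i:i + 3] for i in range(start - 1, len(T), 3)]
    if start == 1 and len(T) % 3 == 0:
        parts.append("\0\0\0")    # ensure P1 ends with a metacharacter containing the pad
    return parts
```

For T = bippityboppityboo, metachars(T, 1) yields the metacharacters of P1 = (bip)(pit)(ybo)(ppi)(tyb)(ooØ), and metachars(T, 2) yields those of P2 = (ipp)(ity)(bop)(pit)(ybo)(oØØ).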
B. Concatenate P 1 and P 2 to form a new text P. Figure 32.15 shows P
for our example, along with the corresponding positions of T.
C. Sort and rank the unique metacharacters of P, with ranks starting
from 1. In the example, P has 10 unique metacharacters: in sorted
order, they are (bip), (bop), (ipp), (ity), (oØØ), (ooØ), (pit),
(ppi), (tyb), (ybo). The metacharacters (pit) and (ybo) each
appear twice.
D. As Figure 32.15 shows, construct a new “text” P′ by renaming each metacharacter in P by its rank. If P contains k unique
metacharacters, then each “character” in P′ is an integer from 1 to
k. The suffix arrays for P and P′ are identical.
E. Compute the suffix array SAP′ of P′. If the characters of P′ (i.e., the ranks of metacharacters in P) are unique, then you can compute
its suffix array directly, since the ordering of the individual
characters gives the suffix array. Otherwise, recurse to compute the
suffix array of P′, treating the ranks in P′ as the input characters in the recursive call. Figure 32.15 shows the suffix array SAP′ for our example. Since the number of metacharacters in P, and hence the
length of P′, is approximately 2 n/3, this recursive subproblem is
smaller than the current problem.
F. From SAP′ and the positions in T corresponding to the sample
positions, compute the list of positions of the sorted sample suffixes
of the original text T. Figure 32.15 shows the list of positions in T
of the sorted sample suffixes in our example.
2. The nonsample suffixes comprise about 1/3 of the suffixes. Using the
sorted sample suffixes, sort the nonsample suffixes by the following
substeps.
Figure 32.16 The ranks r_1 through r_{n+2} for the text T = bippityboppityboo with n = 17.
G. Extending the text T by the two special characters ØØ, so that T
now has n + 2 characters, consider each suffix T[ i:] for i = 1, 2, …, n
+ 2. Assign a rank r_i to each suffix T[ i:]. For the two special characters ØØ, set r_{n+1} = r_{n+2} = 0. For the sample positions of T, base the rank on the list of sorted sample positions of T. The rank
is currently undefined for the nonsample positions of T. For these
positions, set r_i = ☐. Figure 32.16 shows the ranks for T =
bippityboppityboo with n = 17.
H. Sort the nonsample suffixes by comparing tuples (T[ i], r_{i+1}). In our example, we get T[15:] < T[12:] < T[9:] < T[3:] < T[6:] because (b, 6) < (i, 10) < (o, 9) < (p, 8) < (t, 12).
3. Merge the sorted sets of suffixes. From the sorted set of suffixes,
determine the suffix array of T.
This completes the description of a linear-time algorithm for computing
suffix arrays. The following parts of this problem ask you to show that
certain steps of this algorithm are correct and to analyze the algorithm’s running time.
a. Define a nonempty suffix at position i of the text P created in substep B as all metacharacters from position i of P up to and including the
first metacharacter of P in which Ø appears or the end of P. In the
example shown in Figure 32.15, the nonempty suffixes of P starting at positions 1, 4, and 11 of P are (bip) (pit) (ybo) (ppi) (tyb) (ooØ),
(ppi) (tyb) (ooØ), and (ybo) (oØØ), respectively. Prove that the order
of suffixes of P is the same as the order of its nonempty suffixes.
Conclude that the order of suffixes of P gives the order of the sample
suffixes of T. ( Hint: If P contains duplicate metacharacters, consider separately the cases in which two suffixes both start in P 1, both start
in P 2, and one starts in P 1 and the other starts in P 2. Use the property that Ø appears in the last metacharacter of P 1.)
b. Show how to perform substep C in Θ( n) time, bearing in mind that in a recursive call, the characters in T are actually ranks in P′ in the
caller.
c. Argue that the tuples in substep H are unique. Then show how to
perform this substep in Θ( n) time.
d. Consider two suffixes T[ i:] and T[ j:], where T[ i:] is a sample suffix and T[ j:] is a nonsample suffix. Show how to determine in Θ(1) time
whether T[ i:] is lexicographically smaller than T[ j:]. ( Hint: Consider separately the cases in which i mod 3 = 1 and i mod 3 = 2. Compare
tuples whose elements are characters in T and ranks as shown in
Figure 32.16. The number of elements per tuple may depend on
whether i mod 3 equals 1 or 2.) Conclude that step 3 can be performed
in Θ( n) time.
e. Justify the recurrence T( n) ≤ T (2 n/3 + 2) + Θ( n) for the running time of the full algorithm, and show that its solution is O( n). Conclude that the algorithm runs in Θ( n) time.
32-3 Burrows-Wheeler transform
The Burrows-Wheeler transform, or BWT, for a text T is defined as follows. First, append a new character that compares as
lexicographically less than every character of T, and denote this
character by $ and the resulting string by T′. Letting n be the length of T′, create n rows of characters, where each row is one of the n cyclic rotations of T′. Next, sort the rows lexicographically. The BWT is then
the string of n characters in the rightmost column, read top to bottom.
For example, let T = rutabaga, so that T′ = rutabaga$. The
cyclic rotations are
rutabaga$
utabaga$r
tabaga$ru
abaga$rut
baga$ruta
aga$rutab
ga$rutaba
a$rutabag
$rutabaga
Sorting the rows and numbering the sorted rows gives
1 $rutabaga
2 a$rutabag
3 abaga$rut
4 aga$rutab
5 baga$ruta
6 ga$rutaba
7 rutabaga$
8 tabaga$ru
9 utabaga$r
The BWT is the rightmost column, agtbaa$ur. (The row numbering
will be helpful in understanding how to compute the inverse BWT.)
The BWT has applications in bioinformatics, and it can also be a
step in text compression. That is because it tends to place identical
characters together, as in the BWT of rutabaga, which places two of
the instances of a together. When identical characters are placed
together, or even nearby, additional means of compressing become
available. Following the BWT, combinations of move-to-front encoding,
run-length encoding, and Huffman coding (see Section 15.3) can provide significant text compression. Compression ratios with the BWT
tend to improve as the text length increases.
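The definition can be rendered directly (if inefficiently) in code by materializing and sorting the cyclic rotations; this brute-force sketch assumes the text contains only characters that sort after '$', as lowercase letters do in ASCII:

```python
def bwt(T):
    """Burrows-Wheeler transform of T by brute force: append '$' (which,
    as ASCII 36, sorts before lowercase letters), sort all cyclic
    rotations of the result, and read off the last column."""
    Tprime = T + "$"
    n = len(Tprime)
    rotations = sorted(Tprime[i:] + Tprime[:i] for i in range(n))
    return "".join(row[-1] for row in rotations)
```

Applied to rutabaga, this returns agtbaa$ur, the rightmost column shown above. This sketch takes more than linear time; part (a) below asks for the Θ(n) computation via the suffix array.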
a. Given the suffix array for T′, show how to compute the BWT in Θ( n) time.
In order to decompress, the BWT must be invertible. Assuming that
the alphabet size is constant, the inverse BWT can be computed in Θ( n)
time from the BWT. Let’s look at the BWT of rutabaga, denoting it
by BWT[1: n]. Each character in the BWT has a unique lexicographic rank from 1 to n. Denote the rank of BWT[ i] by rank[ i]. If a character appears multiple times in the BWT, each instance of the character has a
rank 1 greater than the previous instance of the character. Here are
BWT and rank for rutabaga:
For example, rank[1] = 2 because BWT[1] = a and the only character
that precedes the first a lexicographically is $ (which we defined to
precede all other characters, so that $ has rank 1). Next, we have rank[2]
= 6 because BWT[2] = g and five characters in the BWT precede g
lexicographically: $, the three instances of a, and b. Jumping ahead to
rank[5] = 3, that is because BWT[5] = a, and because this a is the second instance of a in the BWT, its rank value is 1 greater than the rank value for the previous instance of a, in position 1.
There is enough information in BWT and rank to reconstruct T′
from back to front. Suppose that you know the rank r of a character c
in T′. Then c is the first character in row r of the sorted cyclic rotations.
The last character in row r must be the character that precedes c in T′.
But you know which character is the last character in row r, because it is
BWT[ r]. To reconstruct T′ from back to front, start with $, which you
can find in BWT. Then work backward using BWT and rank to reconstruct T′.
Let’s see how this strategy works for rutabaga. The last character
of T′, $, appears in position 7 of BWT. Since rank[7] = 1, row 1 of the sorted cyclic rotations of T′ begins with $. The character that precedes $
in T′ is the last character in row 1, which is BWT[1]: a. Now we know
that the last two characters of T′ are a$. Looking up rank[1], it equals 2, so that row 2 of the sorted cyclic rotations of T′ begins with a. The last
character in row 2 precedes a in T′, and that character is BWT[2] = g.
Now we know that the last three characters of T′ are ga$. Continuing
on, we have rank[2] = 6, so that row 6 of the sorted cyclic rotations begins with g. The character preceding g in T′ is BWT[6] = a, and so
the last four characters of T′ are aga$. Because rank[6] = 4, a begins row 4 of the sorted cyclic rotations of T′. The character preceding a in
T′ is the last character in row 4, BWT[4] = b, and the last five characters of T′ are baga$. And so on, until all n characters of T′ have been identified, from back to front.
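The back-to-front reconstruction just traced can be sketched in Python. For clarity, the rank array here is built with a simple sort rather than in the Θ(n) way that part (b) below asks for:

```python
def compute_rank(BWT):
    """rank[i] = 1-based lexicographic rank of BWT[i] among all
    characters of the BWT, with equal characters ranked in order of
    occurrence (built here by sorting, for simplicity)."""
    n = len(BWT)
    order = sorted(range(n), key=lambda i: (BWT[i], i))  # ties broken by position
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r
    return rank

def inverse_bwt(BWT):
    """Reconstruct T' from its BWT, working from back to front."""
    rank = compute_rank(BWT)
    pos = BWT.index("$")          # $ is the last character of T'
    out = ["$"]
    for _ in range(len(BWT) - 1):
        pos = rank[pos] - 1       # row beginning with the character just found
        out.append(BWT[pos])      # its last character precedes it in T'
    return "".join(reversed(out))
```

For the running example, inverse_bwt("agtbaa$ur") returns rutabaga$, reversing the transform.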
b. Given the array BWT[1: n], write pseudocode to compute the array rank[1: n] in Θ( n) time, assuming that the alphabet size is constant.
c. Given the arrays BWT[1: n] and rank[1: n], write pseudocode to compute T′ in Θ( n) time.
Chapter notes
The relation of string matching to the theory of finite automata is
discussed by Aho, Hopcroft, and Ullman [5]. The Knuth-Morris-Pratt algorithm [267] was invented independently by Knuth and Pratt and by Morris, but they published their work jointly. Matiyasevich [317] earlier discovered a similar algorithm, which applied only to an alphabet with
two characters and was specified for a Turing machine with a two-
dimensional tape. Reingold, Urban, and Gries [377] give an alternative treatment of the Knuth-Morris-Pratt algorithm. The Rabin-Karp
algorithm was proposed by Karp and Rabin [250]. Galil and Seiferas
[173] give an interesting deterministic linear-time string-matching
algorithm that uses only O(1) space beyond that required to store the pattern and text.
The suffix-array algorithm in Section 32.5 is by Manber and Myers
[312], who first proposed the notion of suffix arrays. The linear-time algorithm to compute the longest common prefix array presented here is
by Kasai et al. [252]. Problem 32-2 is based on the DC3 algorithm by Kärkkäinen, Sanders, and Burkhardt [245]. For a survey of suffix-array algorithms, see the article by Puglisi, Smyth, and Turpin [370]. To learn more about the Burrows-Wheeler transform from Problem 32-3, see the
articles by Burrows and Wheeler [78] and Manzini [314].
1 For suffix arrays, the preprocessing time of O(n lg n) comes from the algorithm presented in
Section 32.5. It can be reduced to Θ( n) by using the algorithm in Problem 32-2. The factor k in the matching time denotes the number of occurrences of the pattern in the text.
2 We write Θ( n – m + 1) instead of Θ( n – m) because s takes on n − m + 1 different values. The
“+1” is significant in an asymptotic sense because when m = n, computing the lone t_s value takes Θ(1) time, not Θ(0) time.
3 Informally, lexicographic order is “alphabetical order” in the underlying character set. A more precise definition of lexicographic order appears in Problem 12-2 on page 327.
4 Why keep saying “length at most”? Because for a given value of l, a substring of length l starting at position i is T[ i: i + l – 1]. If i + l − 1 > n, then the substring cuts off at the end of the text.
33 Machine-Learning Algorithms
Machine learning may be viewed as a subfield of artificial intelligence.
Broadly speaking, artificial intelligence aims to enable computers to
carry out complex perception and information-processing tasks with
human-like performance. The field of AI is vast and uses many different
algorithmic methods.
Machine learning is rich and fascinating, with strong ties to statistics
and optimization. Technology today produces enormous amounts of
data, providing myriad opportunities for machine-learning algorithms
to formulate and test hypotheses about patterns within the data. These
hypotheses can then be used to make predictions about the
characteristics or classifications in new data. Because machine learning
is particularly good with challenging tasks involving uncertainty, where
observed data follows unknown rules, it has markedly transformed
fields such as medicine, advertising, and speech recognition.
This chapter presents three important machine-learning algorithms:
k-means clustering, multiplicative weights, and gradient descent. You
can view each of these tasks as a learning problem, whereby an
algorithm uses the data collected so far to produce a hypothesis that
describes the regularities learned and/or makes predictions about new
data. The boundaries of machine learning are imprecise and evolving—
some might say that the k-means clustering algorithm should be called
“data science” and not “machine learning,” and gradient descent,
though an immensely important algorithm for machine learning, also
has a multitude of applications outside of machine learning (most
notably for optimization problems).
Machine learning typically starts with a training phase followed by a prediction phase in which predictions are made about new data. For
online learning, the training and prediction phases are intermingled. The
training phase takes as input training data, where each input data point
has an associated output or label; the label might be a category name or
some real-valued attribute. It then produces as an output one or more
hypotheses about how the labels depend on the attributes of the input
data points. Hypotheses can take many forms, typically some type of
formula or algorithm. The learning algorithm used is often a form of
gradient descent. The prediction phase then uses the hypothesis on new
data in order to make predictions regarding the labels of new data
points.
The type of learning just described is known as supervised learning,
since it starts with a set of inputs that are each labeled. As an example,
consider a machine-learning algorithm to recognize spam emails. The
training data comprises a collection of emails, each of which is labeled
either “spam” or “not spam.” The machine-learning algorithm frames a
hypothesis, possibly a rule of the form “if an email has one of a set of
words, then it is likely to be spam.” Or it might learn rules that assign a
spam score to each word and then evaluate a document by the sum of
the spam scores of its constituent words, so that a document with a total
score above a certain threshold value is classified as spam. The machine-
learning algorithm can then predict whether a new email is spam or not.
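The word-score rule described above can be sketched in a few lines. The scores and the threshold here are invented purely for illustration; a real spam filter would learn them from the labeled training emails.

```python
# Hypothetical word scores and threshold (not learned from real data):
# a message is classified as spam when its total word score exceeds
# the threshold.
SPAM_SCORES = {"free": 2.0, "winner": 3.0, "prize": 2.5, "meeting": -1.0}
THRESHOLD = 3.0

def is_spam(message):
    """Classify a message by summing the spam scores of its words."""
    total = sum(SPAM_SCORES.get(word, 0.0) for word in message.lower().split())
    return total > THRESHOLD

print(is_spam("You are a winner of a free prize"))   # 3.0 + 2.0 + 2.5 = 7.5 > 3.0, so True
print(is_spam("Agenda for the meeting tomorrow"))    # -1.0 <= 3.0, so False
```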
A second form of machine learning is unsupervised learning, where
the training data is unlabeled, as in the clustering problem of Section
33.1. Here the machine-learning algorithm produces hypotheses
regarding the centers of groups of input data points.
A third form of machine learning (not covered further here) is
reinforcement learning, where the machine-learning algorithm takes
actions in an environment, receives feedback for those actions from the
environment, and then updates its model of the environment based on
the feedback. The learner is in an environment that has some state, and
the actions of the learner have an effect on that state. Reinforcement
learning is a natural choice for situations such as game playing or
operating a self-driving car.
Sometimes the goal in a supervised machine-learning application is
not making accurate predictions of labels for new examples, but rather
performing causal inference: finding an explanatory model that
describes how the various features of an input data point affect its
associated label. Finding a model that fits a given set of training data
well can be tricky. It may involve sophisticated optimization methods
that need to balance between producing a hypothesis that fits the data
well and producing a hypothesis that is simple.
This chapter focuses on three problem domains: finding hypotheses
that group the input data points well (using a clustering algorithm),
learning which predictors (experts) to rely upon for making predictions
in an online learning problem (using the multiplicative-weights
algorithm), and fitting a model to data (using gradient descent).
Section 33.1 considers the clustering problem: how to divide a given set of n training data points into a given number k of groups, or
“clusters,” based on a measure of how similar (or more accurately, how
dissimilar) points are to each other. The approach is iterative, beginning
with an arbitrary initial clustering and incorporating successive
improvements until no further improvements occur. Clustering is often
used as an initial step when working on a machine-learning problem to
discover what structure exists in the data.
Section 33.2 shows how to make online predictions quite accurately when you have a set of predictors, often called “experts,” to rely on,
many of which might be poor predictors, but some of which are good
predictors. At first, you do not know which predictors are poor and
which are good. The goal is to make predictions on new examples that
are nearly as good as the predictions made by the best predictor. We
study an effective multiplicative-weights prediction method that
associates a positive real weight with each predictor and multiplicatively
decreases the weights associated with predictors when they make poor
predictions. The model in this section is online (see Chapter 27): at each step, we do not know anything about the future examples. In addition,
we are able to make predictions even in the presence of adversarial
experts, who are collaborating against us, a situation that actually
happens in game-playing settings.
Finally, Section 33.3 introduces gradient descent, a powerful optimization technique used to find parameter settings in machine-learning models. Gradient descent also has many applications outside of
machine learning. Intuitively, gradient descent finds the value that
produces a local minimum for a function by “walking downhill.” In a
learning application, a “downhill step” is a step that adjusts hypothesis
parameters so that the hypothesis does better on the given set of labeled
examples.
This chapter makes extensive use of vectors. In contrast to the rest of
the book, vector names in this chapter appear in boldface, such as x, to
more clearly delineate which quantities are vectors. Components of
vectors do not appear in boldface, so if vector x has d dimensions, we
might write x = (x_1, x_2, …, x_d).
33.1 Clustering
Suppose that you have a large number of data points (examples), and
you wish to group them into classes based on how similar they are to
each other. For example, each data point might represent a celestial star,
giving its temperature, size, and spectral characteristics. Or, each data
point might represent a fragment of recorded speech. Grouping these
speech fragments appropriately might reveal the set of accents of the
fragments. Once a grouping of the training data points is found, new
data can be placed into an appropriate group, facilitating star-type
recognition or speech recognition.
These situations, along with many others, fall under the umbrella of
clustering. The input to a clustering problem is a set of n examples (objects) and an integer k, with the goal of dividing the examples into at
most k disjoint clusters such that the examples in each cluster are
similar to each other. The clustering problem has several variations. For
example, the integer k might not be given, but instead arises out of the
clustering procedure. In this section we presume that k is given.
Feature vectors and similarity
Let’s formally define the clustering problem. The input is a set of
n examples. Each example has a set of attributes in common with all other examples, though the attribute values may vary among examples.
For example, the clustering problem shown in Figure 33.1 clusters n =
49 examples—48 state capitals plus the District of Columbia—into k =
4 clusters. Each example has two attributes: the latitude and longitude
of the capital. In a given clustering problem, each example has d
attributes, with an example x specified by a d-dimensional feature vector
x = (x_1, x_2, …, x_d).
Here, x_a for a = 1, 2, …, d is a real number giving the value of attribute a for example x. We call x the point in ℝ^d representing the example. For the example in Figure 33.1, each capital x has its latitude in x_1 and its longitude in x_2.
In order to cluster similar points together, we need to define
similarity. Instead, let’s define the opposite: the dissimilarity Δ(x, y) of
points x and y is the squared Euclidean distance between them:

  Δ(x, y) = ‖x − y‖² = Σ_{a=1}^{d} (x_a − y_a)².    (33.1)
Of course, for Δ(x, y) to be well defined, all attribute values must be
present. If any are missing, then you might just ignore that example, or
you could fill in a missing attribute value with the median value for that
attribute.
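The dissimilarity measure transcribes directly into code. This is a minimal sketch for points represented as tuples of floats.

```python
def dissimilarity(x, y):
    """Return the squared Euclidean distance between d-dimensional
    points x and y: the sum over attributes a of (x_a - y_a)**2."""
    return sum((xa - ya) ** 2 for xa, ya in zip(x, y))

print(dissimilarity((0.0, 0.0), (3.0, 4.0)))  # 3**2 + 4**2 = 25.0
```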
The attribute values are often “messy” in other ways, so that some
“data cleaning” is necessary before the clustering algorithm is run. For
example, the scale of attribute values can vary widely across attributes.
In the example of Figure 33.1, the scales of the two attributes vary by a factor of 2, since latitude ranges from −90 to +90 degrees but longitude
ranges from −180 to +180 degrees. You can imagine other scenarios
where the differences in scales are even greater. If the examples contain
information about students, one attribute might be grade-point average
but another might be family income. Therefore, the attribute values are
usually scaled or normalized, so that no single attribute can dominate the others when computing dissimilarities. One way to do so is by
scaling attribute values with a linear transform so that the minimum
value becomes 0 and the maximum value becomes 1. If the attribute
values are binary values, then no scaling may be needed. Another
option is scaling so that the values for each attribute have mean 0 and
unit variance. Sometimes it makes sense to choose the same scaling rule
for several related attributes (for example, if they are lengths measured
to the same scale).
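The two normalization rules just mentioned can be sketched as follows, applied to one attribute (a list of values) at a time. The sketch assumes the attribute has at least two distinct values.

```python
from statistics import mean, pstdev

def minmax_scale(values):
    """Linearly rescale so the minimum becomes 0 and the maximum 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]   # assumes hi > lo

def standardize(values):
    """Rescale to mean 0 and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]       # assumes sigma > 0

print(minmax_scale([10, 20, 30]))  # [0.0, 0.5, 1.0]
```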
Figure 33.1 The iterations of Lloyd’s procedure when clustering the capitals of the lower 48
states and the District of Columbia into k = 4 clusters. Each capital has two attributes: latitude and longitude. Each iteration reduces the value f, measuring the sum of squares of distances of all capitals to their cluster centers, until the value of f does not change. (a) The initial four clusters, with the capitals of Arkansas, Kansas, Louisiana, and Tennessee chosen as centers. (b)–
(k) Iterations of Lloyd’s procedure. (l) The 11th iteration results in the same value of f as the 10th iteration in part (k), and so the procedure terminates.
Also, the choice of dissimilarity measure is somewhat arbitrary. The
use of the sum of squared differences as in equation (33.1) is not
required, but it is a conventional choice and mathematically convenient.
For the example of Figure 33.1, you might use the actual distance between capitals rather than equation (33.1).
Clusterings
With the notion of similarity (actually, dissimilarity) defined, let’s see how to define clusters of similar points. Let S denote the given set of n
points in ℝ d. In some applications the points are not necessarily
distinct, so that S is a multiset rather than a set.
Because the goal is to create k clusters, we define a k-clustering of S
as a decomposition of S into a sequence 〈 S(1), S(2), …, S( k)〉 of k disjoint subsets, or clusters, so that
S = S(1) ⋃ S(2) ⋃ ⋯ ⋃ S( k).
A cluster may be empty, for example if k > 1 but all of the points in S
have the same attribute values.
There are many ways to define a k-clustering of S and many ways to
evaluate the quality of a given k-clustering. We consider here only k-
clusterings of S that are defined by a sequence C of k centers C = 〈c(1), c(2), …, c( k)〉,
where each center is a point in ℝ d, and the nearest-center rule says that a point x may belong to cluster S(ℓ) if the center of no other cluster is
closer to x than the center c(ℓ) of S(ℓ):
x ∈ S(ℓ) only if Δ(x, c(ℓ)) = min {Δ(x, c( j)): 1 ≤ j ≤ k}.
A center can be anywhere, and not necessarily a point in S.
Ties are possible and must be broken so that each point lies in
exactly one cluster. In general, ties may be broken arbitrarily, although
we’ll need the property that we never change which cluster a point x is
assigned to unless the distance from x to its new cluster center is strictly
smaller than the distance from x to its old cluster center. That is, if the
current cluster has a center that is one of the closest cluster centers to x,
then don’t change which cluster x is assigned to.
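The nearest-center rule with this tie-breaking property can be sketched as follows: a point keeps its current cluster unless some other center is strictly closer.

```python
def dissimilarity(x, y):
    """Squared Euclidean distance between points x and y."""
    return sum((xa - ya) ** 2 for xa, ya in zip(x, y))

def assign_cluster(x, centers, current):
    """Return the index of x's cluster: the current one, unless another
    center is strictly closer than the current center."""
    best = current
    best_d = dissimilarity(x, centers[current])
    for j, c in enumerate(centers):
        d = dissimilarity(x, c)
        if d < best_d:          # strict inequality: a tie never moves x
            best, best_d = j, d
    return best

centers = [(0.0, 0.0), (2.0, 0.0)]
# The point (1, 0) is equidistant from both centers, so it stays put:
print(assign_cluster((1.0, 0.0), centers, current=1))  # 1
print(assign_cluster((1.0, 0.0), centers, current=0))  # 0
```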
The k-means problem is then the following: given a set S of n points and a positive integer k, find a sequence C = 〈c(1), c(2), …, c( k)〉 of k center points minimizing the sum f( S, C) of the squared distances from each point to its nearest center, where

  f(S, C) = Σ_{x ∈ S} min {Δ(x, c(j)) : 1 ≤ j ≤ k}
          = Σ_{ℓ=1}^{k} Σ_{x ∈ S(ℓ)} Δ(x, c(ℓ)).    (33.2)

In the second line, the k-clustering 〈 S(1), S(2), …, S( k)〉 is defined by the centers C and the nearest-center rule. See Exercise 33.1-1 for an
alternative formulation based on pairwise interpoint distances.
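The objective f(S, C) can be computed directly from its first form: each point contributes its squared distance to the nearest center.

```python
def dissimilarity(x, y):
    """Squared Euclidean distance between points x and y."""
    return sum((xa - ya) ** 2 for xa, ya in zip(x, y))

def objective(points, centers):
    """f(S, C): sum over x in S of the minimum over centers c of delta(x, c)."""
    return sum(min(dissimilarity(x, c) for c in centers) for x in points)

S = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0)]
C = [(0.5, 0.0), (9.0, 0.0)]
print(objective(S, C))  # 0.25 + 0.25 + 0.0 = 0.5
```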
Is there a polynomial-time algorithm for the k-means problem?
Probably not, because it is NP-hard [310]. As we’ll see in Chapter 34, NP-hard problems have no known polynomial-time algorithm, but
nobody has ever proven that polynomial-time algorithms for NP-hard
problems cannot exist. Although we know of no polynomial-time
algorithm that finds the global minimum over all clusterings (according
to equation (33.2)), we can find a local minimum.
Lloyd [304] proposed a simple procedure that finds a sequence C of k centers that yields a local minimum of f( S, C). A local minimum in the k-means problem satisfies two simple properties: each cluster has an
optimal center (defined below), and each point is assigned to the cluster
(or one of the clusters) with the closest center. Lloyd’s procedure finds a
good clustering—possibly optimal—that satisfies these two properties.
These properties are necessary, but not sufficient, for optimality.
Optimal center for a given cluster
In an optimal solution to the k-means problem, each center point must
be the centroid, or mean, of the points in its cluster. The centroid is a d-
dimensional point, where the value in each dimension is the mean of the
values of all the points in the cluster in that dimension (that is, the mean
of the corresponding attribute values in the cluster). That is, if c(ℓ) is the centroid for cluster S(ℓ), then for attributes a = 1, 2, …, d, we have

  c_a(ℓ) = (1/|S(ℓ)|) Σ_{x ∈ S(ℓ)} x_a.

Over all attributes, we write

  c(ℓ) = (1/|S(ℓ)|) Σ_{x ∈ S(ℓ)} x.    (33.3)
Theorem 33.1
Given a nonempty cluster S(ℓ), its centroid (or mean) is the unique choice for the cluster center c(ℓ) ∈ ℝ^d that minimizes

  Σ_{x ∈ S(ℓ)} Δ(x, c(ℓ)).

Proof We wish to minimize, by choosing c(ℓ) ∈ ℝ^d, the sum

  Σ_{x ∈ S(ℓ)} Δ(x, c(ℓ)) = Σ_{x ∈ S(ℓ)} Σ_{a=1}^{d} (x_a − c_a(ℓ))².

For each attribute a, the term summed is a convex quadratic function in c_a(ℓ). To minimize this function, take its derivative with respect to c_a(ℓ) and set it to 0:

  Σ_{x ∈ S(ℓ)} −2 (x_a − c_a(ℓ)) = 0,

or, equivalently,

  c_a(ℓ) = (1/|S(ℓ)|) Σ_{x ∈ S(ℓ)} x_a.

Since the minimum is obtained uniquely when each coordinate c_a(ℓ) of c(ℓ) is the average of the corresponding coordinate x_a for x ∈ S(ℓ), the overall minimum is obtained when c(ℓ) is the centroid of the points x, as in equation (33.3).
▪
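The centroid of Theorem 33.1 is computed attribute by attribute: each coordinate of c(ℓ) is the mean of the corresponding coordinates of the points in the cluster.

```python
def centroid(cluster):
    """Return the mean of a nonempty list of d-dimensional points:
    coordinate a of the result is the average of coordinate a over
    the points in the cluster."""
    n = len(cluster)
    d = len(cluster[0])
    return tuple(sum(x[a] for x in cluster) / n for a in range(d))

print(centroid([(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]))  # (1.0, 1.0)
```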
Optimal clusters for given centers
The following theorem shows that the nearest-center rule—assigning
each point x to one of the clusters whose center is nearest to x—yields
an optimal solution to the k-means problem.
Theorem 33.2
Given a set S of n points and a sequence 〈c(1), c(2), …, c( k)〉 of k centers, a clustering 〈 S(1), S(2), …, S( k)〉 minimizes

  Σ_{ℓ=1}^{k} Σ_{x ∈ S(ℓ)} Δ(x, c(ℓ))    (33.4)

if and only if it assigns each point x ∈ S to a cluster S(ℓ) that minimizes Δ(x, c(ℓ)).
Proof The proof is straightforward: each point x ∈ S contributes exactly once to the sum (33.4), and choosing to put x in a cluster whose
center is nearest minimizes the contribution from x.
▪
Lloyd’s procedure
Lloyd’s procedure just iterates two operations—assigning points to
clusters based on the nearest-center rule, followed by recomputing the
centers of clusters to be their centroids—until the results converge. Here
is Lloyd’s procedure:
Input: A set S of points in ℝ d, and a positive integer k.
Output: A k-clustering 〈 S(1), S(2), …, S( k)〉 of S with a sequence of centers 〈c(1), c(2), …, c( k)〉.
1. Initialize centers: Generate an initial sequence 〈c(1), c(2), …, c( k)〉 of k centers by picking k points independently from S at random. (If the points are not necessarily distinct, see Exercise
33.1-3.) Assign all points to cluster S(1) to begin.
2. Assign points to clusters: Use the nearest-center rule to define the
clustering 〈 S(1), S(2), …, S( k)〉. That is, assign each point x ∈ S
to a cluster S(ℓ) having a nearest center (breaking ties arbitrarily,
but not changing the assignment for a point x unless the new
cluster center is strictly closer to x than the old one).
3. Stop if no change: If step 2 did not change the assignments of
any points to clusters, then stop and return the clustering 〈 S(1),
S(2), …, S( k)〉 and the associated centers 〈c(1), c(2), …, c( k)〉.
Otherwise, go to step 4.
4. Recompute centers as centroids: For ℓ = 1, 2, …, k, compute the
center c(ℓ) of cluster S(ℓ) as the centroid of the points in S(ℓ). (If
S(ℓ) is empty, let c(ℓ) be the zero vector.) Then go to step 2.
It is possible for some of the clusters returned to be empty, particularly
if many of the input points are identical.
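Steps 1–4 above can be sketched compactly as follows. For simplicity this sketch initializes the centers with `random.sample` (which picks k distinct points, a slight variant of step 1's independent random picks) and uses the zero vector for empty clusters, as in step 4.

```python
import random

def dissimilarity(x, y):
    """Squared Euclidean distance between points x and y."""
    return sum((xa - ya) ** 2 for xa, ya in zip(x, y))

def lloyd(points, k, seed=0):
    random.seed(seed)
    d = len(points[0])
    centers = random.sample(points, k)        # step 1: initialize centers
    assignment = [0] * len(points)            # all points start in cluster S(1)
    while True:
        changed = False
        for i, x in enumerate(points):        # step 2: nearest-center rule
            cur = assignment[i]
            best, best_d = cur, dissimilarity(x, centers[cur])
            for j in range(k):
                dj = dissimilarity(x, centers[j])
                if dj < best_d:               # strictly closer: reassign x
                    best, best_d = j, dj
            if best != cur:
                assignment[i] = best
                changed = True
        if not changed:                       # step 3: stop if no change
            return assignment, centers
        for j in range(k):                    # step 4: recompute centroids
            members = [points[i] for i in range(len(points))
                       if assignment[i] == j]
            if members:
                centers[j] = tuple(sum(x[a] for x in members) / len(members)
                                   for a in range(d))
            else:
                centers[j] = tuple(0.0 for _ in range(d))

S = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
assignment, centers = lloyd(S, k=2)
print(assignment, sorted(centers))
```

On termination, each point is assigned to a nearest center and each center is the centroid of its cluster, so the result is a local minimum of f(S, C); which local minimum depends on the random initialization.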
Lloyd’s procedure always terminates. By Theorem 33.1, recomputing
the centers of each cluster as the cluster centroid cannot increase f( S, C). Lloyd’s procedure ensures that a point is reassigned to a different cluster only when such an operation strictly decreases f( S, C). Thus each iteration of Lloyd’s procedure, except the last iteration, must strictly
decrease f( S, C). Since there are only a finite number of possible k-
clusterings of S (at most k^n), the procedure must terminate.
Furthermore, once one iteration of Lloyd’s procedure yields no decrease
in f, further iterations would not change anything, and the procedure
can stop at this locally optimum assignment of points to clusters.
If Lloyd’s procedure really required k^n iterations, it would be
impractical. In practice, it sometimes suffices to terminate the procedure
when the percentage decrease in f( S, C) in the latest iteration falls below
a predetermined threshold. Because Lloyd’s procedure is guaranteed to find only a locally optimal clustering, one approach to finding a good
clustering is to run Lloyd’s procedure many times with different
randomly chosen initial centers, taking the best result.
The running time of Lloyd’s procedure is proportional to the number
T of iterations. In one iteration, assigning points to clusters based on the nearest-center rule requires O( dkn) time, and recomputing new centers for each cluster requires O( dn) time (because each point is in one cluster). The overall running time of the k-means procedure is thus
O( Tdkn).
Lloyd’s algorithm illustrates an approach common to many
machine-learning algorithms:
First, define a hypothesis space in terms of an appropriate sequence
θ of parameters, so that each θ is associated with a specific hypothesis hθ. (For the k-means problem, θ is a dk-dimensional vector, equivalent to C, containing the d-dimensional center of each of the k clusters, and hθ is the hypothesis that each data point x should be grouped with a cluster having a center closest to
x.)
Second, define a measure f( E, θ) describing how poorly hypothesis hθ fits the given training data E. Smaller values of f( E, θ) are better, and a (locally) optimal solution (locally) minimizes f( E, θ). (For the k-means problem, f( E, θ) is just f( S, C).)
Third, given a set of training data E, use a suitable optimization
procedure to find a value θ* of θ that minimizes f( E, θ), at least locally. (For the k-means problem, this value of θ* is the sequence
C of k center points returned by Lloyd’s algorithm.)
Return θ* as the answer.
In this framework, we see that optimization becomes a powerful tool for
machine learning. Using optimization in this way is flexible. For
example, regularization terms can be incorporated in the function to be
minimized, in order to penalize hypotheses that are “too complicated”
and that “overfit” the training data. (Regularization is a complex topic that isn’t pursued further here.)
Examples
Figure 33.1 demonstrates Lloyd’s procedure on a set of n = 49 cities: 48
U.S. state capitals and the District of Columbia. Each city has d = 2
dimensions: latitude and longitude. The initial clustering in part (a) of
the figure has the initial cluster centers arbitrarily chosen as the capitals
of Arkansas, Kansas, Louisiana, and Tennessee. As the procedure
iterates, the value of the function f decreases, until the 11th iteration in
part (l), where it remains the same as in the 10th iteration in part (k).
Lloyd’s procedure then terminates with the clusters shown in part (l).
As Figure 33.2 shows, Lloyd’s procedure can also apply to “vector quantization.” Here, the goal is to reduce the number of distinct colors
required to represent a photograph, thereby allowing the photograph to
be greatly compressed (albeit in a lossy manner). In part (a) of the
figure, an original photograph 700 pixels wide and 500 pixels high uses
24 bits (three bytes) per pixel to encode a triple of red, green, and blue
(RGB) primary color intensities. Parts (b)–(e) of the figure show the
results of using Lloyd’s procedure to compress the picture from an initial
space of 2^24 possible values per pixel to a space of only k = 4, k = 16, k
= 64, or k = 256 possible values per pixel; these k values are the cluster centers. The photograph can then be represented with only 2, 4, 6, or 8
bits per pixel, respectively, instead of the 24 bits per pixel needed by the
initial photograph. An auxiliary table, the “palette,” accompanies the
compressed image; it holds the k 24-bit cluster centers and is used to map each pixel value to its 24-bit cluster center when the photo is
decompressed.
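The palette encoding just described can be sketched as follows: each pixel is stored as the index of its nearest palette color, and decompression maps the indices back through the palette. (In the full scheme, the palette itself would be the k cluster centers found by Lloyd's procedure.)

```python
def dissimilarity(x, y):
    """Squared Euclidean distance between two RGB triples."""
    return sum((xa - ya) ** 2 for xa, ya in zip(x, y))

def compress(pixels, palette):
    """Replace each RGB pixel with the index of its nearest palette color."""
    return [min(range(len(palette)),
                key=lambda j: dissimilarity(p, palette[j]))
            for p in pixels]

def decompress(indices, palette):
    """Map stored indices back to 24-bit RGB values."""
    return [palette[j] for j in indices]

palette = [(0, 0, 0), (255, 255, 255)]       # k = 2 colors: 1 bit per pixel
pixels = [(10, 10, 10), (250, 240, 245)]
codes = compress(pixels, palette)
print(codes)                                 # [0, 1]
print(decompress(codes, palette))            # [(0, 0, 0), (255, 255, 255)]
```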
Exercises
33.1-1
Show that when each center in C is the centroid of its cluster, the objective function f( S, C) of equation (33.2) may be alternatively written as

  f(S, C) = Σ_{ℓ=1}^{k} (1/(2|S(ℓ)|)) Σ_{x ∈ S(ℓ)} Σ_{y ∈ S(ℓ)} Δ(x, y).
33.1-2
Give an example in the plane with n = 4 points and k = 2 clusters where an iteration of Lloyd’s procedure does not improve f( S, C), yet the k-
clustering is not optimal.
33.1-3
When the input to Lloyd’s procedure contains many repeated points, a
different initialization procedure might be used. Describe a way to pick
a number of centers at random that maximizes the number of distinct
centers picked. ( Hint: See Exercise 5.3-5.)
33.1-4
Show how to find an optimal k-clustering in polynomial time when
there is just one attribute ( d = 1).
Figure 33.2 Using Lloyd’s procedure for vector quantization to compress a photo by using fewer colors. (a) The original photo has 350,000 pixels (700 × 500), each a 24-bit RGB (red/green/blue) triple of 8-bit values; these pixels (colors) are the “points” to be clustered. Points repeat, so there are only 79,083 distinct colors (less than 2^24). After compression, only k distinct colors are used, so each pixel is represented by only ⌈lg k⌉ bits instead of 24. A “palette” maps these values back to 24-bit RGB values (the cluster centers). (b)–(e) The same photo with k = 4, 16, 64, and 256 colors. (Photo from standuppaddle, pixabay.com.)
33.2 Multiplicative-weights algorithms
This section considers problems that require you to make a series of
decisions. After each decision you receive feedback as to whether your
decision was correct. We will study a class of algorithms that are called
multiplicative-weights algorithms. This class of algorithms has a wide
variety of applications, including game playing in economics,
approximately solving linear-programming and multicommodity-flow
problems, and various applications in online machine learning. We
emphasize the online nature of the problem here: you have to make a
sequence of decisions, but some of the information needed to make the
i th decision appears only after you have already made the ( i – 1)st decision. In this section, we look at one particular problem, known as
“learning from experts,” and develop an example of a multiplicative-
weights algorithm, called the weighted-majority algorithm.
Suppose that a series of events will occur, and you want to make
predictions about these events. For example, over a series of days, you
want to predict whether it is going to rain. Or perhaps you want to
predict whether the price of a stock will increase or decrease. One way
to approach this problem is to assemble a group of “experts” and use