their collective wisdom in order to make good predictions. Let's denote the experts, n of them, by E_1, E_2, …, E_n, and let's say that T events are going to take place. Each event has an outcome of either 0 or 1, with o(t) denoting the outcome of the tth event. Before event t, each expert E_i makes a prediction ξ_i(t) ∈ {0, 1}. You, as the "learner," then take the set of n expert predictions for event t and produce a single prediction p(t) ∈ {0, 1} of your own. You base your prediction only on the predictions of the experts and anything you have learned about the experts from their previous predictions. You do not use any additional information about the event. Only after making your prediction do you ascertain the outcome o(t) of event t. If your prediction p(t) matches o(t), then you were correct; otherwise, you made a mistake. The goal is to minimize the total number m of mistakes, where

    m = |{t : p(t) ≠ o(t), t = 1, 2, …, T}| .

You can also keep track of the number of mistakes each expert makes: expert E_i makes m_i mistakes, where

    m_i = |{t : ξ_i(t) ≠ o(t), t = 1, 2, …, T}| .

For example, suppose that you are following the price of a stock, and

each day you decide whether to invest in it for just that day by buying it

at the beginning of the day and selling it at the end of the day. If, on

some day, you buy the stock and it goes up, then you made the correct

decision, but if the stock goes down, then you made a mistake.

Similarly, if on some day, you do not buy the stock and it goes down,

then you made the correct decision, but if the stock goes up, then you

made a mistake. Since you would like to make as few mistakes as

possible, you use the advice of the experts to make your decisions.

We’ll assume nothing about the movement of the stock. We’ll also

assume nothing about the experts: the experts’ predictions could be

correlated, they could be chosen to deceive you, or perhaps some are

not really experts after all. What algorithm would you use?

Before designing an algorithm for this problem, we need to consider

what is a fair way to evaluate our algorithm. It is reasonable to expect

that our algorithm performs better when the expert predictions are better, and that it performs worse when the expert predictions are worse.

The goal of the algorithm is to limit the number of mistakes you make

to be close to the number of mistakes that the best of the experts makes.

At first, this goal might seem impossible, because you do not know until

the end which expert is best. We’ll see, however, that by taking the

advice provided by all the experts into account, you can achieve this

goal. More formally, we use the notion of “regret,” which compares our

algorithm to the performance of the best expert (in hindsight) over all T events. Letting m* = min {m_i : 1 ≤ i ≤ n} denote the number of mistakes made by the best expert, the regret is m − m*. The goal is to design an algorithm with low regret. (Regret can be negative, although it typically isn't, since it is rare that you do better than the best expert.)

As a warm-up, let’s consider the case in which one of the experts

makes a correct prediction each time. Even without knowing who that

expert is, you can still achieve good results.

Lemma 33.3

Suppose that out of n experts, there is one who always makes the correct

prediction for all T events. Then there is an algorithm that makes at most ⌈lg n⌉ mistakes.

Proof The algorithm maintains a set S consisting of experts who have

not yet made a mistake. Initially, S contains all n experts. The algorithm’s prediction is always the majority vote of the predictions of

the experts remaining in set S. In case of a tie, the algorithm makes any

prediction. After each outcome is learned, set S is updated to remove all

the experts who made an incorrect prediction about that outcome.

We now analyze the algorithm. The expert who always makes the

correct prediction will always be in set S. Every time the algorithm makes a mistake, at least half of the experts who were still in S also make a mistake, and these experts are removed from S. If S′ is the set of experts remaining after removing those who made a mistake, we have that |S′| ≤ |S|/2. The size of S can be halved at most ⌈lg n⌉ times until |S| = 1. From this point on, we know that the algorithm never makes a mistake, since the set S consists only of the one expert who never makes a mistake. Therefore, overall the algorithm makes at most ⌈lg n⌉ mistakes.

Exercise 33.2-1 asks you to generalize this result to the case in which there is no expert who makes perfect predictions and show that, for any set of experts, there is an algorithm that makes at most (m* + 1)(⌈lg n⌉ + 1) mistakes. The generalized algorithm begins in the same way. The set S

might become empty at some point, however. If that ever happens, reset

S to contain all the experts and continue the algorithm.

You can substantially improve your prediction ability by moving from merely tracking which experts have not made any mistakes, or have not made any mistakes recently, to a more nuanced evaluation of the quality of

each expert. The key idea is to use the feedback you receive to update

your evaluation of how much trust to put in each expert. As the experts

make predictions, you observe whether they were correct and decrease

your confidence in the experts who make more mistakes. In this way,

you can learn over time which experts are more reliable and which are

less reliable, and weight their predictions accordingly. The change in

weights is accomplished via multiplication, hence the term

“multiplicative weights.”

The algorithm appears in the procedure WEIGHTED-MAJORITY

on the following page, which takes a set E = {E_1, E_2, …, E_n} of experts, a number T of events, the number n of experts, and a parameter 0 < γ ≤ 1/2 that controls how the weights change. The algorithm maintains weights w_i(t) for i = 1, 2, …, n and t = 1, 2, …, T, where w_i(t) is the weight of expert E_i going into event t. The for loop of lines 1–2 sets the initial weights w_i(1) to 1, capturing the idea that with no knowledge, you trust each expert equally. Each iteration of the main for loop of lines 3–17 does the following for an event t = 1, 2, …, T. Each expert E_i makes a prediction ξ_i(t) for event t in line 4. Lines 5–8 compute upweight(t), the sum of the weights of the experts who predict 1 for event t, and downweight(t), the sum of the weights of the experts who predict 0 for the event. Lines 9–11 decide the algorithm's prediction p(t) for event t based on whichever weighted sum is larger (breaking ties in favor of predicting 1). The outcome of event t is revealed in line 12.


Finally, lines 14–17 decrease the weights of the experts who made an

incorrect prediction for event t by multiplying their weights by 1 – γ, leaving alone the weights of the experts who correctly predicted the

event’s outcome. Thus, the fewer mistakes each expert makes, the higher

that expert’s weight.

The WEIGHTED-MAJORITY procedure doesn’t do much worse

than any expert. In particular, it doesn’t do much worse than the best

expert. To quantify this claim, let m(t) be the number of mistakes made by the procedure through event t, and let m_i(t) be the number of mistakes made by expert E_i through event t. The following theorem is the key.

WEIGHTED-MAJORITY(E, T, n, γ)
 1  for i = 1 to n
 2      w_i(1) = 1                          // trust each expert equally
 3  for t = 1 to T
 4      each expert E_i ∈ E makes a prediction ξ_i(t) ∈ {0, 1}
 5      U = {E_i ∈ E : ξ_i(t) = 1}          // experts who predicted 1
 6      upweight(t) = Σ_{E_i ∈ U} w_i(t)    // sum of weights of experts who predicted 1
 7      D = {E_i ∈ E : ξ_i(t) = 0}          // experts who predicted 0
 8      downweight(t) = Σ_{E_i ∈ D} w_i(t)  // sum of weights of experts who predicted 0
 9      if upweight(t) ≥ downweight(t)
10          p(t) = 1                        // algorithm predicts 1
11      else p(t) = 0                       // algorithm predicts 0
12      outcome o(t) is revealed
13      // If p(t) ≠ o(t), the algorithm made a mistake.
14      for i = 1 to n
15          if ξ_i(t) ≠ o(t)                // if expert E_i made a mistake …
16              w_i(t+1) = (1 − γ) · w_i(t) // … then decrease that expert's weight
17          else w_i(t+1) = w_i(t)
18  return p(T)


Theorem 33.4

When running WEIGHTED-MAJORITY, we have, for every expert E_i and every event T′ ≤ T,

    m(T′) ≤ 2(1 + γ) m_i(T′) + (2 ln n)/γ .

Proof Every time an expert E_i makes a mistake, its weight, which is initially 1, is multiplied by 1 − γ, and so we have

    w_i(t+1) = (1 − γ)^{m_i(t)}      (33.6)

for t = 1, 2, …, T.

We use a potential function

    W(t) = Σ_{i=1}^{n} w_i(t) ,

summing the weights of all n experts going into iteration t of the for loop of lines 3–17. Initially, we have W(1) = n, since all n weights start out with the value 1. Because each expert belongs to either the set U or the set D (defined in lines 5 and 7 of WEIGHTED-MAJORITY), we always have W(t) = upweight(t) + downweight(t) after each execution of line 8.

Consider an iteration t in which the algorithm makes a mistake in its

prediction, which means that either the algorithm predicts 1 and the

outcome is 0 or the algorithm predicts 0 and the outcome is 1. Without

loss of generality, assume that the algorithm predicts 1 and the outcome

is 0. The algorithm predicted 1 because upweight(t) ≥ downweight(t) in line 9, which implies that

    upweight(t) ≥ W(t)/2 .

Each expert in U then has its weight multiplied by 1 − γ, and each expert in D has its weight unchanged. Thus, we have

    W(t+1) = (1 − γ)·upweight(t) + downweight(t)
           = W(t) − γ·upweight(t)
           ≤ W(t) − γ·W(t)/2
           = (1 − γ/2)·W(t) .

Therefore, for every iteration t in which the algorithm makes a mistake, we have

    W(t+1) ≤ (1 − γ/2)·W(t) .      (33.8)

In an iteration where the algorithm does not make a mistake, some of the weights decrease and some remain unchanged, so that we have

    W(t+1) ≤ W(t) .      (33.9)

Since there are m(T′) mistakes made through iteration T′, and W(1) = n, we can repeatedly apply inequality (33.8) to iterations where the algorithm makes a mistake and inequality (33.9) to iterations where the algorithm does not make a mistake, obtaining

    W(T′+1) ≤ n (1 − γ/2)^{m(T′)} .      (33.10)

Because the function W is the sum of the weights and all weights are positive, its value exceeds any single weight. Therefore, using equation (33.6) we have, for any expert E_i and for any iteration T′ ≤ T,

    (1 − γ)^{m_i(T′)} = w_i(T′+1) ≤ W(T′+1) .      (33.11)

Combining inequalities (33.10) and (33.11) gives

    (1 − γ)^{m_i(T′)} ≤ n (1 − γ/2)^{m(T′)} .

Taking the natural logarithm of both sides yields

    m_i(T′) ln(1 − γ) ≤ ln n + m(T′) ln(1 − γ/2) .      (33.12)

We now use the Taylor series expansion to derive upper and lower bounds on the logarithmic factors in inequality (33.12). The Taylor series for ln(1 + x) is given in equation (3.22) on page 67. Substituting −x for x, we have that for 0 < x ≤ 1/2,

    ln(1 − x) = −x − x²/2 − x³/3 − x⁴/4 − ⋯ .      (33.13)

Since each term on the right-hand side is negative, we can drop all terms except the first and obtain an upper bound of ln(1 − x) ≤ −x. Since 0 < γ ≤ 1/2, we have

    ln(1 − γ/2) ≤ −γ/2 .      (33.14)

For the lower bound, Exercise 33.2-2 asks you to show that ln(1 − x) ≥ −x − x² when 0 < x ≤ 1/2, so that

    ln(1 − γ) ≥ −γ − γ² .      (33.15)

Thus, we have

    −(γ + γ²) m_i(T′) ≤ m_i(T′) ln(1 − γ) ≤ ln n + m(T′) ln(1 − γ/2) ≤ ln n − (γ/2) m(T′) ,

so that

    −(γ + γ²) m_i(T′) ≤ ln n − (γ/2) m(T′) .      (33.16)

Subtracting ln n from both sides of inequality (33.16) and then multiplying both sides by −2/γ yields

    m(T′) ≤ 2(1 + γ) m_i(T′) + (2 ln n)/γ ,

thus proving the theorem.

Theorem 33.4 applies to any expert and any event T′ ≤ T. In particular, we can compare against the best expert after all events have

occurred, producing the following corollary.

Corollary 33.5

At the end of procedure WEIGHTED-MAJORITY, we have

    m ≤ 2(1 + γ) m* + (2 ln n)/γ .      (33.17)

Let's explore this bound. Assuming that m* ≥ 4 ln n, we can choose γ = √((ln n)/m*) (which is then at most 1/2) and plug into inequality (33.17) to obtain

    m ≤ 2m* + 4√(m* ln n) ,

and so the number of errors is at most twice the number of errors made by the best expert plus a term that is often slower growing than m*.

Exercise 33.2-4 shows that you can decrease the bound on the number

of errors by a factor of 2 by using randomization, which leads to much

stronger bounds. In particular, the upper bound on regret ( mm*) is

reduced from (1 + 2 γ) m* + (2 ln n)/ γ to an expected value of ϵm* + (ln n)/ ϵ, where both γ and ϵ are at most 1/2. Numerically, we can see that if γ = 1/2, WEIGHTED-MAJORITY makes at most 3 times the number

of errors as the best expert, plus 4 ln n errors. As another example, suppose that T = 1000 predictions are being made by n = 20 experts, and the best expert is correct 95% of the time, making 50 errors. Then

WEIGHTED-MAJORITY makes at most 100(1 + γ) + (2 ln 20)/γ errors.

By choosing γ = 1/4, WEIGHTED-MAJORITY makes at most 149

errors, or a success rate of at least 85%.

Multiplicative weights methods typically refer to a broader class of

algorithms that includes WEIGHTED-MAJORITY. The outcomes

and predictions need not be only 0 or 1, but can be real numbers, and

there can be a loss associated with a particular outcome and prediction.

The weights can be updated by a multiplicative factor that depends on

the loss, and the algorithm can, given a set of weights, treat them as a

distribution on experts and use them to choose an expert to follow in

each event. Even in these more general settings, bounds similar to

Theorem 33.4 hold.

Exercises

33.2-1

The proof of Lemma 33.3 assumes that some expert never makes a

mistake. It is possible to generalize the algorithm and analysis to

remove this assumption. The new algorithm begins in the same way.

The set S might become empty at some point, however. If that ever

happens, reset S to contain all the experts and continue the algorithm.

Show that the number of mistakes that this algorithm makes is at most

(m* + 1)(⌈lg n⌉ + 1).

33.2-2

Show that ln(1 − x) ≥ −x − x² when 0 < x ≤ 1/2. (Hint: Start with equation (33.13), group all the terms after the first three, and use

equation (A.7) on page 1142.)

33.2-3

Consider a randomized variant of the algorithm given in the proof of

Lemma 33.3, in which some expert never makes a mistake. At each step,

choose an expert Ei uniformly at random from the set S and then make

the same prediction as E_i. Show that the expected number of mistakes made by this algorithm is at most ⌈lg n⌉.

33.2-4

Consider a randomized version of WEIGHTED-MAJORITY. The

algorithm is the same, except for the prediction step, which interprets

the weights as a probability distribution over the experts and chooses an

expert Ei according to that distribution. It then chooses its prediction to

be the same as the prediction made by expert Ei. Show that, for any 0 <

ϵ < 1/2, the expected number of mistakes made by this algorithm is at

most (1 + ϵ) m* + (ln n)/ ϵ.

33.3 Gradient descent

Suppose that you have a set {p_1, p_2, …, p_n} of points and you want to find the line that best fits these points. For any line ℓ, there is a distance d_i between each point p_i and the line. You want to find the line that minimizes some function f(d_1, …, d_n). There are many possible choices for the definition of distance and for the function f. For example, the distance can be the projection distance to the line and the function can

be the sum of the squares of the distances. This type of problem is

common in data science and machine learning—the line is the

hypothesis that best describes the data—where the particular definition

of best is determined by the definition of distance and the objective f. If

the definition of distance and the function f are linear, then we have a

linear-programming problem, as discussed in Chapter 29. Although the


linear-programming framework captures several important problems,

many other problems, including various machine-learning problems,

have objectives and constraints that are not necessarily linear. We need

frameworks and algorithms to solve such problems.

In this section, we consider the problem of optimizing a continuous

function and discuss one of the most popular methods to do so:

gradient descent. Gradient descent is a general method for finding a

local minimum of a function f : ℝⁿ → ℝ, where informally, a local minimum of a function f is a point x for which f(x) ≤ f(x′) for all x′ that are "near" x. When the function is convex, it can find a point near the global minimizer of f: an n-vector argument x = (x_1, x_2, …, x_n) such that f(x) is minimum. For the intuitive idea behind gradient descent, imagine being in a landscape of hills and valleys, and wanting to get to a

low point as quickly as possible. You survey the terrain and choose to

move in the direction that takes you downhill the fastest from your

current position. You move in that direction, but only for a short while,

because as you proceed, the terrain changes and you might need to

choose a different direction. So you stop, reevaluate the possible

directions and move another short distance in the steepest downhill

direction, which might differ from the direction of your previous

movement. You continue this process until you reach a point from

which all directions lead up. Such a point is a local minimum.

In order to make this informal procedure more formal, we need to

define the gradient of a function, which in the analogy above is a

measure of the steepness of the various directions. Given a function f : ℝⁿ → ℝ, its gradient ∇f is a function ∇f : ℝⁿ → ℝⁿ comprising the n partial derivatives of f:

    ∇f = ( ∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n ) .

Analogous to the derivative of a

function of a single variable, the gradient can be viewed as a direction in

which the function value locally increases the fastest, and the rate of

that increase. This view is informal; in order to make it formal we would

have to define what local means and place certain conditions, such as

continuity or existence of derivatives, on the function. Nevertheless, this

view motivates the key step of gradient descent—move in the direction

opposite to the gradient, by a distance influenced by the magnitude of the gradient.

The general procedure of gradient descent proceeds in steps. You

start at some initial point x(0), which is an n-vector. At each step t, you compute the value of the gradient of f at point x( t), that is, (∇ f)(x( t)), which is also an n-vector. You then move in the direction opposite to the

gradient at x(t) to arrive at the next point x(t+1), which again is an n-vector. Because you moved in a direction in which f locally decreases, for a small enough step you should have that f(x(t+1)) ≤ f(x(t)). Several details are needed to turn this idea into an actual algorithm. The two main details are that you need an initial point and

that you need to decide how far to move in the direction of the negative

gradient. You also need to understand when to stop and what you can

conclude about the quality of the solution found. We will explore these

issues further in this section, for both constrained minimization, where

there are additional constraints on the points, and unconstrained

minimization, where there are none.

Unconstrained gradient descent

In order to gain intuition, let’s consider unconstrained gradient descent

in just one dimension, that is, when f is a function of a scalar x, so that f

: ℝ → ℝ. In this case, the gradient ∇f of f is just f′(x), the derivative of f with respect to x. Consider the function f shown in blue in Figure 33.3, with minimizer x* and starting point x(0). The gradient (derivative) f′(x(0)), shown in orange, has a negative slope, so that a small step from x(0) in the direction of increasing x results in a point x′ for which f(x′) < f(x(0)). Too large a step, however, results in a


Figure 33.3 A function f : ℝ → ℝ, shown in blue. Its gradient at point x(0), in orange, has a negative slope, and so a small increase in x from x(0) to x′ results in f(x′) < f(x(0)). Small increases in x from x(0) head toward a point that gives a local minimum. Too large an increase in x can end up at x″, where f(x″) > f(x(0)). Small steps starting from x(0) and going only in the direction of decreasing values of f cannot end up at the global minimizer x*.

point x″ for which f(x″) > f(x(0)), so this is a bad idea. Restricting ourselves to small steps, where each one has f(x′) < f(x), eventually results in getting close to a point that gives a local minimum. By

taking only small downhill steps, however, gradient descent has no

chance to get to the global minimizer x*, given the starting point x(0).

We draw two observations from this simple example. First, gradient

descent converges toward a local minimum, and not necessarily a global

minimum. Second, the speed at which it converges and how it behaves

are related to properties of the function, to the initial point, and to the

step size of the algorithm.

The procedure GRADIENT-DESCENT on the facing page takes as

input a function f, an initial point x(0) ∈ ℝ n, a fixed step-size multiplier γ > 0, and a number T > 0 of steps to take. Each iteration of the for loop of lines 2–4 performs a step by computing the n-dimensional gradient at

point x( t) and then moving distance γ in the opposite direction in the n-

dimensional space. The complexity of computing the gradient depends

on the function f and can sometimes be expensive. Line 3 sums the

points visited. After the loop terminates, line 6 returns x-avg, the

average of all the points visited except for the last one, x( T). It might seem more natural to return x( T), and in fact, in many circumstances,

you might prefer to have the function return x( T). For the version we

will analyze, however, we use x-avg.

GRADIENT-DESCENT(f, x(0), γ, T)
1  sum = 0                               // n-dimensional vector, initially all 0s
2  for t = 0 to T − 1
3      sum = sum + x(t)                  // add each of n dimensions into sum
4      x(t+1) = x(t) − γ · (∇f)(x(t))    // (∇f)(x(t)), x(t+1) are n-dimensional
5  x-avg = sum/T                         // divide each of n dimensions by T
6  return x-avg

Figure 33.4 depicts how gradient descent ideally runs on a convex 1-dimensional function.1 We’ll define convexity more formally below, but the figure shows that each iteration moves in the direction opposite to

the gradient, with the distance moved being proportional to the

magnitude of the gradient. As the iterations proceed, the magnitude of

the gradient decreases, and thus the distance moved along the

horizontal axis decreases. After each iteration, the distance to the

optimal point x* decreases. This ideal behavior is not guaranteed to

occur in general, but the analysis in the remainder of this section

formalizes when this behavior occurs and quantifies the number of

iterations needed. Gradient descent does not always work, however. We

have already seen that if the function is not convex, gradient descent can

converge to a local, rather than global, minimum. We have also seen

that if the step size is too large, GRADIENT-DESCENT can overshoot

the minimum and wind up farther away. (It is also possible to overshoot

the minimum and wind up closer to the optimum.)

Analysis of unconstrained gradient descent for convex functions


Our analysis of gradient descent focuses on convex functions. Inequality

(C.29) on page 1194 defines a convex function of one variable, as shown

in Figure 33.5. We can extend that definition to a function f : ℝⁿ → ℝ and say that f is convex if for all x, y ∈ ℝⁿ and for all 0 ≤ λ ≤ 1, we have

    f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) .      (33.18)

(Inequalities (33.18) and (C.29) are the same, except for the dimensions of x and y.) We also assume that our convex functions are closed2 and differentiable.

Figure 33.4 An example of running gradient descent on a convex function f : ℝ → ℝ, shown in blue. Beginning at point x(0), each iteration moves in the direction opposite to the gradient, and the distance moved is proportional to the magnitude of the gradient. Orange lines represent the negative of the gradient at each point, scaled by the step size γ. As the iterations proceed, the magnitude of the gradient decreases, and the distance moved decreases correspondingly. After each iteration, the distance to the optimal point x* decreases.


Figure 33.5 A convex function f : ℝ → ℝ, shown in blue, with local and global minimizer x*.

Because f is convex, f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) for any two values x and y and all 0 ≤ λ ≤ 1, shown for a particular value of λ. Here, the orange line segment represents all values λf(x) + (1 − λ)f(y) for 0 ≤ λ ≤ 1, and it is above the blue line.

A convex function has the property that any local minimum is also a

global minimum. To verify this property, consider inequality (33.18),

and suppose for the purpose of contradiction that x is a local minimum

but not a global minimum and y ≠ x is a global minimum, so f(y) < f(x).

Then we have

    f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)   (by inequality (33.18))
                     < λf(x) + (1 − λ)f(x)   (because f(y) < f(x) and λ < 1)
                     = f(x) .

Thus, letting λ approach 1, we see that there is another point near x, say

x′, such that f(x′) < f(x), so x is not a local minimum.

Convex functions have several useful properties. The first property,

whose proof we leave as Exercise 33.3-1, says that a convex function

always lies above its tangent hyperplane. In the context of gradient

descent, angle brackets denote the inner product defined on page 1219, rather than a sequence.

Lemma 33.6

For any convex differentiable function f : ℝⁿ → ℝ and for all x, y ∈ ℝⁿ, we have f(x) ≤ f(y) + 〈(∇f)(x), x − y〉.

The second property, which Exercise 33.3-2 asks you to prove, is a

repeated application of the definition of convexity in inequality (33.18).


Lemma 33.7

For any convex function f : ℝⁿ → ℝ, for any integer T ≥ 1, and for all x(0), …, x(T−1) ∈ ℝⁿ, we have

    f( (1/T) Σ_{t=0}^{T−1} x(t) ) ≤ (1/T) Σ_{t=0}^{T−1} f(x(t)) .      (33.19)

The left-hand side of inequality (33.19) is the value of f at the vector

x-avg that GRADIENT-DESCENT returns.

We now proceed to analyze GRADIENT-DESCENT. It might not

return the exact global minimizer x*. We use an error bound ϵ, and we

want to choose T so that f(x-avg) – f(x*) ≤ ϵ at termination. The value of ϵ depends on the number T of iterations and two additional values.

First, since you expect it to be better to start close to the global minimizer, ϵ is a function of

    R = ‖x(0) − x*‖ ,      (33.20)

the euclidean norm (or distance, defined on page 1219) of the difference between x(0) and x*. The error bound ϵ is also a function of a quantity we call L, which is an upper bound on the magnitude ‖(∇f)(x)‖ of the gradient, so that

    ‖(∇f)(x)‖ ≤ L ,      (33.21)

where x ranges over all the points x(0), …, x(T−1) whose gradients are

computed by GRADIENT-DESCENT. Of course, we don’t know the

values of L and R, but for now let’s assume that we do. We’ll discuss later how to remove these assumptions. The analysis of GRADIENT-DESCENT is summarized in the following theorem.

Theorem 33.8

Let x* ∈ ℝⁿ be the minimizer of a convex function f, and suppose that an execution of GRADIENT-DESCENT(f, x(0), γ, T) returns x-avg, where

    γ = R/(L√T)

and R and L are defined in equations (33.20) and (33.21). Let ϵ = RL/√T. Then we have f(x-avg) − f(x*) ≤ ϵ.

We now prove this theorem. We do not give an absolute bound on

how much progress each iteration makes. Instead, we use a potential

function, as in Section 16.3. Here, we define a potential Φ(t) after computing x(t), such that Φ(t) ≥ 0 for t = 0, …, T. We define the amortized progress in the iteration that computes x(t) as

    p(t) = (f(x(t)) − f(x*)) + (Φ(t+1) − Φ(t)) .      (33.22)

Along with including the change in potential (Φ(t+1) − Φ(t)), equation (33.22) also subtracts the minimum value f(x*) because ultimately, you care not about the values f(x(t)) but about how close they are to f(x*). Suppose that we can show that p(t) ≤ B for some value B and t = 0, …, T − 1. Then we can substitute for p(t) using equation (33.22), giving

    (f(x(t)) − f(x*)) + (Φ(t+1) − Φ(t)) ≤ B .      (33.23)

Summing inequality (33.23) over t = 0, …, T − 1 yields

    Σ_{t=0}^{T−1} ( (f(x(t)) − f(x*)) + (Φ(t+1) − Φ(t)) ) ≤ BT .

Observing that we have a telescoping series and regrouping terms, we have that

    Σ_{t=0}^{T−1} f(x(t)) ≤ BT + T·f(x*) + Φ(0) − Φ(T) .

Dividing by T and dropping the nonnegative term Φ(T) gives

    (1/T) Σ_{t=0}^{T−1} f(x(t)) ≤ B + f(x*) + Φ(0)/T ,

and thus, by Lemma 33.7, we have

    f(x-avg) − f(x*) ≤ B + Φ(0)/T .      (33.25)


In other words, if we can show that p( t) ≤ B for some value B and choose a potential function where Φ(0) is not too large, then inequality

(33.25) tells us how close the function value f(x-avg) is to the function

value f(x*) after T iterations. That is, we can set the error bound ϵ to B

+ Φ(0)/ T.

In order to bound the amortized progress, we need to come up with

a concrete potential function. Define the potential function Φ(t) by

    Φ(t) = ‖x(t) − x*‖² / (2γ) ,      (33.26)

that is, the potential function is proportional to the square of the

distance between the current point and the minimizer x*. With this

potential function in hand, the next lemma provides a bound on the

amortized progress made in any iteration of GRADIENT-DESCENT.

Lemma 33.9

Let x* ∈ ℝⁿ be the minimizer of a convex function f, and consider an execution of GRADIENT-DESCENT(f, x(0), γ, T). Then for each point x(t) computed by the procedure, we have that

    p(t) ≤ γL²/2 .

Proof We first bound the potential change Φ(t+1) − Φ(t). Using the definition of Φ(t) from equation (33.26), we have

    Φ(t+1) − Φ(t) = ( ‖x(t+1) − x*‖² − ‖x(t) − x*‖² ) / (2γ) .      (33.27)

From line 4 in GRADIENT-DESCENT, we know that

    x(t+1) − x(t) = −γ · (∇f)(x(t)) ,      (33.28)

and so we would like to rewrite equation (33.27) to have x(t+1) − x(t) terms. As Exercise 33.3-3 asks you to prove, for any two vectors a, b ∈ ℝⁿ, we have

    ‖a + b‖² = ‖a‖² + 2〈a, b〉 + ‖b‖² .      (33.29)

Letting a = x(t) − x* and b = x(t+1) − x(t), we can write the right-hand side of equation (33.27) as (2〈a, b〉 + ‖b‖²)/(2γ). Then, using equation (33.28) to substitute for b, we can express the potential change as

    Φ(t+1) − Φ(t) = −〈(∇f)(x(t)), x(t) − x*〉 + (γ/2)·‖(∇f)(x(t))‖² ,      (33.30)

and thus, by the definition of L in inequality (33.21), we have

    Φ(t+1) − Φ(t) ≤ −〈(∇f)(x(t)), x(t) − x*〉 + γL²/2 .      (33.31)

We can now proceed to bound p(t). By the bound on the potential change from inequality (33.31) and the definition of p(t) in equation (33.22), we have

    p(t) = (f(x(t)) − f(x*)) + (Φ(t+1) − Φ(t))
         ≤ (f(x(t)) − f(x*)) − 〈(∇f)(x(t)), x(t) − x*〉 + γL²/2
         ≤ γL²/2 ,

where the last step follows from Lemma 33.6 with x = x(t) and y = x*.


Having bounded the amortized progress in one step, we now analyze

the entire GRADIENT-DESCENT procedure, completing the proof of

Theorem 33.8.

Proof of Theorem 33.8 Inequality (33.25) tells us that if we have an upper bound of B for p(t), then we also have the bound f(x-avg) − f(x*) ≤ B + Φ(0)/T. By equations (33.20) and (33.26), we have that Φ(0) = R²/(2γ). Lemma 33.9 gives us the upper bound of B = γL²/2, and so we have

    f(x-avg) − f(x*) ≤ γL²/2 + R²/(2γT) .

Our choice of γ = R/(L√T) in the statement of Theorem 33.8 balances the two terms, and we obtain

    f(x-avg) − f(x*) ≤ RL/(2√T) + RL/(2√T) = RL/√T .

Since we chose ϵ = RL/√T in the theorem statement, the proof is complete.

Continuing under the assumption that we know R (from equation

(33.20)) and L (from inequality (33.21)), we can think of the analysis in

a slightly different way. We can presume that we have a target accuracy

ϵ and then compute the number of iterations needed. That is, we can

solve

    RL/√T = ϵ

for T, obtaining T = R²L²/ϵ². The number of iterations thus depends on the square of R and L and, most importantly, on 1/ϵ². (The definition of L from inequality (33.21) depends on T, but we may know an upper bound on L that doesn't depend on the particular value of T.) Thus, if you want to halve your error bound, you need to run four times as many iterations.

It is quite possible that we don’t really know R and L, since you’d

need to know x* in order to know R (since R = ∥x(0) – x*∥), and you

might not have an explicit upper bound on the gradient, which would

provide L. You can, however, interpret the analysis of gradient descent

as a proof that there is some step size for which the procedure makes

progress toward the minimum. You can then compute a step size for

which f(x( t)) – f(x( t+1)) is large enough. In fact, not having a fixed step size multiplier can actually help in practice, as you are free to use any

step size s that achieves sufficient decrease in the value of f. You can search for a step size that achieves a large decrease via a binary-search-like routine, which is often called line search. For a given function f and step size s, define the function g(x(t), s) = f(x(t) − s·(∇f)(x(t))). Start with a small step size s for which g(x(t), s) ≤ f(x(t)). Then repeatedly double s until g(x(t), 2s) ≥ g(x(t), s), and then perform a binary search in the interval [s, 2s]. This procedure can produce a step size that achieves a significant decrease in the objective function. In other circumstances,

however, you may know good upper bounds on R and L, typically from

problem-specific information, which can suffice.

The dominant computational step in each iteration of the for loop of

lines 2–4 is computing the gradient. The complexity of computing and

evaluating a gradient varies widely, depending on the application at

hand. We’ll discuss several applications later.

Constrained gradient descent

We can adapt gradient descent for constrained minimization to

minimize a closed convex function f(x), subject to the additional

requirement that x ∈ K, where K is a closed convex body. A body K ⊆ ℝⁿ is convex if for all x, y ∈ K, the convex combination λx + (1 − λ)y ∈ K

for all 0 ≤ λ ≤ 1. A closed convex body contains its limit points.

Somewhat surprisingly, restricting to the constrained problem does not

significantly increase the number of iterations of gradient descent. The

idea is that you run the same algorithm, but in each iteration, check whether the current point x( t) is still within the convex body K. If it is not, just move to the closest point in K. Moving to the closest point is

known as projection. We formally define the projection ∏_K(x) of a point x in n dimensions onto a convex body K as the point y ∈ K such that ‖x − y‖ = min {‖x − z‖ : z ∈ K}. If we have x ∈ K, then ∏_K(x) = x.

This one change yields the procedure GRADIENT-DESCENT-

CONSTRAINED, in which line 4 of GRADIENT-DESCENT is

replaced by two lines. It assumes that x(0) ∈ K. Line 4 of GRADIENT-

DESCENT-CONSTRAINED moves in the direction of the negative

gradient, and line 5 projects back onto K. The lemma that follows helps

to show that when x* ∈ K, if the projection step in line 5 moves from a

point outside of K to a point in K, it cannot be moving away from x*.

GRADIENT-DESCENT-CONSTRAINED(f, x(0), γ, T, K)
1  sum = 0                               // n-dimensional vector, initially all 0s
2  for t = 0 to T − 1
3      sum = sum + x(t)                  // add each of n dimensions into sum
4      x′(t+1) = x(t) − γ · (∇f)(x(t))   // (∇f)(x(t)), x′(t+1) are n-dimensional
5      x(t+1) = ∏_K(x′(t+1))             // project onto K
6  x-avg = sum/T                         // divide each of n dimensions by T
7  return x-avg

Lemma 33.10

Consider a convex body K ⊆ ℝⁿ and points a ∈ K and b′ ∈ ℝⁿ. Let b = ∏_K(b′). Then ‖b − a‖² ≤ ‖b′ − a‖².

Proof If b′ ∈ K, then b = b′ and the claim is true. Otherwise, b′ ≠ b, and as Figure 33.6 shows, we can extend the line segment between b and b′

to a line ℓ. Let c be the projection of a onto ℓ. Point c may or may not be

in K, and if a is on the boundary of K, then c could coincide with b. If c


coincides with b (part (c) of the figure), then abb′ is a right triangle, and so ‖b − a‖² ≤ ‖b′ − a‖². If c does not coincide with b (parts (a) and (b) of the figure), then because of convexity, the angle ∠abb′ must be obtuse. Because angle ∠abb′ is obtuse, b lies between c and b′ on ℓ. Furthermore, because c is the projection of a onto line ℓ, acb and acb′ must be right triangles. By the Pythagorean theorem, we have that ‖b′ − a‖² = ‖a − c‖² + ‖c − b′‖² and ‖b − a‖² = ‖a − c‖² + ‖c − b‖². Subtracting these two equations gives ‖b′ − a‖² − ‖b − a‖² = ‖c − b′‖² − ‖c − b‖². Because b is between c and b′, we must have ‖c − b′‖² ≥ ‖c − b‖², and thus ‖b′ − a‖² − ‖b − a‖² ≥ 0. The lemma follows.

Figure 33.6 Projecting a point b′ outside the convex body K to the closest point b = ∏ K(b′) in K.

Line ℓ is the line containing b and b′, and point c is the projection of a onto ℓ. (a) When c is in K.

(b) When c is not in K. (c) When a is on the boundary of K and c coincides with b.

We can now repeat the entire proof for the unconstrained case and

obtain the same bounds. Lemma 33.10 with a = x*, b = x(t+1), and b′ = x′(t+1) tells us that ‖x(t+1) − x*‖² ≤ ‖x′(t+1) − x*‖². We can therefore derive an upper bound that matches inequality (33.31). We continue to define Φ(t) as in equation (33.26), but noting that x(t+1), computed in line 5 of GRADIENT-DESCENT-CONSTRAINED, is the projected point rather than the point x′(t+1) computed in line 4:

    Φ(t+1) − Φ(t) = ( ‖x(t+1) − x*‖² − ‖x(t) − x*‖² ) / (2γ)
                  ≤ ( ‖x′(t+1) − x*‖² − ‖x(t) − x*‖² ) / (2γ)
                  ≤ −〈(∇f)(x(t)), x(t) − x*〉 + γL²/2 .

With the same upper bound on the change in the potential function as

in equation (33.30), the entire proof of Lemma 33.9 can proceed as

before. We can therefore conclude that the procedure GRADIENT-

DESCENT-CONSTRAINED has the same asymptotic complexity as

GRADIENT-DESCENT. We summarize this result in the following

theorem.

Theorem 33.11

Let K ⊆ ℝⁿ be a convex body, x* ∈ ℝⁿ be the minimizer of a convex function f over K, and γ = R/(L√T), where R and L are defined in equations (33.20) and (33.21). Suppose that the vector x-avg is returned by an execution of GRADIENT-DESCENT-CONSTRAINED(f, x(0), γ, T, K). Let ϵ = RL/√T. Then we have f(x-avg) − f(x*) ≤ ϵ.

Applications of gradient descent

Gradient descent has many applications to minimizing functions and is

widely used in optimization and machine learning. Here we sketch how

it can be used to solve linear systems. Then we discuss an application to

machine learning: prediction using linear regression.

In Chapter 28, we saw how to use Gaussian elimination to solve a system of linear equations Ax = b, thereby computing x = A⁻¹b. If A is an n × n matrix and b is a length-n vector, then the running time of


Gaussian elimination is Θ(n³), which for large matrices might be

prohibitively expensive. If an approximate solution is acceptable,

however, you can use gradient descent.

First, let’s see how to use gradient descent as a roundabout—and

admittedly inefficient—way to solve for x in the scalar equation ax = b, where a, x, b ∈ ℝ. This equation is equivalent to ax − b = 0. If ax − b is the derivative of a convex function f(x), then ax − b = 0 for the value of x that minimizes f(x). Given f(x), gradient descent can then determine this minimizer. Of course, f(x) is just the integral of ax − b, that is, f(x) = ax²/2 − bx, which is convex if a ≥ 0. Therefore, one way to solve ax = b for a ≥ 0 is to find the minimizer for f(x) = ax²/2 − bx via gradient descent.

We now generalize this idea to higher dimensions, where using

gradient descent may actually lead to a faster algorithm. One n-

dimensional analog is the function f(x) = xᵀAx/2 − bᵀx, where A is an n × n symmetric matrix. The gradient of f with respect to x is the function Ax − b. To find the value of x that minimizes f, we set the gradient of f to 0 and solve for x. Solving Ax − b = 0 for x, we obtain x = A⁻¹b. Thus, minimizing f(x) is equivalent to solving Ax = b. If f(x) is convex, then gradient descent can approximately compute this minimum.

A 1-dimensional function is convex when its second derivative is nonnegative. The equivalent definition for a multidimensional function is that it is convex when its Hessian matrix is positive-semidefinite (see page 1222 for a definition), where the Hessian matrix (∇²f)(x) of a function f(x) is the matrix in which entry (i, j) is the partial derivative of f with respect to x_i and x_j:

    (∇²f)(x)_{ij} = ∂²f / ∂x_i ∂x_j .

Analogous to the 1-dimensional case, the Hessian of f is just A, and so if A is a positive-semidefinite matrix, then we can use gradient descent to


find a point x where Ax ≈ b. If R and L are not too large, then this method is faster than using Gaussian elimination.

Gradient descent in machine learning

As a concrete example of supervised learning for prediction, suppose

that you want to predict whether a patient will develop heart disease.

For each of m patients, you have n different attributes. For example, you might have n = 4 and the four pieces of data are age, height, blood pressure, and number of close family members with heart disease.

Denote the data for patient i as a vector x(i) ∈ ℝⁿ, with x_j(i) giving the jth entry in vector x(i). The label of patient i is denoted by a scalar y(i)

∈ ℝ, signifying the severity of the patient’s heart disease. The

hypothesis should capture a relationship between the x( i) values and

y( i). For this example, we make the modeling assumption that the relationship is linear, and therefore the goal is to compute the “best”

linear relationship between the x( i) values and y( i): a linear function f : ℝ n → ℝ such that f(x( i)) ≈ y( i) for each patient i. Of course, no such function may exist, but you would like one that comes as close as

possible. A linear function f can be defined by a vector of weights w = (w_0, w_1, …, w_n), with

    f(x) = w_0 + Σ_{j=1}^{n} w_j x_j .      (33.32)

When evaluating a machine-learning model, you need to measure

how close each value f(x( i)) is to its corresponding label y( i). In this example, we define the error e( i) ∈ ℝ associated with patient i as e( i) =

f(x(i)) − y(i). The objective function we choose is to minimize the sum of squares of the errors, which is

    Σ_{i=1}^{m} (e(i))² = Σ_{i=1}^{m} (f(x(i)) − y(i))² .      (33.33)

The objective function is typically called the loss function, and the least-squares error given by equation (33.33) is just one example of

many possible loss functions. The goal is then, given the x( i) and y( i) values, to compute the weights w 0, w 1, …, wn so as to minimize the loss function in equation (33.33). The variables here are the weights w 0, w 1,

…, wn and not the x( i) or y( i) values.

This particular objective is sometimes known as a least-squares fit,

and the problem of finding a linear function to fit data and minimize the

least-squares error is called linear regression. Finding a least-squares fit

is also addressed in Section 28.3.

When the function f is linear, the loss function defined in equation

(33.33) is convex, because it is the sum of squares of linear functions,

which are themselves convex. Therefore, we can apply gradient descent

to compute a set of weights to approximately minimize the least-squares

error. The concrete goal of learning is to be able to make predictions on

new data. Informally, if the features are all reported in the same units

and are from the same range (perhaps from being normalized), then the

weights tend to have a natural interpretation because the features of the

data that are better predictors of the label have a larger associated

weight. For example, you would expect that, after normalization, the

weight associated with the number of family members with heart

disease would be larger than the weight associated with height.

The computed weights form a model of the data. Once you have a

model, you can make predictions, so that given new data, you can

predict its label. In our example, given a new patient x′ who is not part

of the original training data set, you would still hope to predict the

chance that the new patient develops heart disease. You can do so by

computing the label f(x′), incorporating the weights computed by

gradient descent.

For this linear-regression problem, the objective is to minimize the expression in equation (33.33), which is a quadratic in each of the n+1

weights wj. Thus, entry j in the gradient is linear in wj. Exercise 33.3-5

asks you to explicitly compute the gradient and see that it can be

computed in O( nm) time, which is linear in the input size. Compared with the exact method of solving equation (33.33) in Chapter 28, which needs to invert a matrix, gradient descent is typically much faster.

Section 33.1 briefly discussed regularization—the idea that a complicated hypothesis should be penalized in order to avoid overfitting

the training data. Regularization often involves adding a term to the

objective function, but it can also be achieved by adding a constraint.

One way to regularize this example would be to explicitly limit the norm

of the weights, adding a constraint that ∥w∥ ≤ B for some bound B > 0.

(Recall again that the components of the vector w are the variables in

the present application.) Adding this constraint controls the complexity

of the model, as the number of values wj that can have large absolute

value is now limited.

In order to run GRADIENT-DESCENT-CONSTRAINED for any

problem, you need to implement the projection step, as well as to

compute bounds on R and L. We conclude this section by describing these calculations for gradient descent with the constraint ∥w∥ ≤ B.

First, consider the projection step in line 5. Suppose that the update in

line 4 results in a vector w′. The projection is implemented by

computing ∏_K(w′), where K is defined by ‖w‖ ≤ B. If ‖w′‖ ≤ B, then w′ is already in K and the projection leaves it unchanged. Otherwise, this particular projection can be accomplished by simply scaling w′, since we know that the closest point in K to w′ must be the point along the vector w′ whose norm is exactly B. The amount z by which we need to scale w′ to hit the boundary of K is the solution to the equation z‖w′‖ = B, which is solved by z = B/‖w′‖. Hence line 5 is implemented by computing w = w′ · B/‖w′‖. Because we always have ‖w‖ ≤ B, Exercise 33.3-6 asks you to

show that the upper bound on the magnitude L of the gradient is O( B).

We also get a bound on R, as follows. By the constraint ∥w∥ ≤ B, we know that both ∥w(0)∥ ≤ B and ∥w*∥ ≤ B, and thus ∥w(0) – w*∥ ≤ 2 B.

Using the definition of R in equation (33.20), we have R = O( B). The

bound ϵ = RL/√T on the accuracy of the solution after T iterations in Theorem 33.11 becomes O(B²/√T).

Exercises

33.3-1

Prove Lemma 33.6. Start from the definition of a convex function given

in equation (33.18). ( Hint: You can prove the statement when n = 1 first.

The proof for general values of n is similar.)

33.3-2

Prove Lemma 33.7.

33.3-3

Prove equation (33.29). ( Hint: The proof for n = 1 dimension is straightforward. The proof for general values of n dimensions follows

along similar lines.)

33.3-4

Show that the function f in equation (33.32) is a convex function of the

variables w 0, w 1, …, wn.

33.3-5

Compute the gradient of expression (33.33) and explain how to evaluate

the gradient in O( nm) time.

33.3-6

Consider the function f defined in equation (33.32), and suppose that you have a bound ∥w∥ ≤ B, as is considered in the discussion on

regularization. Show that L = O( B) in this case.

33.3-7

Equation (33.2) on page 1009 gives a function that, when minimized,

gives an optimal solution to the k-means problem. Explain how to use

gradient descent to solve the k-means problem.

Problems


33-1 Newton’s method

Gradient descent iteratively moves closer to a desired value (the

minimum) of a function. Another algorithm in this spirit is known as

Newton’s method, which is an iterative algorithm that finds the root of a

function. Here, we consider Newton’s method which, given a function f :

ℝ → ℝ, finds a value x* such that f( x* ) = 0. The algorithm moves through a series of points x(0), x(1), …. If the algorithm is currently at a point x( t), then to find point x( t+1), it first takes the equation of the line tangent to the curve at x = x( t),

y = f′(x(t)) · (x − x(t)) + f(x(t)) .

It then uses the x-intercept of this line as the next point x( t+1).

a. Show that the algorithm described above can be summarized by the

update rule

    x(t+1) = x(t) − f(x(t)) / f′(x(t)) .

We restrict our attention to some domain I and assume that f′(x) ≠ 0 for all x ∈ I and that f″(x) is continuous. We also assume that the starting point x(0) is sufficiently close to x*, where "sufficiently close" means that we can use only the first two terms of the Taylor expansion of f(x*) about x(0), namely

    0 = f(x*) = f(x(0)) + f′(x(0)) (x* − x(0)) + (f″(γ(0))/2) (x* − x(0))² ,      (33.34)

where γ(0) is some value between x(0) and x*. If the approximation in equation (33.34) holds for x(0), it also holds for any point closer to x*.

b. Assume that the function f has exactly one point x* for which f( x*) =

0. Let ϵ(t) = |x(t) − x*|. Using the Taylor expansion in equation (33.34), show that

    ϵ(t+1) = ( |f″(γ(t))| / (2 |f′(x(t))|) ) · (ϵ(t))² ,

where γ(t) is some value between x(t) and x*.

c. If ϵ(t+1) ≤ c · (ϵ(t))² for some constant c and ϵ(0) < 1, then we say that the function f has quadratic convergence, since the error decreases quadratically.

Assuming that f has quadratic convergence, how many iterations are

needed to find a root of f( x) to an accuracy of δ? Your answer should include δ.

d. Suppose you wish to find a root of the function f( x) = ( x – 3)2, which is also the minimizer, and you start at x(0) = 3.5. Compare the

number of iterations needed by gradient descent to find the minimizer

and Newton’s method to find the root.

33-2 Hedge

Another variant in the multiplicative-weights framework is known as

HEDGE. It differs from WEIGHTED-MAJORITY in two ways. First, HEDGE makes the prediction randomly—in iteration t, it assigns a probability w_i(t)/W(t) to expert E_i, where W(t) = Σ_{j=1}^{n} w_j(t). It then chooses an expert E_i′ according to this probability distribution and predicts according to E_i′. Second, the update rule is different. If an expert makes a mistake, line 16 updates that expert's weight by the rule w_i(t+1) = w_i(t) · e^{−ϵ}, for some 0 < ϵ < 1. Show that the expected number of

mistakes made by HEDGE, running for T rounds, is at most m* + (ln

n)/ ϵ + ϵT.

33-3 Nonoptimality of Lloyd’s procedure in one dimension

Give an example to show that even in one dimension, Lloyd’s procedure

for finding clusters does not always return an optimum result. That is,

Lloyd’s procedure may terminate and return as a result a set C of clusters that does not minimize f( S, C), even when S is a set of points on a line.

33-4 Stochastic gradient descent

Consider the problem described in Section 33.3 of fitting a line f( x) = ax

+ b to a given set of point/value pairs S = {( x 1, y 1), …, ( xT, yT)} by optimizing the choice of the parameters a and b using gradient descent

to find a best least-squares fit. Here we consider the case where x is a

real-valued variable, rather than a vector.

Suppose that you are not given the point/value pairs in S all at once,

but only one at a time in an online manner. Furthermore, the points are

given in random order. That is, you know that there are T points, but in

iteration t you are given only ( xi, yi) where i is independently and randomly chosen from {1, …, T}.

You can use gradient descent to compute an estimate to the function.

As each point ( xi, yi) is considered, you can update the current values of a and b by taking the derivative with respect to a and b of the term of the objective function depending on ( xi, yi). Doing so gives you a stochastic estimate of the gradient, and you can then take a small step

in the opposite direction.

Give pseudocode to implement this variant of gradient descent. What

would the expected value of the error be as a function of T, L, and R?

( Hint: Replicate the analysis of GRADIENT-DESCENT in Section

33.3 for this variant.)

This procedure and its variants are known as stochastic gradient

descent.

Chapter notes

For a general introduction to artificial intelligence, we recommend

Russell and Norvig [391]. For a general introduction to machine learning, we recommend Murphy [340].

Lloyd’s procedure for the k-means problem was first proposed by Lloyd [304] and also later by Forgy [151]. It is sometimes called “Lloyd’s algorithm” or the “Lloyd-Forgy algorithm.” Although Mahajan et al.

[310] showed that finding an optimal clustering is NP-hard, even in the plane, Kanungo et al. [241] have shown that there is an approximation algorithm for the k-means problem with approximation ratio 9 + ϵ, for

any ϵ > 0.

The multiplicative-weights method is surveyed by Arora, Hazan, and

Kale [25]. The main idea of updating weights based on feedback has been rediscovered many times. One early use is in game theory, where

Brown defined “Fictitious Play” [74] and conjectured its convergence to the value of a zero-sum game. The convergence properties were

established by Robinson [382].

In machine learning, the first use of multiplicative weights was by

Littlestone in the Winnow algorithm [300], which was later extended by Littlestone and Warmuth to the weighted-majority algorithm described

in Section 33.2 [301]. This work is closely connected to the boosting algorithm, originally due to Freund and Schapire [159]. The

optimization algorithms, including the perceptron algorithm [328] and algorithms for optimization problems such as packing linear programs

[177, 359].

The treatment of gradient descent in this chapter draws heavily on

the unpublished manuscript of Bansal and Gupta [35]. They emphasize the idea of using a potential function and using ideas from amortized

analysis to explain gradient descent. Other presentations and analyses

of gradient descent include works by Bubeck [75], Boyd and Vandenberghe [69], and Nesterov [343].

Gradient descent is known to converge faster when functions obey

stronger properties than general convexity. For example, a function f is

α-strongly convex if f(y) ≥ f(x) + 〈(∇f)(x), y − x〉 + (α/2)‖y − x‖² for all x, y ∈ ℝⁿ. In this case, GRADIENT-DESCENT can use a variable step size and return x(T). The step size at step t becomes γ_t = 1/(α(t + 1)), and the procedure returns a point such that f(x-avg) − f(x*) ≤ L²/(α(T + 1)). This convergence is better than that of Theorem 33.8 because the number of iterations needed is linear, rather than quadratic, in 1/ϵ for a desired error bound ϵ, and because the performance is independent

of the initial point.

Another case in which gradient descent can be shown to perform

better than the analysis in Section 33.3 suggests is for smooth convex functions.

We say that a function f is β-smooth if

    f(y) ≤ f(x) + 〈(∇f)(x), y − x〉 + (β/2)‖y − x‖²

for all x, y ∈ ℝⁿ. This inequality goes in the opposite direction from the one for α-strong convexity. Better bounds

on gradient descent are possible here as well.

1 Although the curve in Figure 33.4 looks concave, according to the definition of convexity that we’ll see below, the function f in the figure is convex.

2 A function f : ℝⁿ → ℝ is closed if, for each α ∈ ℝ, the set {x ∈ dom(f) : f(x) ≤ α} is closed, where dom(f) is the domain of f.

34 NP-Completeness

Almost all the algorithms we have studied thus far have been

polynomial-time algorithms: on inputs of size n, their worst-case running time is O(n^k) for some constant k. You might wonder whether all problems can be solved in polynomial time. The answer is no. For

example, there are problems, such as Turing’s famous “Halting

Problem,” that cannot be solved by any computer, no matter how long

you're willing to wait for an answer.1 There are also problems that can be solved, but not in O(n^k) time for any constant k. Generally, we think of problems that are solvable by polynomial-time algorithms as being

tractable, or “easy,” and problems that require superpolynomial time as

being intractable, or “hard.”

The subject of this chapter, however, is an interesting class of

problems, called the “NP-complete” problems, whose status is

unknown. No polynomial-time algorithm has yet been discovered for an

NP-complete problem, nor has anyone yet been able to prove that no

polynomial-time algorithm can exist for any one of them. This so-called

P ≠ NP question has been one of the deepest, most perplexing open

research problems in theoretical computer science since it was first

posed in 1971.

Several NP-complete problems are particularly tantalizing because

they seem on the surface to be similar to problems that we know how to

solve in polynomial time. In each of the following pairs of problems, one

is solvable in polynomial time and the other is NP-complete, but the

difference between the problems appears to be slight:

Shortest versus longest simple paths: In Chapter 22, we saw that even with negative edge weights, we can find shortest paths from a single source in a directed graph G = ( V, E) in O( VE) time. Finding a longest simple path between two vertices is difficult, however. Merely

determining whether a graph contains a simple path with at least a

given number of edges is NP-complete.

Euler tour versus hamiltonian cycle: An Euler tour of a strongly

connected, directed graph G = ( V, E) is a cycle that traverses each edge of G exactly once, although it is allowed to visit each vertex more than

once. Problem 20-3 on page 583 asks you to show how to determine

whether a strongly connected, directed graph has an Euler tour and, if

it does, the order of the edges in the Euler tour, all in O( E) time. A

hamiltonian cycle of a directed graph G = ( V, E) is a simple cycle that contains each vertex in V. Determining whether a directed graph has a

hamiltonian cycle is NP-complete. (Later in this chapter, we’ll prove

that determining whether an undirected graph has a hamiltonian cycle

is NP-complete.)

2-CNF satisfiability versus 3-CNF satisfiability: Boolean formulas

contain binary variables whose values are 0 or 1; boolean connectives

such as ∧ (AND), ∨ (OR), and ¬ (NOT); and parentheses. A boolean

formula is satisfiable if there exists some assignment of the values 0

and 1 to its variables that causes it to evaluate to 1. We’ll define terms

more formally later in this chapter, but informally, a boolean formula

is in k-conjunctive normal form, or k-CNF if it is the AND of clauses

of ORs of exactly k variables or their negations. For example, the

boolean formula (x_1 ∨ x_2) ∧ (¬x_1 ∨ x_3) ∧ (¬x_2 ∨ ¬x_3) is in 2-CNF (with satisfying assignment x_1 = 1, x_2 = 0, and x_3 = 1). Although there is a polynomial-time algorithm to determine whether a 2-CNF formula is satisfiable, we'll see later in this chapter that determining

formula is satisfiable, we’ll see later in this chapter that determining

whether a 3-CNF formula is satisfiable is NP-complete.
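
Checking a proposed assignment against a CNF formula is the easy direction, and takes just one pass over the clauses. Here is a minimal Python sketch (the clause encoding is our own convention for this illustration: a positive integer i stands for the variable x_i and a negative integer for its negation):

    # Each clause is a list of literals: the integer i stands for x_i,
    # and -i for its negation. This encodes the 2-CNF formula above:
    # (x1 OR x2) AND (NOT x1 OR x3) AND (NOT x2 OR NOT x3).
    clauses = [[1, 2], [-1, 3], [-2, -3]]
    assignment = {1: 1, 2: 0, 3: 1}    # the satisfying assignment from the text

    def satisfies(clauses, a):
        # The formula evaluates to 1 exactly when every clause contains
        # at least one true literal.
        return all(any(a[abs(lit)] == (1 if lit > 0 else 0) for lit in clause)
                   for clause in clauses)

    print(satisfies(clauses, assignment))    # prints True

The hard question, of course, is not evaluating one assignment but deciding whether any satisfying assignment exists.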

NP-completeness and the classes P and NP

Throughout this chapter, we refer to three classes of problems: P, NP, and NPC, the latter class being the NP-complete problems. We describe

them informally here, with formal definitions to appear later on.

The class P consists of those problems that are solvable in

polynomial time. More specifically, they are problems that can be solved

in O(n^k) time for some constant k, where n is the size of the input to the problem. Most of the problems examined in previous chapters belong to

P. The class NP consists of those problems that are “verifiable” in

polynomial time. What do we mean by a problem being verifiable? If

you were somehow given a “certificate” of a solution, then you could

verify that the certificate is correct in time polynomial in the size of the

input to the problem. For example, in the hamiltonian-cycle problem,

given a directed graph G = (V, E), a certificate would be a sequence 〈v₁, v₂, v₃, …, v_{|V|}〉 of |V| vertices. You could check in polynomial time that the sequence contains each of the |V| vertices exactly once, that (vᵢ, vᵢ₊₁)

∈ E for i = 1, 2, 3, …, |V| − 1, and that (v_{|V|}, v₁) ∈ E. As another example, for 3-CNF satisfiability, a certificate could be an assignment of

values to variables. You could check in polynomial time that this

assignment satisfies the boolean formula.
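
Both certificate checks are easy to implement. Here is a minimal Python sketch of the hamiltonian-cycle check, under our own assumption, for concreteness, that the graph is given as a vertex set and a set of directed edge pairs:

    def verify_ham_cycle(vertices, edges, cert):
        # The certificate must list every vertex exactly once ...
        if len(cert) != len(vertices) or set(cert) != set(vertices):
            return False
        # ... and each consecutive pair of vertices, wrapping around from
        # the last vertex back to the first, must be an edge of the graph.
        n = len(cert)
        return all((cert[i], cert[(i + 1) % n]) in edges for i in range(n))

    # 1 -> 2 -> 3 -> 1 is a hamiltonian cycle of this 3-vertex graph.
    print(verify_ham_cycle({1, 2, 3}, {(1, 2), (2, 3), (3, 1)}, [1, 2, 3]))

Each test examines the certificate and the graph a constant number of times, so the whole check runs in time polynomial in the size of the input.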

Any problem in P also belongs to NP, since if a problem belongs to P

then it is solvable in polynomial time without even being supplied a

certificate. We’ll formalize this notion later in this chapter, but for now

you can believe that P ⊆ NP. The famous open question is whether P is

a proper subset of NP.

Informally, a problem belongs to the class NPC—and we call it NP-

complete—if it belongs to NP and is as “hard” as any problem in NP.

We’ll formally define what it means to be as hard as any problem in NP

later in this chapter. In the meantime, we state without proof that if any

NP-complete problem can be solved in polynomial time, then every

problem in NP has a polynomial-time algorithm. Most theoretical

computer scientists believe that the NP-complete problems are

intractable, since given the wide range of NP-complete problems that

have been studied to date—without anyone having discovered a

polynomial-time solution to any of them—it would be truly astounding

if all of them could be solved in polynomial time. Yet, given the effort

devoted thus far to proving that NP-complete problems are intractable

—without a conclusive outcome—we cannot rule out the possibility

that the NP-complete problems could turn out to be solvable in

polynomial time.

To become a good algorithm designer, you must understand the

rudiments of the theory of NP-completeness. If you can establish a

problem as NP-complete, you provide good evidence for its

intractability. As an engineer, you would then do better to spend your

time developing an approximation algorithm (see Chapter 35) or solving a tractable special case, rather than searching for a fast

algorithm that solves the problem exactly. Moreover, many natural and

interesting problems that on the surface seem no harder than sorting,

graph searching, or network flow are in fact NP-complete. Therefore,

you should become familiar with this remarkable class of problems.

Overview of showing problems to be NP-complete

The techniques used to show that a particular problem is NP-complete

differ fundamentally from the techniques used throughout most of this

book to design and analyze algorithms. If you can demonstrate that a

problem is NP-complete, you are making a statement about how hard it

is (or at least how hard we think it is), rather than about how easy it is.

If you prove a problem NP-complete, you are saying that searching for

an efficient algorithm is likely to be a fruitless endeavor. In this way, NP-

completeness proofs bear some similarity to the proof in Section 8.1 of an Ω(n lg n)-time lower bound for any comparison sort algorithm, although the specific techniques used for showing NP-completeness

differ from the decision-tree method used in Section 8.1.

We rely on three key concepts in showing a problem to be NP-

complete:

Decision problems versus optimization problems

Many problems of interest are optimization problems, in which each

feasible (i.e., “legal”) solution has an associated value, and the goal is to

find a feasible solution with the best value. For example, in a problem that we call SHORTEST-PATH, the input is an undirected graph G and

vertices u and v, and the goal is to find a path from u to v that uses the fewest edges. In other words, SHORTEST-PATH is the single-pair

shortest-path problem in an unweighted, undirected graph. NP-

completeness applies directly not to optimization problems, however,

but to decision problems, in which the answer is simply “yes” or “no”

(or, more formally, “1” or “0”).

Although NP-complete problems are confined to the realm of

decision problems, there is usually a way to cast a given optimization

problem as a related decision problem by imposing a bound on the

value to be optimized. For example, a decision problem related to

SHORTEST-PATH is PATH: given an undirected graph G, vertices u

and v, and an integer k, does a path exist from u to v consisting of at most k edges?

The relationship between an optimization problem and its related

decision problem works in your favor when you try to show that the

optimization problem is “hard.” That is because the decision problem is

in a sense “easier,” or at least “no harder.” As a specific example, you

can solve PATH by solving SHORTEST-PATH and then comparing the

number of edges in the shortest path found to the value of the decision-

problem parameter k. In other words, if an optimization problem is

easy, its related decision problem is easy as well. Stated in a way that has

more relevance to NP-completeness, if you can provide evidence that a

decision problem is hard, you also provide evidence that its related

optimization problem is hard. Thus, even though it restricts attention to

decision problems, the theory of NP-completeness often has

implications for optimization problems as well.
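
To make the comparison concrete, here is a minimal Python sketch (ours, with the graph assumed to be an adjacency-list dictionary of an undirected graph, each edge listed in both directions): it solves SHORTEST-PATH by breadth-first search and then answers PATH by comparing the optimum value with k.

    import math
    from collections import deque

    def shortest_path_length(adj, u, v):
        # Breadth-first search: the number of edges on a u-to-v path with
        # the fewest edges, or infinity if no such path exists.
        dist = {u: 0}
        queue = deque([u])
        while queue:
            x = queue.popleft()
            if x == v:
                return dist[x]
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        return math.inf

    def path(adj, u, v, k):
        # Decide PATH by solving the optimization problem, then comparing with k.
        return 1 if shortest_path_length(adj, u, v) <= k else 0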

Reductions

The above notion of showing that one problem is no harder or no easier

than another applies even when both problems are decision problems.

Almost every NP-completeness proof takes advantage of this idea, as

follows. Consider a decision problem A, which you would like to solve

in polynomial time. We call the input to a particular problem an

instance of that problem. For example, in PATH, an instance is a

particular graph G, particular vertices u and v of G, and a particular integer k. Now suppose that you already know how to solve a different

decision problem B in polynomial time. Finally, suppose that you have a

procedure that transforms any instance α of A into some instance β of B

with the following characteristics:

Figure 34.1 How to use a polynomial-time reduction algorithm to solve a decision problem A in polynomial time, given a polynomial-time decision algorithm for another problem B. In polynomial time, transform an instance α of A into an instance β of B, solve B in polynomial time, and use the answer for β as the answer for α.

The transformation takes polynomial time.

The answers are the same. That is, the answer for α is “yes” if and

only if the answer for β is also “yes.”

We call such a procedure a polynomial-time reduction algorithm and, as

Figure 34.1 shows, it provides us a way to solve problem A in polynomial time:

1. Given an instance α of problem A, use a polynomial-time

reduction algorithm to transform it to an instance β of problem

B.

2. Run the polynomial-time decision algorithm for B on the

instance β.

3. Use the answer for β as the answer for α.

As long as each of these steps takes polynomial time, all three together

do also, and so you have a way to decide on α in polynomial time. In

other words, by “reducing” solving problem A to solving problem B, you use the “easiness” of B to prove the “easiness” of A.
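
In code, the pattern is nothing more than function composition. A minimal sketch in Python, where reduce_A_to_B and decide_B are hypothetical stand-ins for the two polynomial-time subroutines:

    def decide_A(alpha, reduce_A_to_B, decide_B):
        # Step 1: transform the instance of A into an instance of B.
        beta = reduce_A_to_B(alpha)
        # Steps 2 and 3: run the decider for B and pass its answer along.
        # By the two properties above, the answers for alpha and beta agree.
        return decide_B(beta)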

Recalling that NP-completeness is about showing how hard a

problem is rather than how easy it is, you use polynomial-time

reductions in the opposite way to show that a problem is NP-complete.

Let’s take the idea a step further and show how you can use polynomial-

time reductions to show that no polynomial-time algorithm can exist for

a particular problem B. Suppose that you have a decision problem A for

which you already know that no polynomial-time algorithm can exist.

(Ignore for the moment how to find such a problem A.) Suppose further

that you have a polynomial-time reduction transforming instances of A

to instances of B. Now you can use a simple proof by contradiction to

show that no polynomial-time algorithm can exist for B. Suppose

otherwise, that is, suppose that B has a polynomial-time algorithm.

Then, using the method shown in Figure 34.1, you would have a way to solve problem A in polynomial time, which contradicts the assumption

that there is no polynomial-time algorithm for A.

To prove that a problem B is NP-complete, the methodology is

similar. Although you cannot assume that there is absolutely no

polynomial-time algorithm for problem A, you prove that problem B is

NP-complete on the assumption that problem A is also NP-complete.

A first NP-complete problem

Because the technique of reduction relies on having a problem already

known to be NP-complete in order to prove a different problem NP-

complete, there must be some “first” NP-complete problem. We’ll use

the circuit-satisfiability problem, in which the input is a boolean

combinational circuit composed of AND, OR, and NOT gates, and the

question is whether there exists some set of boolean inputs to this circuit

that causes its output to be 1. Section 34.3 will prove that this first problem is NP-complete.

Chapter outline

This chapter studies the aspects of NP-completeness that bear most

directly on the analysis of algorithms. Section 34.1 formalizes the notion of “problem” and defines the complexity class P of polynomial-time

solvable decision problems. We’ll also see how these notions fit into the

framework of formal-language theory. Section 34.2 defines the class NP

of decision problems whose solutions are verifiable in polynomial time.

It also formally poses the P ≠ NP question.

Section 34.3 shows how to relate problems via polynomial-time

“reductions.” It defines NP-completeness and sketches a proof that the

circuit-satisfiability problem is NP-complete. With one problem proven

NP-complete, Section 34.4 demonstrates how to prove other problems to be NP-complete much more simply by the methodology of

reductions. To illustrate this methodology, the section shows that two

formula-satisfiability problems are NP-complete. Section 34.5 proves a variety of other problems to be NP-complete by using reductions. You

will probably find several of these reductions to be quite creative,

because they convert a problem in one domain to a problem in a

completely different domain.

34.1 Polynomial time

Since NP-completeness relies on notions of solving a problem and

verifying a certificate in polynomial time, let’s first examine what it

means for a problem to be solvable in polynomial time.

Recall that we generally regard problems that have polynomial-time

solutions as tractable. Here are three reasons why:

1. Although no reasonable person considers a problem that

requires Θ(n^100) time to be tractable, few practical problems

require time on the order of such a high-degree polynomial. The

polynomial-time computable problems encountered in practice

typically require much less time. Experience has shown that once

the first polynomial-time algorithm for a problem has been

discovered, more efficient algorithms often follow. Even if the

current best algorithm for a problem has a running time of

Θ(n^100), an algorithm with a much better running time will

likely soon be discovered.

2. For many reasonable models of computation, a problem that can

be solved in polynomial time in one model can be solved in

polynomial time in another. For example, the class of problems

solvable in polynomial time by the serial random-access machine

used throughout most of this book is the same as the class of

problems solvable in polynomial time on abstract Turing

machines.² It is also the same as the class of problems solvable in polynomial time on a parallel computer when the number of

processors grows polynomially with the input size.

3. The class of polynomial-time solvable problems has nice closure

properties, since polynomials are closed under addition,

multiplication, and composition. For example, if the output of

one polynomial-time algorithm is fed into the input of another,

the composite algorithm is polynomial. Exercise 34.1-5 asks you

to show that if an algorithm makes a constant number of calls to

polynomial-time subroutines and performs an additional

amount of work that also takes polynomial time, then the

running time of the composite algorithm is polynomial.
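
For instance, if the first algorithm runs in O(n^a) time on inputs of size n, then its output has length O(n^a), since an algorithm cannot write more output than it has steps. A second algorithm running in O(n^b) time therefore processes that output in O((n^a)^b) = O(n^{ab}) time, and for constants a, b ≥ 1 the total of O(n^a + n^{ab}) is again a polynomial in n.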

Abstract problems

To understand the class of polynomial-time solvable problems, you

must first have a formal notion of what a “problem” is. We define an

abstract problem Q to be a binary relation on a set I of problem instances and a set S of problem solutions. For example, an instance for SHORTEST-PATH is a triple consisting of a graph and two vertices. A

solution is a sequence of vertices in the graph, with perhaps the empty

sequence denoting that no path exists. The problem SHORTEST-PATH

itself is the relation that associates each instance of a graph and two

vertices with a shortest path in the graph that connects the two vertices.

Since shortest paths are not necessarily unique, a given problem

instance may have more than one solution.

This formulation of an abstract problem is more general than

necessary for our purposes. As we saw above, the theory of NP-

completeness restricts attention to decision problems: those having a

yes/no solution. In this case, we can view an abstract decision problem

as a function that maps the instance set I to the solution set {0, 1}. For

example, a decision problem related to SHORTEST-PATH is the

problem PATH that we saw earlier. If i = 〈G, u, v, k〉 is an instance of PATH, then PATH(i) = 1 (yes) if G contains a path from u to v with at most k edges, and PATH(i) = 0 (no) otherwise. Many abstract problems

are not decision problems, but rather optimization problems, which

require some value to be minimized or maximized. As we saw above,

however, you can usually recast an optimization problem as a decision

problem that is no harder.

Encodings

In order for a computer program to solve an abstract problem, its

problem instances must appear in a way that the program understands.

An encoding of a set S of abstract objects is a mapping e from S to the set of binary strings.³ For example, we are all familiar with encoding the natural numbers ℕ = {0, 1, 2, 3, 4, …} as the strings {0, 1, 10, 11, 100,

…}. Using this encoding, e(17) = 10001. If you have looked at computer

representations of keyboard characters, you probably have seen the

ASCII code, where, for example, the encoding of A is 01000001. You can

encode a compound object as a binary string by combining the

representations of its constituent parts. Polygons, graphs, functions,

ordered pairs, programs—all can be encoded as binary strings.
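
A minimal Python sketch of such encodings (the pairing scheme below is one convenient choice of ours; any scheme that can be decoded unambiguously would do):

    def enc_nat(n):
        # The binary encoding of a natural number described above: e(17) = '10001'.
        return bin(n)[2:]

    def enc_pair(x, y):
        # Encode a compound object by making each part self-delimiting:
        # double every bit, and use '01' as an unambiguous separator.
        def double(s):
            return ''.join(c + c for c in s)
        return double(x) + '01' + double(y)

    print(enc_nat(17))                       # prints '10001'
    print(enc_pair(enc_nat(2), enc_nat(3)))  # prints '1100' + '01' + '1111'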

Thus, a computer algorithm that “solves” some abstract decision

problem actually takes an encoding of a problem instance as input. The

size of an instance i is just the length of its string, which we denote by |i|.

We call a problem whose instance set is the set of binary strings a

concrete problem. We say that an algorithm solves a concrete problem in O(T(n)) time if, when it is provided a problem instance i of length n =

|i|, the algorithm can produce the solution in O(T(n)) time.⁴ A concrete problem is polynomial-time solvable, therefore, if there exists an

algorithm to solve it in O(n^k) time for some constant k.

We can now formally define the complexity class P as the set of

concrete decision problems that are polynomial-time solvable.

Encodings map abstract problems to concrete problems. Given an

abstract decision problem Q mapping an instance set I to {0, 1}, an encoding e : I → {0, 1}* can induce a related concrete decision problem,

which we denote by e(Q).⁵ If the solution to an abstract-problem instance i ∈ I is Q(i) ∈ {0, 1}, then the solution to the concrete-problem instance e(i) ∈ {0, 1}* is also Q(i). As a technicality, some binary strings might represent no meaningful abstract-problem instance. For

convenience, assume that any such string maps arbitrarily to 0. Thus,

the concrete problem produces the same solutions as the abstract

problem on binary-string instances that represent the encodings of

abstract-problem instances.

We would like to extend the definition of polynomial-time solvability

from concrete problems to abstract problems by using encodings as the

bridge, ideally with the definition independent of any particular

encoding. That is, the efficiency of solving a problem should not depend

on how the problem is encoded. Unfortunately, it depends quite heavily

on the encoding. For example, suppose that the sole input to an

algorithm is an integer k, and suppose that the running time of the algorithm is Θ(k). If the integer k is provided in unary—a string of k 1s

—then the running time of the algorithm is O(n) on length-n inputs, which is polynomial time. If the input k is provided using the more natural binary representation, however, then the input length is n = ⌊lg k⌋ + 1, so the size of the unary encoding is exponential in the size of the

binary encoding. With the binary representation, the running time of

the algorithm is Θ(k) = Θ(2^n), which is exponential in the size of the input. Thus, depending on the encoding, the algorithm runs in either

polynomial or superpolynomial time.
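
The blow-up is easy to see concretely (a small Python sketch):

    k = 17
    unary = '1' * k          # length k
    binary = bin(k)[2:]      # length floor(lg k) + 1 = 5
    print(len(unary), len(binary))    # prints 17 5
    # An algorithm running in Theta(k) time is linear in len(unary) but
    # exponential in len(binary), since k >= 2**(len(binary) - 1).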

The encoding of an abstract problem matters quite a bit to how we

understand polynomial time. We cannot really talk about solving an

abstract problem without first specifying an encoding. Nevertheless, in

practice, if we rule out “expensive” encodings such as unary ones, the

actual encoding of a problem makes little difference to whether the

problem can be solved in polynomial time. For example, representing

integers in base 3 instead of binary has no effect on whether a problem

is solvable in polynomial time, since we can convert an integer

represented in base 3 to an integer represented in base 2 in polynomial

time.
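
In Python, for instance, the conversion is a one-liner, and both the parse and the reprint take time polynomial in the length of the numeral:

    def base3_to_base2(s):
        # int(s, 3) parses a base-3 numeral; bin renders the value in base 2.
        return bin(int(s, 3))[2:]

    print(base3_to_base2('122'))    # 1*9 + 2*3 + 2 = 17, printed as '10001'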

We say that a function f : {0, 1}* → {0, 1}* is polynomial-time computable if there exists a polynomial-time algorithm A that, given any input x ∈ {0, 1}*, produces as output f(x). For some set I of problem instances, we say that two encodings e₁ and e₂ are polynomially related if there exist two polynomial-time computable functions f₁₂ and f₂₁

such that for any i ∈ I, we have f₁₂(e₁(i)) = e₂(i) and f₂₁(e₂(i)) = e₁(i).⁶

That is, a polynomial-time algorithm can compute the encoding e₂(i) from the encoding e₁(i), and vice versa. If two encodings e₁ and e₂ of an abstract problem are polynomially related, whether the problem is

polynomial-time solvable or not is independent of which encoding we

use, as the following lemma shows.

Lemma 34.1

Let Q be an abstract decision problem on an instance set I, and let e₁

and e₂ be polynomially related encodings on I. Then, e₁(Q) ∈ P if and only if e₂(Q) ∈ P.

Proof We need only prove the forward direction, since the backward

direction is symmetric. Suppose, therefore, that e₁(Q) can be solved in O(n^k) time for some constant k. Furthermore, suppose that for any problem instance i, the encoding e₁(i) can be computed from the encoding e₂(i) in O(n^c) time for some constant c, where n = |e₂(i)|. To solve problem e₂(Q) on input e₂(i), first compute e₁(i) and then run the algorithm for e₁(Q) on e₁(i). How long does this procedure take?

Converting encodings takes O(n^c) time, and therefore |e₁(i)| = O(n^c), since the output of a serial computer cannot be longer than its running

time. Solving the problem on e₁(i) takes O(|e₁(i)|^k) = O(n^{ck}) time, which is polynomial since both c and k are constants.

Thus, whether an abstract problem has its instances encoded in binary or base 3 does not affect its “complexity,” that is, whether it is

polynomial-time solvable or not. If instances are encoded in unary,

however, its complexity may change. In order to be able to converse in

an encoding-independent fashion, we generally assume that problem

instances are encoded in any reasonable, concise fashion, unless we

specifically say otherwise. To be precise, we assume that the encoding of

an integer is polynomially related to its binary representation, and that

the encoding of a finite set is polynomially related to its encoding as a

list of its elements, enclosed in braces and separated by commas. (ASCII

is one such encoding scheme.) With such a “standard” encoding in

hand, we can derive reasonable encodings of other mathematical

objects, such as tuples, graphs, and formulas. To denote the standard

encoding of an object, we enclose the object in angle brackets. Thus, 〈G〉

denotes the standard encoding of a graph G.

As long as the encoding implicitly used is polynomially related to

this standard encoding, we can talk directly about abstract problems

without reference to any particular encoding, knowing that the choice

of encoding has no effect on whether the abstract problem is

polynomial-time solvable. From now on, we will generally assume that

all problem instances are binary strings encoded using the standard

encoding, unless we explicitly specify the contrary. We’ll also typically

neglect the distinction between abstract and concrete problems. You

should watch out for problems that arise in practice, however, in which a

standard encoding is not obvious and the encoding does make a

difference.

A formal-language framework

By focusing on decision problems, we can take advantage of the

machinery of formal-language theory. Let’s review some definitions

from that theory. An alphabet Σ is a finite set of symbols. A language L

over Σ is any set of strings made up of symbols from Σ. For example, if

Σ = {0, 1}, the set L = {10, 11, 101, 111, 1011, 1101, 10001,…} is the

language of binary representations of prime numbers. We denote the

empty string by ε, the empty language by Ø, and the language of all

strings over Σ by Σ*. For example, if Σ = {0, 1}, then Σ* = {ε, 0, 1, 00, 01, 10, 11, 000, …} is the set of all binary strings. Every language L over

Σ is a subset of Σ*.

Languages support a variety of operations. Set-theoretic operations,

such as union and intersection, follow directly from the set-theoretic definitions. We define the complement of a language L by L̄ = Σ* − L.

The concatenation L₁L₂ of two languages L₁ and L₂ is the language L = {x₁x₂ : x₁ ∈ L₁ and x₂ ∈ L₂}.

The closure or Kleene star of a language L is the language

L* = {ε} ∪ L ∪ L^2 ∪ L^3 ∪ ⋯,

where L^k is the language obtained by concatenating L to itself k times.
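
For finite languages, these operations translate directly into set comprehensions. A small Python sketch (L* itself is infinite, so the sketch materializes only the truncation {ε} ∪ L ∪ L^2 ∪ … ∪ L^m):

    def concat(A, B):
        # The concatenation A B = {x1 x2 : x1 in A and x2 in B}.
        return {x + y for x in A for y in B}

    def kleene_up_to(L, m):
        # The truncation {eps} U L U L^2 U ... U L^m of the Kleene star L*.
        result, power = {''}, {''}
        for _ in range(m):
            power = concat(power, L)
            result |= power
        return result

    print(concat({'10', '11'}, {'0', '1'}))   # {'100', '101', '110', '111'}
    print(kleene_up_to({'0', '1'}, 2))        # {'', '0', '1', '00', '01', '10', '11'}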

From the point of view of language theory, the set of instances for

any decision problem Q is simply the set Σ*, where Σ = {0, 1}. Since Q is entirely characterized by those problem instances that produce a 1 (yes)

answer, we can view Q as a language L over Σ = {0, 1}, where

L = {x ∈ Σ* : Q(x) = 1}.

For example, the decision problem PATH has the corresponding

language

PATH = {〈G, u, v, k〉 : G = (V, E) is an undirected graph,
         u, v ∈ V,
         k ≥ 0 is an integer, and
         G contains a path from u to v with at most k edges}.

(Where convenient, we’ll sometimes use the same name—PATH in this

case—to refer to both a decision problem and its corresponding

language.)

The formal-language framework allows us to express concisely the

relation between decision problems and algorithms that solve them. We

say that an algorithm A accepts a string x ∈ {0, 1}* if, given input x,

the algorithm’s output A(x) is 1. The language accepted by an algorithm A is the set of strings L = {x ∈ {0, 1}* : A(x) = 1}, that is, the set of strings that the algorithm accepts. An algorithm A rejects a string x if A(x) = 0.

Even if language L is accepted by an algorithm A, the algorithm does not necessarily reject a string x ∉ L provided as input to it. For example, the algorithm might loop forever. A language L is decided by

an algorithm A if every binary string in L is accepted by A and every binary string not in L is rejected by A. A language L is accepted in polynomial time by an algorithm A if it is accepted by A and if in addition there exists a constant k such that for any length-n string x ∈

L, algorithm A accepts x in O(n^k) time. A language L is decided in polynomial time by an algorithm A if there exists a constant k such that for any length-n string x ∈ {0, 1}*, the algorithm correctly decides whether x ∈ L in O(n^k) time. Thus, to accept a language, an algorithm need only produce an answer when provided a string in L, but to decide

a language, it must correctly accept or reject every string in {0, 1}*.

As an example, the language PATH can be accepted in polynomial

time. One polynomial-time accepting algorithm verifies that G encodes

an undirected graph, verifies that u and v are vertices in G, uses breadth-first search to compute a path from u to v in G with the fewest edges, and then compares the number of edges on the path obtained with k. If

G encodes an undirected graph and the path found from u to v has at most k edges, the algorithm outputs 1 and halts. Otherwise, the

algorithm runs forever. This algorithm does not decide PATH, however,

since it does not explicitly output 0 for instances in which a shortest

path has more than k edges. A decision algorithm for PATH must

explicitly reject binary strings that do not belong to PATH. For a

decision problem such as PATH, such a decision algorithm is

straightforward to design: instead of running forever when there is not a

path from u to v with at most k edges, it outputs 0 and halts. (It must also output 0 and halt if the input encoding is faulty.) For other

problems, such as Turing’s Halting Problem, there exists an accepting

algorithm, but no decision algorithm exists.
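
A toy Python sketch of the distinction (our own example, for the language of binary strings that contain a 1; the accepter below deliberately fails to halt on nonmembers):

    def accepter(x):
        # Outputs 1 on every string containing a '1', but loops forever on
        # every other string: it accepts the language without deciding it.
        while '1' not in x:
            pass
        return 1

    def decider(x):
        # Halts on every input with an explicit 1 or 0: it decides the language.
        return 1 if '1' in x else 0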

We can informally define a complexity class as a set of languages, membership in which is determined by a complexity measure, such as

running time, of an algorithm that determines whether a given string x

belongs to language L. The actual definition of a complexity class is somewhat more technical.⁷

Using this language-theoretic framework, we can provide an

alternative definition of the complexity class P:

P = {L ⊆ {0, 1}* : there exists an algorithm A that decides L in polynomial time}.

In fact, as the following theorem shows, P is also the class of languages

that can be accepted in polynomial time.

Theorem 34.2

P = {L : L is accepted by a polynomial-time algorithm}.

Proof Because the class of languages decided by polynomial-time

algorithms is a subset of the class of languages accepted by polynomial-

time algorithms, we need only show that if L is accepted by a

polynomial-time algorithm, it is decided by a polynomial-time

algorithm. Let L be the language accepted by some polynomial-time

algorithm A. We use a classic “simulation” argument to construct

another polynomial-time algorithm A′ that decides L. Because A accepts L in O(n^k) time for some constant k, there also exists a constant c such that A accepts L in at most cn^k steps. For any input string x, the algorithm A′ simulates cn^k steps of A. After simulating cn^k steps, algorithm A′ inspects the behavior of A. If A has accepted x, then A′

accepts x by outputting a 1. If A has not accepted x, then A′ rejects x by outputting a 0. The overhead of A′ simulating A does not increase the

running time by more than a polynomial factor, and thus A′ is a

polynomial-time algorithm that decides L.
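
The simulation argument is easy to mimic in Python when the accepter is modeled as a generator that yields once per simulated step (a toy sketch of ours; a real proof simulates a formal model of computation step by step):

    def make_decider(accepter, c, k):
        # Build A': run the accepter for at most c * n^k steps; if it has
        # not accepted by then, it never will, so reject by outputting 0.
        def decider(x):
            budget = c * max(1, len(x)) ** k
            run = accepter(x)           # a generator: one yield per step
            for _ in range(budget):
                try:
                    next(run)
                except StopIteration as halt:
                    return 1 if halt.value == 1 else 0
            return 0
        return decider

    def accepter(x):
        # Accepts strings ending in '1' within one step, and spins forever
        # (one yield per step) on all other strings.
        while not x.endswith('1'):
            yield
        return 1

    decide = make_decider(accepter, c=2, k=1)
    print(decide('101'), decide('100'))   # prints 1 0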

The proof of Theorem 34.2 is nonconstructive. For a given language

L ∈ P, we may not actually know a bound on the running time for the

algorithm A that accepts L. Nevertheless, we know that such a bound exists, and therefore, that an algorithm A′ exists that can check the bound, even though we may not be able to find the algorithm A′ easily.

Exercises

34.1-1

Define the optimization problem LONGEST-PATH-LENGTH as the

relation that associates each instance of an undirected graph and two

vertices with the number of edges in a longest simple path between the

two vertices. Define the decision problem LONGEST-PATH = {〈G, u, v,

k〉 : G = (V, E) is an undirected graph, u, v ∈ V, k ≥ 0 is an integer, and there exists a simple path from u to v in G consisting of at least k edges}.

Show that the optimization problem LONGEST-PATH-LENGTH can

be solved in polynomial time if and only if LONGEST-PATH ∈ P.

34.1-2

Give a formal definition for the problem of finding the longest simple