their collective wisdom in order to make good predictions. Let’s denote the experts, n of them, by E1, E2, …, En, and let’s say that T events are going to take place. Each event has an outcome of either 0 or 1, with o(t) denoting the outcome of the tth event. Before event t, each expert Ei makes a prediction pi(t) ∈ {0, 1}. You, as the “learner,” then take the set of n expert predictions for event t and produce a single prediction p(t) ∈ {0, 1} of your own. You base your prediction only on the predictions of the experts and anything you have learned about the experts from their previous predictions. You do not use any additional information about the event. Only after making your prediction do you ascertain the outcome o(t) of event t. If your prediction p(t) matches o(t), then you were correct; otherwise, you made a mistake. The goal is to minimize the total number m of mistakes, where

m = Σ_{t=1}^{T} |p(t) − o(t)|.

You can also keep track of the number of mistakes each expert makes: expert Ei makes mi mistakes, where

mi = Σ_{t=1}^{T} |pi(t) − o(t)|.
For example, suppose that you are following the price of a stock, and
each day you decide whether to invest in it for just that day by buying it
at the beginning of the day and selling it at the end of the day. If, on
some day, you buy the stock and it goes up, then you made the correct
decision, but if the stock goes down, then you made a mistake.
Similarly, if on some day, you do not buy the stock and it goes down,
then you made the correct decision, but if the stock goes up, then you
made a mistake. Since you would like to make as few mistakes as
possible, you use the advice of the experts to make your decisions.
We’ll assume nothing about the movement of the stock. We’ll also
assume nothing about the experts: the experts’ predictions could be
correlated, they could be chosen to deceive you, or perhaps some are
not really experts after all. What algorithm would you use?
Before designing an algorithm for this problem, we need to consider
what is a fair way to evaluate our algorithm. It is reasonable to expect
that our algorithm performs better when the expert predictions are better, and that it performs worse when the expert predictions are worse.
The goal of the algorithm is to limit the number of mistakes you make
to be close to the number of mistakes that the best of the experts makes.
At first, this goal might seem impossible, because you do not know until
the end which expert is best. We’ll see, however, that by taking the
advice provided by all the experts into account, you can achieve this
goal. More formally, we use the notion of “regret,” which compares our algorithm to the performance of the best expert (in hindsight) overall. Letting m* = min {mi : 1 ≤ i ≤ n} denote the number of mistakes made by the best expert, the regret is m − m*. The goal is to design an algorithm with low regret. (Regret can be negative, although it typically isn’t, since it is rare that you do better than the best expert.)
As a warm-up, let’s consider the case in which one of the experts
makes a correct prediction each time. Even without knowing who that
expert is, you can still achieve good results.
Lemma 33.3
Suppose that out of n experts, there is one who always makes the correct
prediction for all T events. Then there is an algorithm that makes at most ⌈lg n⌉ mistakes.
Proof The algorithm maintains a set S consisting of experts who have
not yet made a mistake. Initially, S contains all n experts. The algorithm’s prediction is always the majority vote of the predictions of
the experts remaining in set S. In case of a tie, the algorithm makes any
prediction. After each outcome is learned, set S is updated to remove all
the experts who made an incorrect prediction about that outcome.
We now analyze the algorithm. The expert who always makes the
correct prediction will always be in set S. Every time the algorithm makes a mistake, at least half of the experts who were still in S also make a mistake, and these experts are removed from S. If S′ is the set of experts remaining after removing those who made a mistake, we have
that |S′| ≤ |S|/2. The size of S can be halved at most ⌈lg n⌉ times until |S| = 1. From this point on, we know that the algorithm never makes a mistake, since the set S consists only of the one expert who never makes a mistake. Therefore, overall the algorithm makes at most ⌈lg n⌉ mistakes.
▪
Exercise 33.2-1 asks you to generalize this result to the case when
there is no expert who makes perfect predictions and show that, for any
set of experts, there is an algorithm that makes at most m* ⌈lg n⌉
mistakes. The generalized algorithm begins in the same way. The set S
might become empty at some point, however. If that ever happens, reset
S to contain all the experts and continue the algorithm.
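The halving strategy from the proof of Lemma 33.3 can be sketched in Python as follows (the function name and the list-of-lists input format are illustrative choices, not from the text; the sketch assumes, as the lemma does, that some expert is always correct):

```python
def halving_predict(predictions, outcomes):
    """Majority vote over experts that have not yet erred (Lemma 33.3).

    predictions[t][i] is expert i's 0/1 prediction for event t;
    outcomes[t] is the 0/1 outcome of event t.
    Returns the number of mistakes the algorithm makes.
    Assumes at least one expert predicts every event correctly.
    """
    n = len(predictions[0])
    S = set(range(n))           # experts with no mistakes so far
    mistakes = 0
    for t, o in enumerate(outcomes):
        votes = sum(predictions[t][i] for i in S)
        p = 1 if 2 * votes >= len(S) else 0   # majority vote, ties -> 1
        if p != o:
            mistakes += 1
        # remove every expert that erred on this event
        S = {i for i in S if predictions[t][i] == o}
    return mistakes
```

With n = 4 experts, the lemma guarantees at most ⌈lg 4⌉ = 2 mistakes, no matter which expert is the perfect one.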
You can substantially improve your prediction ability by moving beyond merely tracking which experts have made no mistakes, or no recent mistakes, to a more nuanced evaluation of the quality of each expert. The key idea is to use the feedback you receive to update your evaluation of how much trust to put in each expert. As the experts make predictions, you observe whether they were correct and decrease your confidence in the experts who make more mistakes. In this way, you can learn over time which experts are more reliable and which are less reliable, and weight their predictions accordingly. The change in weights is accomplished via multiplication, hence the term “multiplicative weights.”
The algorithm appears in the procedure WEIGHTED-MAJORITY on the following page, which takes a set E = {E1, E2, …, En} of experts, a number T of events, the number n of experts, and a parameter 0 < γ ≤ 1/2 that controls how the weights change. The algorithm maintains weights wi(t) for i = 1, 2, …, n and t = 1, 2, …, T, where wi(t) represents how much you trust expert Ei’s prediction for event t. The for loop of lines 1–2 sets the initial weights wi(1) to 1, capturing the idea that with no knowledge, you trust each expert equally. Each iteration of the main for loop of lines 3–18 does the following for an event t = 1, 2, …, T. Each expert Ei makes a prediction for event t in line 4. Lines 5–8 compute upweight(t), the sum of the weights of the experts who predict 1 for event t, and downweight(t), the sum of the weights of the experts who predict 0 for the event. Lines 9–11 decide the algorithm’s prediction p(t) for event t based on whichever weighted sum is larger (breaking ties in favor of deciding 1). The outcome of event t is revealed in line 12.
Finally, lines 14–17 decrease the weights of the experts who made an
incorrect prediction for event t by multiplying their weights by 1 – γ, leaving alone the weights of the experts who correctly predicted the
event’s outcome. Thus, the fewer mistakes each expert makes, the higher
that expert’s weight.
The WEIGHTED-MAJORITY procedure doesn’t do much worse
than any expert. In particular, it doesn’t do much worse than the best
expert. To quantify this claim, let m(t) be the number of mistakes made by the procedure through event t, and let mi(t) be the number of mistakes made by expert Ei through event t. The following theorem is the key.
WEIGHTED-MAJORITY(E, T, n, γ)
 1  for i = 1 to n
 2      wi(1) = 1                            // trust each expert equally
 3  for t = 1 to T
 4      each expert Ei ∈ E makes a prediction pi(t)
 5      U = {Ei ∈ E : pi(t) == 1}            // experts who predicted 1
 6      upweight(t) = Σ_{Ei ∈ U} wi(t)       // sum of weights of experts who predicted 1
 7      D = {Ei ∈ E : pi(t) == 0}            // experts who predicted 0
 8      downweight(t) = Σ_{Ei ∈ D} wi(t)     // sum of weights of experts who predicted 0
 9      if upweight(t) ≥ downweight(t)
10          p(t) = 1                         // algorithm predicts 1
11      else p(t) = 0                        // algorithm predicts 0
12      outcome o(t) is revealed
13      // If p(t) ≠ o(t), the algorithm made a mistake.
14      for i = 1 to n
15          if pi(t) ≠ o(t)                  // if expert Ei made a mistake …
16              wi(t+1) = (1 − γ) · wi(t)    // … then decrease that expert’s weight
17          else wi(t+1) = wi(t)
18      return p(t)
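A Python transcription of WEIGHTED-MAJORITY might look as follows (the list-of-lists input format is an illustrative choice; the weights wi(t) are stored in a single array w that is overwritten in place, and the procedure returns its total number of mistakes for convenience):

```python
def weighted_majority(predictions, outcomes, gamma):
    """WEIGHTED-MAJORITY: predict by weighted majority vote.

    predictions[t][i] is expert i's 0/1 prediction for event t,
    outcomes[t] the revealed 0/1 outcome, and 0 < gamma <= 1/2 the
    parameter controlling how the weights change.
    Returns the algorithm's total number of mistakes.
    """
    n = len(predictions[0])
    w = [1.0] * n                     # trust each expert equally
    mistakes = 0
    for t, o in enumerate(outcomes):
        up = sum(w[i] for i in range(n) if predictions[t][i] == 1)
        down = sum(w[i] for i in range(n) if predictions[t][i] == 0)
        p = 1 if up >= down else 0    # weighted majority, ties -> 1
        if p != o:
            mistakes += 1
        for i in range(n):
            if predictions[t][i] != o:
                w[i] *= 1.0 - gamma   # penalize experts that erred
    return mistakes
```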
Theorem 33.4
When running WEIGHTED-MAJORITY, we have, for every expert Ei and every event T′ ≤ T,

m(T′) ≤ 2(1 + γ) mi(T′) + (2 ln n)/γ.

Proof Every time an expert Ei makes a mistake, its weight, which is initially 1, is multiplied by 1 − γ, and so we have

wi(t+1) = (1 − γ)^(mi(t))    (33.6)

for t = 1, 2, …, T.
We use a potential function

W(t) = Σ_{i=1}^{n} wi(t+1),

summing the weights of all n experts after iteration t of the for loop of lines 3–18. Initially, we have W(0) = n, since all n weights start out with the value 1. Because each expert belongs to either the set U or the set D (defined in lines 5 and 7 of WEIGHTED-MAJORITY), we always have W(t − 1) = upweight(t) + downweight(t) after each execution of line 8.
Consider an iteration t in which the algorithm makes a mistake in its prediction, which means that either the algorithm predicts 1 and the outcome is 0 or the algorithm predicts 0 and the outcome is 1. Without loss of generality, assume that the algorithm predicts 1 and the outcome is 0. The algorithm predicted 1 because upweight(t) ≥ downweight(t) in line 9, which implies that

upweight(t) ≥ W(t − 1)/2.    (33.7)

Each expert in U then has its weight multiplied by 1 − γ, and each expert in D has its weight unchanged. Thus, we have

W(t) = (1 − γ) · upweight(t) + downweight(t)
     = W(t − 1) − γ · upweight(t)
     ≤ W(t − 1) − (γ/2) · W(t − 1)    (by inequality (33.7)).

Therefore, for every iteration t in which the algorithm makes a mistake, we have

W(t) ≤ (1 − γ/2) W(t − 1).    (33.8)
In an iteration where the algorithm does not make a mistake, some of the weights decrease and some remain unchanged, so that we have

W(t) ≤ W(t − 1).    (33.9)

Since there are m(T′) mistakes made through iteration T′, and W(0) = n, we can repeatedly apply inequality (33.8) to iterations where the algorithm makes a mistake and inequality (33.9) to iterations where the algorithm does not make a mistake, obtaining

W(T′) ≤ n (1 − γ/2)^(m(T′)).    (33.10)
Because the function W is the sum of the weights and all weights are positive, its value exceeds any single weight. Therefore, using equation (33.6), we have, for any expert Ei and for any iteration T′ ≤ T,

W(T′) > wi(T′ + 1) = (1 − γ)^(mi(T′)).    (33.11)

Combining inequalities (33.10) and (33.11) gives

(1 − γ)^(mi(T′)) < n (1 − γ/2)^(m(T′)).

Taking the natural logarithm of both sides yields

mi(T′) ln(1 − γ) < ln n + m(T′) ln(1 − γ/2).    (33.12)

We now use the Taylor series expansion to derive upper and lower bounds on the logarithmic factors in inequality (33.12). The Taylor series for ln(1 + x) is given in equation (3.22) on page 67. Substituting −x for x, we have that for 0 < x ≤ 1/2,

ln(1 − x) = −x − x²/2 − x³/3 − ⋯.    (33.13)

Since each term on the right-hand side is negative, we can drop all terms except the first and obtain an upper bound of ln(1 − x) ≤ −x. Since 0 < γ ≤ 1/2, we have

ln(1 − γ/2) ≤ −γ/2.    (33.14)
For the lower bound, Exercise 33.2-2 asks you to show that ln(1 − x) ≥ −x − x² when 0 < x ≤ 1/2, so that

ln(1 − γ) ≥ −γ − γ².    (33.15)

Thus, substituting the bounds (33.14) and (33.15) into inequality (33.12), we have

−(γ + γ²) mi(T′) < ln n − (γ/2) m(T′).    (33.16)

Subtracting ln n from both sides of inequality (33.16) and then multiplying both sides by −2/γ (which reverses the direction of the inequality) yields

m(T′) < 2(1 + γ) mi(T′) + (2 ln n)/γ,

thus proving the theorem.
▪
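The two logarithm bounds used in the proof are easy to sanity-check numerically, for instance:

```python
import math

# Spot-check the two bounds used above for 0 < x <= 1/2:
#   ln(1 - x) <= -x          (drop the negative Taylor terms)
#   ln(1 - x) >= -x - x**2   (Exercise 33.2-2)
def log_bounds_hold(x):
    return -x - x * x <= math.log(1.0 - x) <= -x

assert all(log_bounds_hold(k / 100.0) for k in range(1, 51))  # x = 0.01, ..., 0.50
```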
Theorem 33.4 applies to any expert and any event T′ ≤ T. In particular, we can compare against the best expert after all events have
occurred, producing the following corollary.
Corollary 33.5
At the end of procedure WEIGHTED-MAJORITY, we have

m ≤ 2(1 + γ) m* + (2 ln n)/γ.    (33.17)

▪
Let’s explore this bound. Assuming that m* ≥ 4 ln n, we can choose γ = √((ln n)/m*), which is at most 1/2, and plug into inequality (33.17) to obtain

m ≤ 2m* + 4√(m* ln n),

and so the number of errors is at most twice the number of errors made by the best expert plus a term that is often slower growing than m*.
Exercise 33.2-4 shows that you can decrease the bound on the number
of errors by a factor of 2 by using randomization, which leads to much
stronger bounds. In particular, the upper bound on regret ( m – m*) is
reduced from (1 + 2 γ) m* + (2 ln n)/ γ to an expected value of ϵm* + (ln n)/ ϵ, where both γ and ϵ are at most 1/2. Numerically, we can see that if γ = 1/2, WEIGHTED-MAJORITY makes at most 3 times the number
of errors as the best expert, plus 4 ln n errors. As another example, suppose that T = 1000 predictions are being made by n = 20 experts, and the best expert is correct 95% of the time, making 50 errors. Then
WEIGHTED-MAJORITY makes at most 100(1 + γ) + (2 ln 20)/γ errors.
By choosing γ = 1/4, WEIGHTED-MAJORITY makes at most 149
errors, or a success rate of at least 85%.
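The arithmetic in this example is easy to check directly; a small helper (our own naming) evaluates the bound of Corollary 33.5:

```python
import math

def wm_error_bound(m_star, n, gamma):
    """Mistake bound 2(1 + gamma)*m* + (2 ln n)/gamma from Corollary 33.5."""
    return 2.0 * (1.0 + gamma) * m_star + 2.0 * math.log(n) / gamma

# T = 1000 events, n = 20 experts, best expert makes m* = 50 errors:
bound = wm_error_bound(50, 20, 0.25)   # 125 + 8 ln 20, about 149
```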
Multiplicative weights methods typically refer to a broader class of
algorithms that includes WEIGHTED-MAJORITY. The outcomes
and predictions need not be only 0 or 1, but can be real numbers, and
there can be a loss associated with a particular outcome and prediction.
The weights can be updated by a multiplicative factor that depends on
the loss, and the algorithm can, given a set of weights, treat them as a
distribution on experts and use them to choose an expert to follow in
each event. Even in these more general settings, bounds similar to
Theorem 33.4 hold.
Exercises
33.2-1
The proof of Lemma 33.3 assumes that some expert never makes a
mistake. It is possible to generalize the algorithm and analysis to
remove this assumption. The new algorithm begins in the same way.
The set S might become empty at some point, however. If that ever
happens, reset S to contain all the experts and continue the algorithm.
Show that the number of mistakes that this algorithm makes is at most
m* ⌈lg n⌉.
33.2-2
Show that ln(1 – x) ≥ − x – x 2 when 0 < x ≤ 1/2. ( Hint: Start with equation (33.13), group all the terms after the first three, and use
equation (A.7) on page 1142.)
33.2-3
Consider a randomized variant of the algorithm given in the proof of
Lemma 33.3, in which some expert never makes a mistake. At each step,
choose an expert Ei uniformly at random from the set S and then make
the same prediction as Ei. Show that the expected number of mistakes made by this algorithm is at most ⌈lg n⌉.
33.2-4
Consider a randomized version of WEIGHTED-MAJORITY. The
algorithm is the same, except for the prediction step, which interprets
the weights as a probability distribution over the experts and chooses an
expert Ei according to that distribution. It then chooses its prediction to
be the same as the prediction made by expert Ei. Show that, for any 0 <
ϵ < 1/2, the expected number of mistakes made by this algorithm is at
most (1 + ϵ) m* + (ln n)/ ϵ.
Suppose that you have a set {p1, p2, …, pn} of points and you want to find the line that best fits these points. For any line ℓ, there is a distance di between each point pi and the line. You want to find the line that minimizes some function f(d1, …, dn). There are many possible choices for the definition of distance and for the function f. For example, the distance can be the projection distance to the line and the function can
be the sum of the squares of the distances. This type of problem is
common in data science and machine learning—the line is the
hypothesis that best describes the data—where the particular definition
of best is determined by the definition of distance and the objective f. If
the definition of distance and the function f are linear, then we have a
linear-programming problem, as discussed in Chapter 29. Although the
linear-programming framework captures several important problems,
many other problems, including various machine-learning problems,
have objectives and constraints that are not necessarily linear. We need
frameworks and algorithms to solve such problems.
In this section, we consider the problem of optimizing a continuous
function and discuss one of the most popular methods to do so:
gradient descent. Gradient descent is a general method for finding a
local minimum of a function f : ℝ n → ℝ, where informally, a local minimum of a function f is a point x for which f(x) ≤ f(x′) for all x′ that are “near” x. When the function is convex, it can find a point near the
global minimizer of f: an n-vector argument x = ( x 1, x 2, …, xn) such that f(x) is minimum. For the intuitive idea behind gradient descent, imagine being in a landscape of hills and valleys, and wanting to get to a
low point as quickly as possible. You survey the terrain and choose to
move in the direction that takes you downhill the fastest from your
current position. You move in that direction, but only for a short while,
because as you proceed, the terrain changes and you might need to
choose a different direction. So you stop, reevaluate the possible
directions and move another short distance in the steepest downhill
direction, which might differ from the direction of your previous
movement. You continue this process until you reach a point from
which all directions lead up. Such a point is a local minimum.
In order to make this informal procedure more formal, we need to
define the gradient of a function, which in the analogy above is a
measure of the steepness of the various directions. Given a function f : ℝn → ℝ, its gradient ∇f is a function ∇f : ℝn → ℝn comprising n partial derivatives: ∇f = (∂f/∂x1, ∂f/∂x2, …, ∂f/∂xn). Analogous to the derivative of a
function of a single variable, the gradient can be viewed as a direction in
which the function value locally increases the fastest, and the rate of
that increase. This view is informal; in order to make it formal we would
have to define what local means and place certain conditions, such as
continuity or existence of derivatives, on the function. Nevertheless, this
view motivates the key step of gradient descent—move in the direction
opposite to the gradient, by a distance influenced by the magnitude of the gradient.
The general procedure of gradient descent proceeds in steps. You
start at some initial point x(0), which is an n-vector. At each step t, you compute the value of the gradient of f at point x( t), that is, (∇ f)(x( t)), which is also an n-vector. You then move in the direction opposite to the
gradient in each dimension at x(t) to arrive at the next point x(t+1), which again is an n-vector. Because you moved in a direction along which f decreases locally, you expect that f(x(t+1)) ≤ f(x(t)), at least when the step is small enough. Several details are needed to turn this idea into an actual algorithm. The two main details are that you need an initial point and
that you need to decide how far to move in the direction of the negative
gradient. You also need to understand when to stop and what you can
conclude about the quality of the solution found. We will explore these
issues further in this section, for both constrained minimization, where
there are additional constraints on the points, and unconstrained
minimization, where there are none.
Unconstrained gradient descent
In order to gain intuition, let’s consider unconstrained gradient descent
in just one dimension, that is, when f is a function of a scalar x, so that f
: ℝ → ℝ. In this case, the gradient ∇ f of f is just f′( x), the derivative of f with respect to x. Consider the function f shown in blue in Figure 33.3, with minimizer x* and starting point x(0). The gradient (derivative) f′
( x(0)), shown in orange, has a negative slope, so that a small step from
x(0) in the direction of increasing x results in a point x′ for which f( x′) < f( x(0)). Too large a step, however, results in a


Figure 33.3 A function f : ℝ → ℝ, shown in blue. Its gradient at point x(0), in orange, has a negative slope, and so a small increase in x from x(0) to x′ results in f(x′) < f(x(0)). Small increases in x from x(0) head toward a point that gives a local minimum. Too large an increase in x can end up at x″, where f(x″) > f(x(0)). Small steps starting from x(0) and going only in the direction of decreasing values of f cannot end up at the global minimizer x*.
point x″ for which f(x″) > f(x(0)), so this is a bad idea. Restricting ourselves to small steps, where each one has f(x′) < f(x), eventually results in getting close to a point that gives a local minimum. By
taking only small downhill steps, however, gradient descent has no
chance to get to the global minimizer x*, given the starting point x(0).
We draw two observations from this simple example. First, gradient
descent converges toward a local minimum, and not necessarily a global
minimum. Second, the speed at which it converges and how it behaves
are related to properties of the function, to the initial point, and to the
step size of the algorithm.
The procedure GRADIENT-DESCENT on the facing page takes as
input a function f, an initial point x(0) ∈ ℝ n, a fixed step-size multiplier γ > 0, and a number T > 0 of steps to take. Each iteration of the for loop of lines 2–4 performs a step by computing the n-dimensional gradient at
point x( t) and then moving distance γ in the opposite direction in the n-
dimensional space. The complexity of computing the gradient depends
on the function f and can sometimes be expensive. Line 3 sums the
points visited. After the loop terminates, line 6 returns x-avg, the
average of all the points visited except for the last one, x( T). It might seem more natural to return x( T), and in fact, in many circumstances,
you might prefer to have the function return x( T). For the version we
will analyze, however, we use x-avg.
GRADIENT-DESCENT(f, x(0), γ, T)
1  sum = 0                             // n-dimensional vector, initially all 0s
2  for t = 0 to T − 1
3      sum = sum + x(t)                // add each of n dimensions into sum
4      x(t+1) = x(t) − γ · (∇f)(x(t))  // (∇f)(x(t)) and x(t+1) are n-dimensional
5  x-avg = sum/T                       // divide each of n dimensions by T
6  return x-avg
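A direct Python rendering of GRADIENT-DESCENT, using plain lists as n-vectors, might look like this (the callback-style grad_f argument and the quadratic objective in the usage below are illustrative choices, not from the text):

```python
def gradient_descent(grad_f, x0, gamma, T):
    """GRADIENT-DESCENT: take T fixed-size steps against the gradient
    and return the average of the points visited (excluding the last).

    grad_f(x) returns the gradient of f at x; x0 and all iterates are
    lists representing n-vectors; gamma > 0 is the step-size multiplier.
    """
    n = len(x0)
    total = [0.0] * n
    x = list(x0)
    for _ in range(T):
        total = [s + xi for s, xi in zip(total, x)]       # sum = sum + x(t)
        g = grad_f(x)
        x = [xi - gamma * gi for xi, gi in zip(x, g)]     # x(t+1) = x(t) - gamma * grad
    return [s / T for s in total]                         # x-avg
```

For example, with f(x) = (x − 3)² (so grad_f(x) = [2(x − 3)]), step size 1/4, and T = 50, the returned average is close to the minimizer 3.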
Figure 33.4 depicts how gradient descent ideally runs on a convex 1-dimensional function.1 We’ll define convexity more formally below, but the figure shows that each iteration moves in the direction opposite to
the gradient, with the distance moved being proportional to the
magnitude of the gradient. As the iterations proceed, the magnitude of
the gradient decreases, and thus the distance moved along the
horizontal axis decreases. After each iteration, the distance to the
optimal point x* decreases. This ideal behavior is not guaranteed to
occur in general, but the analysis in the remainder of this section
formalizes when this behavior occurs and quantifies the number of
iterations needed. Gradient descent does not always work, however. We
have already seen that if the function is not convex, gradient descent can
converge to a local, rather than global, minimum. We have also seen
that if the step size is too large, GRADIENT-DESCENT can overshoot
the minimum and wind up farther away. (It is also possible to overshoot
the minimum and wind up closer to the optimum.)
Analysis of unconstrained gradient descent for convex functions

Our analysis of gradient descent focuses on convex functions. Inequality
(C.29) on page 1194 defines a convex function of one variable, as shown
in Figure 33.5. We can extend that definition to a function f : ℝn → ℝ and say that f is convex if for all x, y ∈ ℝn and for all 0 ≤ λ ≤ 1, we have

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).    (33.18)

(Inequalities (33.18) and (C.29) are the same, except for the dimensions of x and y.) We also assume that our convex functions are closed2 and differentiable.
Figure 33.4 An example of running gradient descent on a convex function f : ℝ → ℝ, shown in blue. Beginning at point x(0), each iteration moves in the direction opposite to the gradient, and the distance moved is proportional to the magnitude of the gradient. Orange lines represent the negative of the gradient at each point, scaled by the step size γ. As the iterations proceed, the magnitude of the gradient decreases, and the distance moved decreases correspondingly. After each iteration, the distance to the optimal point x* decreases.
Figure 33.5 A convex function f : ℝ → ℝ, shown in blue, with local and global minimizer x*.
Because f is convex, f( λ x + (1 – λ)y) ≤ λf(x) + (1 – λ) f(y) for any two values x and y and all 0 ≤ λ ≤
1, shown for a particular value of λ. Here, the orange line segment represents all values λf(x) + (1
– λ) f(y) for 0 ≤ λ ≤ 1, and it lies above the blue curve.
A convex function has the property that any local minimum is also a
global minimum. To verify this property, consider inequality (33.18),
and suppose for the purpose of contradiction that x is a local minimum
but not a global minimum and y ≠ x is a global minimum, so f(y) < f(x).
Then we have
f( λ x + (1 – λ)y) ≤ λf(x) + (1 – λ) f(y) (by inequality (33.18))
< λf(x) + (1 – λ) f(x)
= f(x).
Thus, letting λ approach 1, we see that there is another point near x, say
x′, such that f(x′) < f(x), so x is not a local minimum.
Convex functions have several useful properties. The first property,
whose proof we leave as Exercise 33.3-1, says that a convex function
always lies above its tangent hyperplane. In the context of gradient
descent, angle brackets denote the notation for inner product defined on
page 1219 rather than denoting a sequence.
Lemma 33.6
For any convex differentiable function f : ℝn → ℝ and for all x, y ∈ ℝn, we have f(x) ≤ f(y) + 〈(∇f)(x), x − y〉.
▪
The second property, which Exercise 33.3-2 asks you to prove, is a
repeated application of the definition of convexity in inequality (33.18).


Lemma 33.7
For any convex function f : ℝn → ℝ, for any integer T ≥ 1, and for all x(0), …, x(T−1) ∈ ℝn, we have

f((1/T) Σ_{t=0}^{T−1} x(t)) ≤ (1/T) Σ_{t=0}^{T−1} f(x(t)).    (33.19)
▪
The left-hand side of inequality (33.19) is the value of f at the vector
x-avg that GRADIENT-DESCENT returns.
We now proceed to analyze GRADIENT-DESCENT. It might not
return the exact global minimizer x*. We use an error bound ϵ, and we
want to choose T so that f(x-avg) – f(x*) ≤ ϵ at termination. The value of ϵ depends on the number T of iterations and two additional values.
First, since you expect it to be better to start close to the global minimizer, ϵ is a function of

R = ∥x(0) − x*∥,    (33.20)

the euclidean norm (or distance, defined on page 1219) of the difference between x(0) and x*. The error bound ϵ is also a function of a quantity we call L, which is an upper bound on the magnitude ∥(∇f)(x)∥ of the gradient, so that

∥(∇f)(x)∥ ≤ L,    (33.21)

where x ranges over all the points x(0), …, x(T−1) whose gradients are computed by GRADIENT-DESCENT. Of course, we don’t know the values of L and R, but for now let’s assume that we do. We’ll discuss later how to remove these assumptions. The analysis of GRADIENT-DESCENT is summarized in the following theorem.
Theorem 33.8
Let x* ∈ ℝn be the minimizer of a convex function f, and suppose that an execution of GRADIENT-DESCENT(f, x(0), γ, T) returns x-avg, where

γ = R/(L√T)

and R and L are defined in equations (33.20) and (33.21). Let ϵ = RL/√T. Then we have f(x-avg) − f(x*) ≤ ϵ.
▪
We now prove this theorem. We do not give an absolute bound on
how much progress each iteration makes. Instead, we use a potential
function, as in Section 16.3. Here, we define a potential Φ(t) after computing x(t), such that Φ(t) ≥ 0 for t = 0, …, T. We define the amortized progress in the iteration that computes x(t) as

p(t) = (f(x(t)) − f(x*)) + (Φ(t + 1) − Φ(t)).    (33.22)

Along with including the change in potential (Φ(t + 1) − Φ(t)), equation (33.22) also subtracts the minimum value f(x*) because ultimately, you care not about the values f(x(t)) but about how close they are to f(x*).
Suppose that we can show that p(t) ≤ B for some value B and t = 0, …, T − 1. Then we can substitute for p(t) using equation (33.22), giving

(f(x(t)) − f(x*)) + (Φ(t + 1) − Φ(t)) ≤ B.    (33.23)

Summing inequality (33.23) over t = 0, …, T − 1 yields

Σ_{t=0}^{T−1} (f(x(t)) − f(x*)) + Σ_{t=0}^{T−1} (Φ(t + 1) − Φ(t)) ≤ BT.

Observing that we have a telescoping series on the right and regrouping terms, we have that

Σ_{t=0}^{T−1} f(x(t)) ≤ BT + T · f(x*) + Φ(0) − Φ(T).    (33.24)

Dividing by T and dropping the positive term Φ(T) gives

(1/T) Σ_{t=0}^{T−1} f(x(t)) − f(x*) ≤ B + Φ(0)/T,

and thus, by Lemma 33.7, we have

f(x-avg) − f(x*) ≤ B + Φ(0)/T.    (33.25)
In other words, if we can show that p( t) ≤ B for some value B and choose a potential function where Φ(0) is not too large, then inequality
(33.25) tells us how close the function value f(x-avg) is to the function
value f(x*) after T iterations. That is, we can set the error bound ϵ to B
+ Φ(0)/ T.
In order to bound the amortized progress, we need to come up with
a concrete potential function. Define the potential function Φ(t) by

Φ(t) = ∥x(t) − x*∥² / (2γ),    (33.26)

that is, the potential function is proportional to the square of the distance between the current point and the minimizer x*. With this potential function in hand, the next lemma provides a bound on the amortized progress made in any iteration of GRADIENT-DESCENT.
Lemma 33.9
Let x* ∈ ℝn be the minimizer of a convex function f, and consider an execution of GRADIENT-DESCENT(f, x(0), γ, T). Then for each point x(t) computed by the procedure, we have that

p(t) ≤ γL²/2.
Proof We first bound the potential change Φ(t + 1) − Φ(t). Using the definition of Φ(t) from equation (33.26), we have

Φ(t + 1) − Φ(t) = (∥x(t+1) − x*∥² − ∥x(t) − x*∥²) / (2γ).    (33.27)

From line 4 in GRADIENT-DESCENT, we know that

x(t+1) − x(t) = −γ · (∇f)(x(t)),    (33.28)

and so we would like to rewrite equation (33.27) to have x(t+1) − x(t) terms. As Exercise 33.3-3 asks you to prove, for any two vectors a, b ∈ ℝn, we have

∥a + b∥² = ∥a∥² + 2〈a, b〉 + ∥b∥².    (33.29)

Letting a = x(t) − x* and b = x(t+1) − x(t), so that a + b = x(t+1) − x*, we can write the right-hand side of equation (33.27) as

(2〈x(t) − x*, x(t+1) − x(t)〉 + ∥x(t+1) − x(t)∥²) / (2γ).

Then we can express the potential change as

Φ(t + 1) − Φ(t) = −〈(∇f)(x(t)), x(t) − x*〉 + (γ/2) ∥(∇f)(x(t))∥²    (by equation (33.28)),    (33.30)

and thus, using the definition of L (inequality (33.21)), we have

Φ(t + 1) − Φ(t) ≤ −〈(∇f)(x(t)), x(t) − x*〉 + γL²/2.    (33.31)

We can now proceed to bound p(t). By the bound on the potential change from inequality (33.31), and applying Lemma 33.6 with y = x*, we have

p(t) = (f(x(t)) − f(x*)) + (Φ(t + 1) − Φ(t))
     ≤ 〈(∇f)(x(t)), x(t) − x*〉 − 〈(∇f)(x(t)), x(t) − x*〉 + γL²/2
     = γL²/2.
▪
Having bounded the amortized progress in one step, we now analyze
the entire GRADIENT-DESCENT procedure, completing the proof of
Theorem 33.8.
Proof of Theorem 33.8 Inequality (33.25) tells us that if we have an upper bound of B for p(t), then we also have the bound f(x-avg) − f(x*) ≤ B + Φ(0)/T. By equations (33.20) and (33.26), we have that Φ(0) = R²/(2γ). Lemma 33.9 gives us the upper bound B = γL²/2, and so we have

f(x-avg) − f(x*) ≤ γL²/2 + R²/(2γT).

Our choice of γ = R/(L√T) in the statement of Theorem 33.8 balances the two terms, and we obtain

f(x-avg) − f(x*) ≤ RL/(2√T) + RL/(2√T) = RL/√T.

Since we chose ϵ = RL/√T in the theorem statement, the proof is complete.
▪
Continuing under the assumption that we know R (from equation
(33.20)) and L (from inequality (33.21)), we can think of the analysis in
a slightly different way. We can presume that we have a target accuracy
ϵ and then compute the number of iterations needed. That is, we can solve

RL/√T ≤ ϵ

for T, obtaining T = R²L²/ϵ². The number of iterations thus depends on the square of R and L and, most importantly, on 1/ϵ². (The definition of L from inequality (33.21) depends on T, but we may know an upper bound on L that doesn’t depend on the particular value of T.) Thus, if you want to halve your error bound, you need to run four times as many iterations.
It is quite possible that we don’t really know R and L, since you’d
need to know x* in order to know R (since R = ∥x(0) – x*∥), and you
might not have an explicit upper bound on the gradient, which would
provide L. You can, however, interpret the analysis of gradient descent
as a proof that there is some step size for which the procedure makes
progress toward the minimum. You can then compute a step size for
which f(x( t)) – f(x( t+1)) is large enough. In fact, not having a fixed step size multiplier can actually help in practice, as you are free to use any
step size s that achieves sufficient decrease in the value of f. You can search for a step size that achieves a large decrease via a binary-search-like routine, which is often called line search. For a given function f and step size s, define the function g(x(t), s) = f(x(t) − s · (∇f)(x(t))). Start with a small step size s for which g(x(t), s) ≤ f(x(t)). Then repeatedly double s until g(x(t), 2s) ≥ g(x(t), s), and then perform a binary search in the interval [s, 2s]. This procedure can produce a step size that achieves a significant decrease in the objective function. In other circumstances,
however, you may know good upper bounds on R and L, typically from
problem-specific information, which can suffice.
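The doubling-and-bisection routine just described can be sketched as follows (a one-dimensional sketch under the assumption that the step-size objective g(x(t), s) is unimodal in s; the interval is narrowed by a ternary-style search rather than a strict binary search, and the function name and default parameters are our own):

```python
def line_search_step(f, grad, x, s0=1e-4, iters=20):
    """Find a step size s for which moving to x - s*grad(x) gives a
    large decrease in f.  x is a float (one dimension) to keep the
    sketch short; phi(s) below plays the role of g(x(t), s) in the text.
    """
    g = grad(x)
    phi = lambda s: f(x - s * g)
    s = s0
    while phi(2 * s) < phi(s):        # double while still improving
        s *= 2
    lo, hi = s, 2 * s                 # a minimum of phi lies in [s, 2s]
    for _ in range(iters):            # narrow the interval by bisection
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if phi(m1) <= phi(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2
```

For f(x) = (x − 3)² starting at x = 0, the routine finds a step size near 0.5, which jumps straight to the minimizer.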
The dominant computational step in each iteration of the for loop of
lines 2–4 is computing the gradient. The complexity of computing and
evaluating a gradient varies widely, depending on the application at
hand. We’ll discuss several applications later.
Constrained gradient descent
We can adapt gradient descent for constrained minimization to
minimize a closed convex function f(x), subject to the additional
requirement that x ∈ K, where K is a closed convex body. A body K ⊆
ℝ n is convex if for all x, y ∈ K, the convex combination λ x+(1– λ)y ∈ K
for all 0 ≤ λ ≤ 1. A closed convex body contains its limit points.
Somewhat surprisingly, restricting to the constrained problem does not
significantly increase the number of iterations of gradient descent. The
idea is that you run the same algorithm, but in each iteration, check whether the current point x( t) is still within the convex body K. If it is not, just move to the closest point in K. Moving to the closest point is
known as projection. We formally define the projection ∏ K(x) of a point x in n dimensions onto a convex body K as the point y ∈ K such that ∥x
– y∥ = min {∥x – z∥ : z ∈ K}. If we have x ∈ K, then ∏ K(x) = x.
This one change yields the procedure GRADIENT-DESCENT-
CONSTRAINED, in which line 4 of GRADIENT-DESCENT is
replaced by two lines. It assumes that x(0) ∈ K. Line 4 of GRADIENT-
DESCENT-CONSTRAINED moves in the direction of the negative
gradient, and line 5 projects back onto K. The lemma that follows helps
to show that when x* ∈ K, if the projection step in line 5 moves from a
point outside of K to a point in K, it cannot be moving away from x*.
GRADIENT-DESCENT-CONSTRAINED(f, x(0), γ, T, K)
1  sum = 0                              // n-dimensional vector, initially all 0
2  for t = 0 to T – 1
3      sum = sum + x(t)                 // add each of n dimensions into sum
4      x′(t+1) = x(t) – γ · (∇f)(x(t))  // (∇f)(x(t)), x′(t+1) are n-dimensional
5      x(t+1) = ∏K(x′(t+1))             // project onto K
6  x-avg = sum/T                        // divide each of n dimensions by T
7  return x-avg
Lemma 33.10
Consider a convex body K ⊆ ℝⁿ and points a ∈ K and b′ ∈ ℝⁿ. Let b = ∏K(b′). Then ∥b – a∥² ≤ ∥b′ – a∥².
Proof If b′ ∈ K, then b = b′ and the claim is true. Otherwise, b′ ≠ b, and as Figure 33.6 shows, we can extend the line segment between b and b′ to a line ℓ. Let c be the projection of a onto ℓ. Point c may or may not be in K, and if a is on the boundary of K, then c could coincide with b. If c coincides with b (part (c) of the figure), then abb′ is a right triangle, and so ∥b – a∥² ≤ ∥b′ – a∥². If c does not coincide with b (parts (a) and (b) of the figure), then because of convexity, the angle ∠abb′ must be obtuse. Because angle ∠abb′ is obtuse, b lies between c and b′ on ℓ. Furthermore, because c is the projection of a onto line ℓ, acb and acb′ must be right triangles. By the Pythagorean theorem, we have that ∥b′ – a∥² = ∥a – c∥² + ∥c – b′∥² and ∥b – a∥² = ∥a – c∥² + ∥c – b∥². Subtracting these two equations gives ∥b′ – a∥² – ∥b – a∥² = ∥c – b′∥² – ∥c – b∥². Because b is between c and b′, we must have ∥c – b′∥² ≥ ∥c – b∥², and thus ∥b′ – a∥² – ∥b – a∥² ≥ 0. The lemma follows.
Figure 33.6 Projecting a point b′ outside the convex body K to the closest point b = ∏K(b′) in K. Line ℓ is the line containing b and b′, and point c is the projection of a onto ℓ. (a) When c is in K. (b) When c is not in K. (c) When a is on the boundary of K and c coincides with b.
▪
We can now repeat the entire proof for the unconstrained case and obtain the same bounds. Lemma 33.10 with a = x*, b = x(t+1), and b′ = x′(t+1) tells us that ∥x(t+1) – x*∥² ≤ ∥x′(t+1) – x*∥². We can therefore derive an upper bound that matches inequality (33.31). We continue to define Φ(t) as in equation (33.26), noting that x(t+1), computed in line 5 of GRADIENT-DESCENT-CONSTRAINED, has a different meaning here than in inequality (33.31):


With the same upper bound on the change in the potential function as
in equation (33.30), the entire proof of Lemma 33.9 can proceed as
before. We can therefore conclude that the procedure GRADIENT-
DESCENT-CONSTRAINED has the same asymptotic complexity as
GRADIENT-DESCENT. We summarize this result in the following
theorem.
Theorem 33.11
Let K ⊆ ℝⁿ be a convex body, let x* ∈ ℝⁿ be the minimizer of a convex function f over K, and let γ = R/(L√T), where R and L are defined in equations (33.20) and (33.21). Suppose that the vector x-avg is returned by an execution of GRADIENT-DESCENT-CONSTRAINED(f, x(0), γ, T, K). Let T ≥ (RL/ϵ)². Then we have f(x-avg) – f(x*) ≤ ϵ.
▪
Applications of gradient descent
Gradient descent has many applications to minimizing functions and is
widely used in optimization and machine learning. Here we sketch how
it can be used to solve linear systems. Then we discuss an application to
machine learning: prediction using linear regression.
In Chapter 28, we saw how to use Gaussian elimination to solve a system of linear equations Ax = b, thereby computing x = A⁻¹b. If A is an n × n matrix and b is a length-n vector, then the running time of Gaussian elimination is Θ(n³), which for large matrices might be prohibitively expensive. If an approximate solution is acceptable, however, you can use gradient descent.
First, let’s see how to use gradient descent as a roundabout—and admittedly inefficient—way to solve for x in the scalar equation ax = b, where a, x, b ∈ ℝ. This equation is equivalent to ax – b = 0. If ax – b is the derivative of a convex function f(x), then ax – b = 0 for the value of x that minimizes f(x). Given f(x), gradient descent can then determine this minimizer. Of course, f(x) is just the integral of ax – b, that is, f(x) = ax²/2 – bx, which is convex if a ≥ 0. Therefore, one way to solve ax = b for a ≥ 0 is to find the minimizer of f(x) = ax²/2 – bx via gradient descent.
We now generalize this idea to higher dimensions, where using gradient descent may actually lead to a faster algorithm. One n-dimensional analog is the function f(x) = xᵀAx/2 – bᵀx, where A is an n × n matrix. The gradient of f with respect to x is the function Ax – b. To find the value of x that minimizes f, we set the gradient of f to 0 and solve for x. Solving Ax – b = 0 for x, we obtain x = A⁻¹b. Thus, minimizing f(x) is equivalent to solving Ax = b. If f(x) is convex, then gradient descent can approximately compute this minimum.
A 1-dimensional function is convex when its second derivative is nonnegative. The equivalent definition for a multidimensional function is that it is convex when its Hessian matrix is positive-semidefinite (see page 1222 for a definition), where the Hessian matrix (∇²f)(x) of a function f(x) is the matrix in which entry (i, j) is the second partial derivative of f with respect to xᵢ and xⱼ, that is, ∂²f/∂xᵢ∂xⱼ.
Analogous to the 1-dimensional case, the Hessian of f is just A, and so if A is a positive-semidefinite matrix, then we can use gradient descent to find a point x where Ax ≈ b. If R and L are not too large, then this method is faster than using Gaussian elimination.
Gradient descent in machine learning
As a concrete example of supervised learning for prediction, suppose
that you want to predict whether a patient will develop heart disease.
For each of m patients, you have n different attributes. For example, you might have n = 4 and the four pieces of data are age, height, blood pressure, and number of close family members with heart disease.
Denote the data for patient i as a vector x(i) ∈ ℝⁿ, whose jth entry gives the value of attribute j for patient i. The label of patient i is denoted by a scalar y(i) ∈ ℝ, signifying the severity of the patient’s heart disease. The hypothesis should capture a relationship between the x(i) values and y(i). For this example, we make the modeling assumption that the relationship is linear, and therefore the goal is to compute the “best” linear relationship between the x(i) values and y(i): a linear function f : ℝⁿ → ℝ such that f(x(i)) ≈ y(i) for each patient i. Of course, no such function may exist, but you would like one that comes as close as
possible. A linear function f can be defined by a vector of weights w = (w₀, w₁, …, wₙ), with

  f(x) = w₀ + w₁x₁ + w₂x₂ + ⋯ + wₙxₙ,    (33.32)

where xⱼ denotes the jth entry of x.
When evaluating a machine-learning model, you need to measure
how close each value f(x(i)) is to its corresponding label y(i). In this example, we define the error e(i) ∈ ℝ associated with patient i as e(i) = f(x(i)) – y(i). The objective function we choose is to minimize the sum of squares of the errors, which is

  (e(1))² + (e(2))² + ⋯ + (e(m))².    (33.33)

The objective function is typically called the loss function, and the least-squares error given by equation (33.33) is just one example of many possible loss functions. The goal is then, given the x(i) and y(i) values, to compute the weights w₀, w₁, …, wₙ so as to minimize the loss function in equation (33.33). The variables here are the weights w₀, w₁, …, wₙ and not the x(i) or y(i) values.
This particular objective is sometimes known as a least-squares fit,
and the problem of finding a linear function to fit data and minimize the
least-squares error is called linear regression. Finding a least-squares fit
is also addressed in Section 28.3.
When the function f is linear, the loss function defined in equation
(33.33) is convex, because it is the sum of squares of linear functions,
which are themselves convex. Therefore, we can apply gradient descent
to compute a set of weights to approximately minimize the least-squares
error. The concrete goal of learning is to be able to make predictions on
new data. Informally, if the features are all reported in the same units
and are from the same range (perhaps from being normalized), then the
weights tend to have a natural interpretation because the features of the
data that are better predictors of the label have a larger associated
weight. For example, you would expect that, after normalization, the
weight associated with the number of family members with heart
disease would be larger than the weight associated with height.
The computed weights form a model of the data. Once you have a
model, you can make predictions, so that given new data, you can
predict its label. In our example, given a new patient x′ who is not part
of the original training data set, you would still hope to predict the
chance that the new patient develops heart disease. You can do so by
computing the label f(x′), incorporating the weights computed by
gradient descent.
For this linear-regression problem, the objective is to minimize the expression in equation (33.33), which is a quadratic in each of the n + 1 weights wⱼ. Thus, entry j in the gradient is linear in the weights. Exercise 33.3-5 asks you to explicitly compute the gradient and see that it can be computed in O(nm) time, which is linear in the input size. Compared with the exact method of solving equation (33.33) in Chapter 28, which needs to invert a matrix, gradient descent is typically much faster.
Section 33.1 briefly discussed regularization—the idea that a complicated hypothesis should be penalized in order to avoid overfitting
the training data. Regularization often involves adding a term to the
objective function, but it can also be achieved by adding a constraint.
One way to regularize this example would be to explicitly limit the norm
of the weights, adding a constraint that ∥w∥ ≤ B for some bound B > 0.
(Recall again that the components of the vector w are the variables in
the present application.) Adding this constraint controls the complexity
of the model, as the number of values wj that can have large absolute
value is now limited.
In order to run GRADIENT-DESCENT-CONSTRAINED for any
problem, you need to implement the projection step, as well as to
compute bounds on R and L. We conclude this section by describing these calculations for gradient descent with the constraint ∥w∥ ≤ B.
First, consider the projection step in line 5. Suppose that the update in line 4 results in a vector w′. The projection is implemented by computing ∏K(w′), where K is defined by ∥w∥ ≤ B. If w′ ∈ K, the projection leaves it unchanged. Otherwise, this particular projection can be accomplished by simply scaling w′, since the closest point in K to w′ must be the point along the vector w′ whose norm is exactly B. The amount z by which we need to scale w′ to hit the boundary of K is the solution to the equation z∥w′∥ = B, which is solved by z = B/∥w′∥. Hence line 5 is implemented by computing w = w′B/∥w′∥. Because we always have ∥w∥ ≤ B, Exercise 33.3-6 asks you to show that the upper bound on the magnitude L of the gradient is O(B).
We also get a bound on R, as follows. By the constraint ∥w∥ ≤ B, we know that both ∥w(0)∥ ≤ B and ∥w*∥ ≤ B, and thus ∥w(0) – w*∥ ≤ 2B. Using the definition of R in equation (33.20), we have R = O(B). The bound RL/√T on the accuracy of the solution after T iterations in Theorem 33.11 becomes O(B²/√T).
Exercises
33.3-1
Prove Lemma 33.6. Start from the definition of a convex function given
in equation (33.18). ( Hint: You can prove the statement when n = 1 first.
The proof for general values of n is similar.)
33.3-2
Prove Lemma 33.7.
33.3-3
Prove equation (33.29). ( Hint: The proof for n = 1 dimension is straightforward. The proof for general values of n follows
along similar lines.)
33.3-4
Show that the function f in equation (33.32) is a convex function of the
variables w 0, w 1, …, wn.
33.3-5
Compute the gradient of expression (33.33) and explain how to evaluate
the gradient in O( nm) time.
33.3-6
Consider the function f defined in equation (33.32), and suppose that you have a bound ∥w∥ ≤ B, as is considered in the discussion on
regularization. Show that L = O( B) in this case.
33.3-7
Equation (33.2) on page 1009 gives a function that, when minimized,
gives an optimal solution to the k-means problem. Explain how to use
gradient descent to solve the k-means problem.
Problems

33-1 Newton’s method
Gradient descent iteratively moves closer to a desired value (the
minimum) of a function. Another algorithm in this spirit is known as
Newton’s method, which is an iterative algorithm that finds the root of a
function. Here, we consider Newton’s method which, given a function f : ℝ → ℝ, finds a value x* such that f(x*) = 0. The algorithm moves through a series of points x(0), x(1), …. If the algorithm is currently at a point x(t), then to find point x(t+1), it first takes the equation of the line tangent to the curve at x = x(t),

  y = f′(x(t))(x – x(t)) + f(x(t)).
It then uses the x-intercept of this line as the next point x( t+1).
a. Show that the algorithm described above can be summarized by the update rule

  x(t+1) = x(t) – f(x(t))/f′(x(t)).
We restrict our attention to some domain I and assume that f′(x) ≠ 0 for all x ∈ I and that f″(x) is continuous. We also assume that the starting point x(0) is sufficiently close to x*, where “sufficiently close” means that we can use only the first two terms of the Taylor expansion of f(x*) about x(0), namely

  f(x*) = f(x(0)) + f′(x(0))(x* – x(0)) + f″(γ(0))(x* – x(0))²/2,    (33.34)

where γ(0) is some value between x(0) and x*. If the approximation in equation (33.34) holds for x(0), it also holds for any point closer to x*.
b. Assume that the function f has exactly one point x* for which f(x*) = 0. Let ϵ(t) = |x(t) – x*|. Using the Taylor expansion in equation (33.34), show that

  ϵ(t+1) = (ϵ(t))² |f″(γ(t))| / (2 |f′(x(t))|),

where γ(t) is some value between x(t) and x*.
c. If ϵ(t+1) ≤ c (ϵ(t))² for some constant c and ϵ(0) < 1, then we say that the function f has quadratic convergence, since the error decreases quadratically. Assuming that f has quadratic convergence, how many iterations are needed to find a root of f(x) to an accuracy of δ? Your answer should include δ.
d. Suppose you wish to find a root of the function f(x) = (x – 3)², which is also the minimizer, and you start at x(0) = 3.5. Compare the number of iterations needed by gradient descent to find the minimizer and Newton’s method to find the root.
33-2 Hedge
Another variant in the multiplicative-weights framework is known as
HEDGE. It differs from WEIGHTED MAJORITY in two ways. First,
HEDGE makes the prediction randomly—in iteration t, it assigns a probability pᵢ(t) = wᵢ(t)/W(t) to expert Eᵢ, where W(t) = w₁(t) + w₂(t) + ⋯ + wₙ(t). It then chooses an expert Eᵢ′ according to this probability distribution and predicts according to Eᵢ′. Second, the update rule is different. If an expert makes a mistake, line 16 updates that expert’s weight by the rule wᵢ = wᵢ(1 – ϵ), for some 0 < ϵ < 1. Show that the expected number of mistakes made by HEDGE, running for T rounds, is at most m* + (ln n)/ϵ + ϵT.
33-3 Nonoptimality of Lloyd’s procedure in one dimension
Give an example to show that even in one dimension, Lloyd’s procedure
for finding clusters does not always return an optimum result. That is,
Lloyd’s procedure may terminate and return as a result a set C of clusters that does not minimize f(S, C), even when S is a set of points on a line.
33-4 Stochastic gradient descent
Consider the problem described in Section 33.3 of fitting a line f(x) = ax + b to a given set of point/value pairs S = {(x₁, y₁), …, (xT, yT)} by optimizing the choice of the parameters a and b using gradient descent to find a best least-squares fit. Here we consider the case where x is a real-valued variable, rather than a vector.
Suppose that you are not given the point/value pairs in S all at once, but only one at a time in an online manner. Furthermore, the points are given in random order. That is, you know that there are T points, but in iteration t you are given only (xᵢ, yᵢ), where i is independently and randomly chosen from {1, …, T}.
You can use gradient descent to compute an estimate of the function. As each point (xᵢ, yᵢ) is considered, you can update the current values of a and b by taking the derivative with respect to a and b of the term of the objective function depending on (xᵢ, yᵢ). Doing so gives you a stochastic estimate of the gradient, and you can then take a small step in the opposite direction.
Give pseudocode to implement this variant of gradient descent. What would the expected value of the error be as a function of T, L, and R? ( Hint: Replicate the analysis of GRADIENT-DESCENT in Section 33.3.)
This procedure and its variants are known as stochastic gradient descent.
Chapter notes
For a general introduction to artificial intelligence, we recommend
Russell and Norvig [391]. For a general introduction to machine learning, we recommend Murphy [340].
Lloyd’s procedure for the k-means problem was first proposed by Lloyd [304] and also later by Forgy [151]. It is sometimes called “Lloyd’s algorithm” or the “Lloyd-Forgy algorithm.” Although Mahajan et al.
[310] showed that finding an optimal clustering is NP-hard, even in the plane, Kanungo et al. [241] have shown that there is an approximation algorithm for the k-means problem with approximation ratio 9 + ϵ, for
any ϵ > 0.
The multiplicative-weights method is surveyed by Arora, Hazan, and
Kale [25]. The main idea of updating weights based on feedback has been rediscovered many times. One early use is in game theory, where
Brown defined “Fictitious Play” [74] and conjectured its convergence to the value of a zero-sum game. The convergence properties were
established by Robinson [382].
In machine learning, the first use of multiplicative weights was by
Littlestone in the Winnow algorithm [300], which was later extended by Littlestone and Warmuth to the weighted-majority algorithm described
in Section 33.2 [301]. This work is closely connected to the boosting algorithm, originally due to Freund and Schapire [159]. The multiplicative-weights idea is also closely related to several more general
optimization algorithms, including the perceptron algorithm [328] and algorithms for optimization problems such as packing linear programs.
The treatment of gradient descent in this chapter draws heavily on
the unpublished manuscript of Bansal and Gupta [35]. They emphasize the idea of using a potential function and using ideas from amortized
analysis to explain gradient descent. Other presentations and analyses
of gradient descent include works by Bubeck [75], Boyd and Vandenberghe [69], and Nesterov [343].
Gradient descent is known to converge faster when functions obey stronger properties than general convexity. For example, a function f is α-strongly convex if f(y) ≥ f(x) + 〈(∇f)(x), y – x〉 + (α/2)∥y – x∥² for all x, y ∈ ℝⁿ. In this case, GRADIENT-DESCENT can use a variable step size and return x(T). The step size at step t becomes γt = 1/(α(t + 1)), and the procedure returns a point such that f(x-avg) – f(x*) ≤ L²/(α(T + 1)). This convergence is better than that of Theorem 33.8 because the number of iterations needed is linear, rather than quadratic, in 1/ϵ for the desired error parameter ϵ, and because the performance is independent of the initial point.
Another case in which gradient descent can be shown to perform
better than the analysis in Section 33.3 suggests is for smooth convex functions. We say that a function f is β-smooth if f(y) ≤ f(x) + 〈(∇f)(x), y – x〉 + (β/2)∥y – x∥² for all x, y ∈ ℝⁿ. This inequality goes in the opposite direction from the one for α-strong convexity. Better bounds on gradient descent are possible here as well.
1 Although the curve in Figure 33.4 looks concave, according to the definition of convexity that we’ll see below, the function f in the figure is convex.
2 A function f : ℝⁿ → ℝ is closed if, for each α ∈ ℝ, the set {x ∈ dom(f) : f(x) ≤ α} is closed, where dom(f) is the domain of f.
Almost all the algorithms we have studied thus far have been polynomial-time algorithms: on inputs of size n, their worst-case running time is O(nᵏ) for some constant k. You might wonder whether all problems can be solved in polynomial time. The answer is no. For example, there are problems, such as Turing’s famous “Halting Problem,” that cannot be solved by any computer, no matter how long you’re willing to wait for an answer.¹ There are also problems that can be solved, but not in O(nᵏ) time for any constant k. Generally, we think of problems that are solvable by polynomial-time algorithms as being tractable, or “easy,” and problems that require superpolynomial time as being intractable, or “hard.”
The subject of this chapter, however, is an interesting class of
problems, called the “NP-complete” problems, whose status is
unknown. No polynomial-time algorithm has yet been discovered for an
NP-complete problem, nor has anyone yet been able to prove that no
polynomial-time algorithm can exist for any one of them. This so-called
P ≠ NP question has been one of the deepest, most perplexing open
research problems in theoretical computer science since it was first
posed in 1971.
Several NP-complete problems are particularly tantalizing because
they seem on the surface to be similar to problems that we know how to
solve in polynomial time. In each of the following pairs of problems, one
is solvable in polynomial time and the other is NP-complete, but the
difference between the problems appears to be slight:
Shortest versus longest simple paths: In Chapter 22, we saw that even with negative edge weights, we can find shortest paths from a single source in a directed graph G = (V, E) in O(VE) time. Finding a longest simple path between two vertices is difficult, however. Merely determining whether a graph contains a simple path with at least a given number of edges is NP-complete.
Euler tour versus hamiltonian cycle: An Euler tour of a strongly connected, directed graph G = (V, E) is a cycle that traverses each edge of G exactly once, although it is allowed to visit each vertex more than once. Problem 20-3 on page 583 asks you to show how to determine whether a strongly connected, directed graph has an Euler tour and, if it does, the order of the edges in the Euler tour, all in O(E) time. A hamiltonian cycle of a directed graph G = (V, E) is a simple cycle that contains each vertex in V. Determining whether a directed graph has a hamiltonian cycle is NP-complete. (Later in this chapter, we’ll prove that determining whether an undirected graph has a hamiltonian cycle is NP-complete.)
2-CNF satisfiability versus 3-CNF satisfiability: Boolean formulas
contain binary variables whose values are 0 or 1; boolean connectives
such as ∧ (AND), ∨ (OR), and ¬ (NOT); and parentheses. A boolean
formula is satisfiable if there exists some assignment of the values 0
and 1 to its variables that causes it to evaluate to 1. We’ll define terms
more formally later in this chapter, but informally, a boolean formula
is in k-conjunctive normal form, or k-CNF, if it is the AND of clauses of ORs of exactly k variables or their negations. For example, the boolean formula (x₁ ∨ x₂) ∧ (¬x₁ ∨ x₃) ∧ (¬x₂ ∨ ¬x₃) is in 2-CNF (with satisfying assignment x₁ = 1, x₂ = 0, and x₃ = 1). Although there is a polynomial-time algorithm to determine whether a 2-CNF formula is satisfiable, we’ll see later in this chapter that determining whether a 3-CNF formula is satisfiable is NP-complete.
NP-completeness and the classes P and NP
Throughout this chapter, we refer to three classes of problems: P, NP, and NPC, the latter class being the NP-complete problems. We describe
them informally here, with formal definitions to appear later on.
The class P consists of those problems that are solvable in
polynomial time. More specifically, they are problems that can be solved
in O( nk) time for some constant k, where n is the size of the input to the problem. Most of the problems examined in previous chapters belong to
P. The class NP consists of those problems that are “verifiable” in
polynomial time. What do we mean by a problem being verifiable? If
you were somehow given a “certificate” of a solution, then you could
verify that the certificate is correct in time polynomial in the size of the
input to the problem. For example, in the hamiltonian-cycle problem,
given a directed graph G = (V, E), a certificate would be a sequence 〈v₁, v₂, v₃, …, v|V|〉 of |V| vertices. You could check in polynomial time that the sequence contains each of the |V| vertices exactly once, that (vᵢ, vᵢ₊₁) ∈ E for i = 1, 2, 3, …, |V| – 1, and that (v|V|, v₁) ∈ E. As another example, for 3-CNF satisfiability, a certificate could be an assignment of
values to variables. You could check in polynomial time that this
assignment satisfies the boolean formula.
Any problem in P also belongs to NP, since if a problem belongs to P
then it is solvable in polynomial time without even being supplied a
certificate. We’ll formalize this notion later in this chapter, but for now
you can believe that P ⊆ NP. The famous open question is whether P is
a proper subset of NP.
Informally, a problem belongs to the class NPC—and we call it NP-
complete—if it belongs to NP and is as “hard” as any problem in NP.
We’ll formally define what it means to be as hard as any problem in NP
later in this chapter. In the meantime, we state without proof that if any
NP-complete problem can be solved in polynomial time, then every
problem in NP has a polynomial-time algorithm. Most theoretical
computer scientists believe that the NP-complete problems are
intractable, since given the wide range of NP-complete problems that
have been studied to date—without anyone having discovered a
polynomial-time solution to any of them—it would be truly astounding
if all of them could be solved in polynomial time. Yet, given the effort
devoted thus far to proving that NP-complete problems are intractable
—without a conclusive outcome—we cannot rule out the possibility
that the NP-complete problems could turn out to be solvable in
polynomial time.
To become a good algorithm designer, you must understand the
rudiments of the theory of NP-completeness. If you can establish a
problem as NP-complete, you provide good evidence for its
intractability. As an engineer, you would then do better to spend your
time developing an approximation algorithm (see Chapter 35) or solving a tractable special case, rather than searching for a fast
algorithm that solves the problem exactly. Moreover, many natural and
interesting problems that on the surface seem no harder than sorting,
graph searching, or network flow are in fact NP-complete. Therefore,
you should become familiar with this remarkable class of problems.
Overview of showing problems to be NP-complete
The techniques used to show that a particular problem is NP-complete
differ fundamentally from the techniques used throughout most of this
book to design and analyze algorithms. If you can demonstrate that a
problem is NP-complete, you are making a statement about how hard it
is (or at least how hard we think it is), rather than about how easy it is.
If you prove a problem NP-complete, you are saying that searching for an efficient algorithm is likely to be a fruitless endeavor. In this way, NP-
completeness proofs bear some similarity to the proof in Section 8.1 of an Ω(n lg n)-time lower bound for any comparison sort algorithm, although the specific techniques used for showing NP-completeness
differ from the decision-tree method used in Section 8.1.
We rely on three key concepts in showing a problem to be NP-
complete:
Decision problems versus optimization problems
Many problems of interest are optimization problems, in which each
feasible (i.e., “legal”) solution has an associated value, and the goal is to
find a feasible solution with the best value. For example, in a problem that we call SHORTEST-PATH, the input is an undirected graph G and
vertices u and v, and the goal is to find a path from u to v that uses the fewest edges. In other words, SHORTEST-PATH is the single-pair
shortest-path problem in an unweighted, undirected graph. NP-
completeness applies directly not to optimization problems, however,
but to decision problems, in which the answer is simply “yes” or “no”
(or, more formally, “1” or “0”).
Although NP-complete problems are confined to the realm of
decision problems, there is usually a way to cast a given optimization
problem as a related decision problem by imposing a bound on the
value to be optimized. For example, a decision problem related to
SHORTEST-PATH is PATH: given an undirected graph G, vertices u
and v, and an integer k, does a path exist from u to v consisting of at most k edges?
The relationship between an optimization problem and its related
decision problem works in your favor when you try to show that the
optimization problem is “hard.” That is because the decision problem is
in a sense “easier,” or at least “no harder.” As a specific example, you
can solve PATH by solving SHORTEST-PATH and then comparing the
number of edges in the shortest path found to the value of the decision-
problem parameter k. In other words, if an optimization problem is
easy, its related decision problem is easy as well. Stated in a way that has
more relevance to NP-completeness, if you can provide evidence that a
decision problem is hard, you also provide evidence that its related
optimization problem is hard. Thus, even though it restricts attention to
decision problems, the theory of NP-completeness often has
implications for optimization problems as well.
Reductions
The above notion of showing that one problem is no harder or no easier
than another applies even when both problems are decision problems.
Almost every NP-completeness proof takes advantage of this idea, as
follows. Consider a decision problem A, which you would like to solve
in polynomial time. We call the input to a particular problem an
instance of that problem. For example, in PATH, an instance is a
particular graph G, particular vertices u and v of G, and a particular integer k. Now suppose that you already know how to solve a different
decision problem B in polynomial time. Finally, suppose that you have a
procedure that transforms any instance α of A into some instance β of B
with the following characteristics:
Figure 34.1 How to use a polynomial-time reduction algorithm to solve a decision problem A in polynomial time, given a polynomial-time decision algorithm for another problem B. In polynomial time, transform an instance α of A into an instance β of B, solve B in polynomial time, and use the answer for β as the answer for α.
The transformation takes polynomial time.
The answers are the same. That is, the answer for α is “yes” if and
only if the answer for β is also “yes.”
We call such a procedure a polynomial-time reduction algorithm and, as
Figure 34.1 shows, it provides us a way to solve problem A in polynomial time:
1. Given an instance α of problem A, use a polynomial-time
reduction algorithm to transform it to an instance β of problem
B.
2. Run the polynomial-time decision algorithm for B on the
instance β.
3. Use the answer for β as the answer for α.
As long as each of these steps takes polynomial time, all three together
do also, and so you have a way to decide on α in polynomial time. In
other words, by “reducing” solving problem A to solving problem B, you use the “easiness” of B to prove the “easiness” of A.
Recalling that NP-completeness is about showing how hard a
problem is rather than how easy it is, you use polynomial-time
reductions in the opposite way to show that a problem is NP-complete.
Let’s take the idea a step further and show how you can use polynomial-
time reductions to show that no polynomial-time algorithm can exist for
a particular problem B. Suppose that you have a decision problem A for
which you already know that no polynomial-time algorithm can exist.
(Ignore for the moment how to find such a problem A.) Suppose further
that you have a polynomial-time reduction transforming instances of A
to instances of B. Now you can use a simple proof by contradiction to
show that no polynomial-time algorithm can exist for B. Suppose
otherwise, that is, suppose that B has a polynomial-time algorithm.
Then, using the method shown in Figure 34.1, you would have a way to solve problem A in polynomial time, which contradicts the assumption
that there is no polynomial-time algorithm for A.
To prove that a problem B is NP-complete, the methodology is
similar. Although you cannot assume that there is absolutely no
polynomial-time algorithm for problem A, you prove that problem B is
NP-complete on the assumption that problem A is also NP-complete.
A first NP-complete problem
Because the technique of reduction relies on having a problem already
known to be NP-complete in order to prove a different problem NP-
complete, there must be some “first” NP-complete problem. We’ll use
the circuit-satisfiability problem, in which the input is a boolean
combinational circuit composed of AND, OR, and NOT gates, and the
question is whether there exists some set of boolean inputs to this circuit
that causes its output to be 1. Section 34.3 will prove that this first problem is NP-complete.
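To make the question concrete, here is a brute-force satisfiability check (our sketch, not a polynomial-time algorithm): it represents a circuit as a Python function of its boolean inputs and tries all 2^n assignments, which is exactly the exponential search that NP-completeness suggests cannot be avoided in general.

```python
from itertools import product

def circuit_satisfiable(circuit, n_inputs):
    """Does some 0/1 assignment to the inputs make the circuit
    output 1?  Tries all 2^n assignments: exponential time."""
    return any(circuit(*bits) for bits in product((0, 1), repeat=n_inputs))

# Example circuit built from AND, OR, and NOT gates:
# output = (x1 OR x2) AND (NOT x3)
def example(x1, x2, x3):
    return (x1 | x2) & (1 - x3)
```

Note that, in contrast, *verifying* a single proposed assignment takes only one evaluation of the circuit.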
Chapter outline
This chapter studies the aspects of NP-completeness that bear most
directly on the analysis of algorithms. Section 34.1 formalizes the notion of “problem” and defines the complexity class P of polynomial-time
solvable decision problems. We’ll also see how these notions fit into the
framework of formal-language theory. Section 34.2 defines the class NP
of decision problems whose solutions are verifiable in polynomial time.
It also formally poses the P ≠ NP question.
Section 34.3 shows how to relate problems via polynomial-time
“reductions.” It defines NP-completeness and sketches a proof that the
circuit-satisfiability problem is NP-complete. With one problem proven
NP-complete, Section 34.4 demonstrates how to prove other problems to be NP-complete much more simply by the methodology of
reductions. To illustrate this methodology, the section shows that two
formula-satisfiability problems are NP-complete. Section 34.5 proves a variety of other problems to be NP-complete by using reductions. You
will probably find several of these reductions to be quite creative,
because they convert a problem in one domain to a problem in a
completely different domain.
Since NP-completeness relies on notions of solving a problem and
verifying a certificate in polynomial time, let’s first examine what it
means for a problem to be solvable in polynomial time.
Recall that we generally regard problems that have polynomial-time
solutions as tractable. Here are three reasons why:
1. Although no reasonable person considers a problem that
requires Θ(n^100) time to be tractable, few practical problems
require time on the order of such a high-degree polynomial. The
polynomial-time computable problems encountered in practice
typically require much less time. Experience has shown that once
the first polynomial-time algorithm for a problem has been
discovered, more efficient algorithms often follow. Even if the
current best algorithm for a problem has a running time of
Θ(n^100), an algorithm with a much better running time will
likely soon be discovered.
2. For many reasonable models of computation, a problem that can
be solved in polynomial time in one model can be solved in
polynomial time in another. For example, the class of problems
solvable in polynomial time by the serial random-access machine
used throughout most of this book is the same as the class of
problems solvable in polynomial time on abstract Turing
machines.2 It is also the same as the class of problems solvable in polynomial time on a parallel computer when the number of
processors grows polynomially with the input size.
3. The class of polynomial-time solvable problems has nice closure
properties, since polynomials are closed under addition,
multiplication, and composition. For example, if the output of
one polynomial-time algorithm is fed into the input of another,
the composite algorithm is polynomial. Exercise 34.1-5 asks you
to show that if an algorithm makes a constant number of calls to
polynomial-time subroutines and performs an additional
amount of work that also takes polynomial time, then the
running time of the composite algorithm is polynomial.
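The closure under composition can be checked concretely: substituting one polynomial into another yields another polynomial, which is why a polynomial-time algorithm whose input is produced by another polynomial-time algorithm still runs in polynomial time overall. This sketch (ours) represents polynomials as coefficient lists, lowest degree first, and composes them with Horner's rule:

```python
def poly_add(p, q):
    """Add coefficient lists (lowest degree first)."""
    n = max(len(p), len(q))
    p = p + [0] * (n - len(p))
    q = q + [0] * (n - len(q))
    return [a + b for a, b in zip(p, q)]

def poly_mul(p, q):
    """Multiply coefficient lists."""
    r = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

def poly_compose(p, q):
    """Compute p(q(n)): Horner's rule with polynomial arithmetic.
    The result is again a coefficient list, i.e. a polynomial."""
    result = [0]
    for coeff in reversed(p):
        result = poly_add(poly_mul(result, q), [coeff])
    return result

# Composing n^2 with n^3 gives n^6: still a polynomial.
sixth = poly_compose([0, 0, 1], [0, 0, 0, 1])
```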
Abstract problems
To understand the class of polynomial-time solvable problems, you
must first have a formal notion of what a “problem” is. We define an
abstract problem Q to be a binary relation on a set I of problem instances and a set S of problem solutions. For example, an instance for SHORTEST-PATH is a triple consisting of a graph and two vertices. A
solution is a sequence of vertices in the graph, with perhaps the empty
sequence denoting that no path exists. The problem SHORTEST-PATH
itself is the relation that associates each instance of a graph and two
vertices with a shortest path in the graph that connects the two vertices.
Since shortest paths are not necessarily unique, a given problem
instance may have more than one solution.
This formulation of an abstract problem is more general than
necessary for our purposes. As we saw above, the theory of NP-
completeness restricts attention to decision problems: those having a
yes/no solution. In this case, we can view an abstract decision problem
as a function that maps the instance set I to the solution set {0, 1}. For
example, a decision problem related to SHORTEST-PATH is the
problem PATH that we saw earlier. If i = 〈 G, u, v, k〉 is an instance of PATH, then PATH( i) = 1 (yes) if G contains a path from u to v with at most k edges, and PATH( i) = 0 (no) otherwise. Many abstract problems
are not decision problems, but rather optimization problems, which
require some value to be minimized or maximized. As we saw above,
however, you can usually recast an optimization problem as a decision
problem that is no harder.
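The recasting can be sketched directly. Given a decision procedure for PATH (a simple breadth-first search stands in for it here; the representation of G as an adjacency dict is our own choice), you can solve the optimization problem of finding the fewest edges on a u-to-v path by asking the decision question for k = 0, 1, 2, … and returning the first yes:

```python
from collections import deque

def path_decision(adj, u, v, k):
    """PATH(<G, u, v, k>): is there a path from u to v with at
    most k edges?  BFS computes fewest-edge distances."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return v in dist and dist[v] <= k

def shortest_path_length(adj, u, v):
    """Solve the optimization problem via repeated decisions:
    the answer is the smallest k for which PATH says yes."""
    for k in range(len(adj)):
        if path_decision(adj, u, v, k):
            return k
    return None  # no path exists
```

The number of oracle calls is at most |V|, so if the decision problem is polynomial-time solvable, so is the optimization problem.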
Encodings
In order for a computer program to solve an abstract problem, its
problem instances must appear in a way that the program understands.
An encoding of a set S of abstract objects is a mapping e from S to the set of binary strings.3 For example, we are all familiar with encoding the natural numbers ℕ = {0, 1, 2, 3, 4,…} as the strings {0, 1, 10, 11, 100,
…}. Using this encoding, e(17) = 10001. If you have looked at computer
representations of keyboard characters, you probably have seen the
ASCII code, where, for example, the encoding of A is 01000001. You can
encode a compound object as a binary string by combining the
representations of its constituent parts. Polygons, graphs, functions,
ordered pairs, programs—all can be encoded as binary strings.
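For instance, in Python the binary encoding of a natural number, and one possible way to combine two encoded parts into a decodable compound string (the self-delimiting length prefix is our own choice; many schemes work), look like this:

```python
def encode_nat(n: int) -> str:
    """Binary encoding of a natural number, e.g. e(17) = '10001'."""
    return bin(n)[2:]

def encode_pair(x: str, y: str) -> str:
    """Encode a compound object: prefix each part with its length
    in self-delimiting form (k ones then a zero), so the pair can
    be decoded unambiguously."""
    def self_delimit(s):
        return "1" * len(s) + "0" + s
    return self_delimit(x) + self_delimit(y)
```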
Thus, a computer algorithm that “solves” some abstract decision
problem actually takes an encoding of a problem instance as input. The
size of an instance i is just the length of its string, which we denote by | i|.
We call a problem whose instance set is the set of binary strings a
concrete problem. We say that an algorithm solves a concrete problem in O( T ( n)) time if, when it is provided a problem instance i of length n =
| i|, the algorithm can produce the solution in O( T ( n)) time. 4 A concrete problem is polynomial-time solvable, therefore, if there exists an
algorithm to solve it in O(n^k) time for some constant k.
We can now formally define the complexity class P as the set of
concrete decision problems that are polynomial-time solvable.
Encodings map abstract problems to concrete problems. Given an
abstract decision problem Q mapping an instance set I to {0, 1}, an encoding e : I → {0, 1}* can induce a related concrete decision problem,
which we denote by e( Q). 5 If the solution to an abstract-problem instance i ∈ I is Q( i) ∈ {0, 1}, then the solution to the concrete-problem instance e( i) ∈ {0, 1}* is also Q( i). As a technicality, some binary strings might represent no meaningful abstract-problem instance. For
convenience, assume that any such string maps arbitrarily to 0. Thus,
the concrete problem produces the same solutions as the abstract
problem on binary-string instances that represent the encodings of
abstract-problem instances.
We would like to extend the definition of polynomial-time solvability
from concrete problems to abstract problems by using encodings as the
bridge, ideally with the definition independent of any particular
encoding. That is, the efficiency of solving a problem should not depend
on how the problem is encoded. Unfortunately, it depends quite heavily
on the encoding. For example, suppose that the sole input to an
algorithm is an integer k, and suppose that the running time of the algorithm is Θ( k). If the integer k is provided in unary—a string of k 1s
—then the running time of the algorithm is O(n) on length-n inputs, which is polynomial time. If the input k is provided using the more natural binary representation, however, then the input length is n = ⌊lg k⌋ + 1, so the size of the unary encoding is exponential in the size of the binary encoding.
binary encoding. With the binary representation, the running time of
the algorithm is Θ(k) = Θ(2^n), which is exponential in the size of the input. Thus, depending on the encoding, the algorithm runs in either
polynomial or superpolynomial time.
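The size gap is easy to see numerically; this snippet just compares the two encoding lengths for k = 1000:

```python
k = 1000

unary = "1" * k           # unary encoding: length k
binary = bin(k)[2:]       # binary encoding: length ⌊lg k⌋ + 1

# A Θ(k)-time algorithm is linear in len(unary) but
# exponential, Θ(2^n), in n = len(binary).
print(len(unary), len(binary))   # prints "1000 10"
```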
The encoding of an abstract problem matters quite a bit to how we
understand polynomial time. We cannot really talk about solving an
abstract problem without first specifying an encoding. Nevertheless, in
practice, if we rule out “expensive” encodings such as unary ones, the
actual encoding of a problem makes little difference to whether the
problem can be solved in polynomial time. For example, representing
integers in base 3 instead of binary has no effect on whether a problem
is solvable in polynomial time, since we can convert an integer
represented in base 3 to an integer represented in base 2 in polynomial
time.
We say that a function f : {0, 1}* → {0, 1}* is polynomial-time computable if there exists a polynomial-time algorithm A that, given any input x ∈ {0, 1}*, produces as output f(x). For some set I of problem instances, we say that two encodings e1 and e2 are polynomially related if there exist two polynomial-time computable functions f12 and f21 such that for any i ∈ I, we have f12(e1(i)) = e2(i) and f21(e2(i)) = e1(i).6
That is, a polynomial-time algorithm can compute the encoding e2(i) from the encoding e1(i), and vice versa. If two encodings e1 and e2 of an abstract problem are polynomially related, whether the problem is
polynomial-time solvable or not is independent of which encoding we
use, as the following lemma shows.
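For example, the base-3 and binary encodings of the natural numbers are polynomially related, because each conversion below runs in time polynomial in the input length (a sketch; the function names are ours):

```python
def base3_to_base2(s: str) -> str:
    """Convert a base-3 numeral to binary: polynomial time in
    len(s), since the value fits in O(len(s)) bits."""
    n = 0
    for digit in s:
        n = 3 * n + int(digit)
    return bin(n)[2:]

def base2_to_base3(s: str) -> str:
    """The inverse direction, also polynomial time."""
    n = int(s, 2)
    if n == 0:
        return "0"
    digits = []
    while n:
        digits.append(str(n % 3))
        n //= 3
    return "".join(reversed(digits))
```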
Lemma 34.1
Let Q be an abstract decision problem on an instance set I, and let e1 and e2 be polynomially related encodings on I. Then, e1(Q) ∈ P if and only if e2(Q) ∈ P.
Proof We need only prove the forward direction, since the backward
direction is symmetric. Suppose, therefore, that e1(Q) can be solved in O(n^k) time for some constant k. Furthermore, suppose that for any problem instance i, the encoding e1(i) can be computed from the encoding e2(i) in O(n^c) time for some constant c, where n = |e2(i)|. To solve problem e2(Q) on input e2(i), first compute e1(i) and then run the algorithm for e1(Q) on e1(i). How long does this procedure take? Converting encodings takes O(n^c) time, and therefore |e1(i)| = O(n^c), since the output of a serial computer cannot be longer than its running time. Solving the problem on e1(i) takes O(|e1(i)|^k) = O(n^ck) time, which is polynomial since both c and k are constants.
▪
Thus, whether an abstract problem has its instances encoded in binary or base 3 does not affect its “complexity,” that is, whether it is
polynomial-time solvable or not. If instances are encoded in unary,
however, its complexity may change. In order to be able to converse in
an encoding-independent fashion, we generally assume that problem
instances are encoded in any reasonable, concise fashion, unless we
specifically say otherwise. To be precise, we assume that the encoding of
an integer is polynomially related to its binary representation, and that
the encoding of a finite set is polynomially related to its encoding as a
list of its elements, enclosed in braces and separated by commas. (ASCII
is one such encoding scheme.) With such a “standard” encoding in
hand, we can derive reasonable encodings of other mathematical
objects, such as tuples, graphs, and formulas. To denote the standard
encoding of an object, we enclose the object in angle brackets. Thus, 〈 G〉
denotes the standard encoding of a graph G.
As long as the encoding implicitly used is polynomially related to
this standard encoding, we can talk directly about abstract problems
without reference to any particular encoding, knowing that the choice
of encoding has no effect on whether the abstract problem is
polynomial-time solvable. From now on, we will generally assume that
all problem instances are binary strings encoded using the standard
encoding, unless we explicitly specify the contrary. We’ll also typically
neglect the distinction between abstract and concrete problems. You
should watch out for problems that arise in practice, however, in which a
standard encoding is not obvious and the encoding does make a
difference.
A formal-language framework
By focusing on decision problems, we can take advantage of the
machinery of formal-language theory. Let’s review some definitions
from that theory. An alphabet Σ is a finite set of symbols. A language L
over Σ is any set of strings made up of symbols from Σ. For example, if
Σ = {0, 1}, the set L = {10, 11, 101, 111, 1011, 1101, 10001,…} is the
language of binary representations of prime numbers. We denote the
empty string by ε, the empty language by Ø, and the language of all
strings over Σ by Σ*. For example, if Σ = {0, 1}, then Σ* = { ε, 0, 1, 00, 01, 10, 11, 000,…} is the set of all binary strings. Every language L over
Σ is a subset of Σ*.
Languages support a variety of operations. Set-theoretic operations,
such as union and intersection, follow directly from the set-theoretic definitions. We define the complement of a language L by L̄ = Σ* − L.
The concatenation L1L2 of two languages L1 and L2 is the language L = {x1x2 : x1 ∈ L1 and x2 ∈ L2}.
The closure or Kleene star of a language L is the language
L* = {ε} ∪ L ∪ L^2 ∪ L^3 ∪ …,
where L^k is the language obtained by concatenating L to itself k times.
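For finite languages these operations can be experimented with directly; the Kleene star is infinite, so this sketch (ours) truncates it at a maximum string length:

```python
def concat(L1, L2):
    """Concatenation L1 L2 = {x1 + x2 : x1 in L1, x2 in L2}."""
    return {x1 + x2 for x1 in L1 for x2 in L2}

def kleene_star_upto(L, max_len):
    """Finite approximation of L*: all strings of length at most
    max_len in {ε} ∪ L ∪ L^2 ∪ L^3 ∪ …."""
    result = {""}
    frontier = {""}
    while frontier:
        frontier = {x + y for x in frontier for y in L
                    if len(x + y) <= max_len} - result
        result |= frontier
    return result
```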
From the point of view of language theory, the set of instances for
any decision problem Q is simply the set Σ*, where Σ = {0, 1}. Since Q is entirely characterized by those problem instances that produce a 1 (yes)
answer, we can view Q as a language L over Σ = {0, 1}, where
L = { x ∈ Σ* : Q( x) = 1}.
For example, the decision problem PATH has the corresponding
language
PATH = {〈G, u, v, k〉 : G = (V, E) is an undirected graph, u, v ∈ V, k ≥ 0 is an integer, and G contains a path from u to v with at most k edges}.
(Where convenient, we’ll sometimes use the same name—PATH in this
case—to refer to both a decision problem and its corresponding
language.)
The formal-language framework allows us to express concisely the
relation between decision problems and algorithms that solve them. We
say that an algorithm A accepts a string x ∈ {0, 1}* if, given input x,
the algorithm’s output A( x) is 1. The language accepted by an algorithm A is the set of strings L = { x ∈ {0, 1}* : A( x) = 1}, that is, the set of strings that the algorithm accepts. An algorithm A rejects a string x if A( x) = 0.
Even if language L is accepted by an algorithm A, the algorithm does not necessarily reject a string x ∉ L provided as input to it. For example, the algorithm might loop forever. A language L is decided by
an algorithm A if every binary string in L is accepted by A and every binary string not in L is rejected by A. A language L is accepted in polynomial time by an algorithm A if it is accepted by A and if in addition there exists a constant k such that for any length- n string x ∈
L, algorithm A accepts x in O(n^k) time. A language L is decided in polynomial time by an algorithm A if there exists a constant k such that for any length-n string x ∈ {0, 1}*, the algorithm correctly decides whether x ∈ L in O(n^k) time. Thus, to accept a language, an algorithm need only produce an answer when provided a string in L, but to decide
a language, it must correctly accept or reject every string in {0, 1}*.
As an example, the language PATH can be accepted in polynomial
time. One polynomial-time accepting algorithm verifies that G encodes
an undirected graph, verifies that u and v are vertices in G, uses breadth-first search to compute a path from u to v in G with the fewest edges, and then compares the number of edges on the path obtained with k. If
G encodes an undirected graph and the path found from u to v has at most k edges, the algorithm outputs 1 and halts. Otherwise, the
algorithm runs forever. This algorithm does not decide PATH, however,
since it does not explicitly output 0 for instances in which a shortest
path has more than k edges. A decision algorithm for PATH must
explicitly reject binary strings that do not belong to PATH. For a
decision problem such as PATH, such a decision algorithm is
straightforward to design: instead of running forever when there is no path from u to v with at most k edges, it outputs 0 and halts. (It must also output 0 and halt if the input encoding is faulty.) For other
problems, such as Turing’s Halting Problem, there exists an accepting
algorithm, but no decision algorithm exists.
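A decider for PATH along these lines might look as follows in Python. The tuple representation of the instance is our own stand-in for the standard encoding; the point is that the procedure always halts with 1 or 0, including on faulty encodings:

```python
from collections import deque

def decide_path(instance):
    """A decider for PATH: always halts and outputs 1 or 0.
    The instance is assumed (our choice) to be a tuple
    (adj, u, v, k) with adj an adjacency dict."""
    try:
        adj, u, v, k = instance
        if u not in adj or v not in adj or k < 0:
            return 0
    except (TypeError, ValueError):
        return 0          # faulty encoding: reject explicitly
    # BFS finds the fewest-edge distances from u.
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in adj.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return 1 if v in dist and dist[v] <= k else 0
```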
We can informally define a complexity class as a set of languages, membership in which is determined by a complexity measure, such as
running time, of an algorithm that determines whether a given string x
belongs to language L. The actual definition of a complexity class is somewhat more technical.7
Using this language-theoretic framework, we can provide an
alternative definition of the complexity class P:
P = {L ⊆ {0, 1}* : there exists an algorithm A that decides L in polynomial time}.
In fact, as the following theorem shows, P is also the class of languages
that can be accepted in polynomial time.
Theorem 34.2
P = { L : L is accepted by a polynomial-time algorithm}.
Proof Because the class of languages decided by polynomial-time
algorithms is a subset of the class of languages accepted by polynomial-
time algorithms, we need only show that if L is accepted by a
polynomial-time algorithm, it is decided by a polynomial-time
algorithm. Let L be the language accepted by some polynomial-time
algorithm A. We use a classic “simulation” argument to construct
another polynomial-time algorithm A′ that decides L. Because A accepts L in O(n^k) time for some constant k, there also exists a constant c such that A accepts L in at most cn^k steps. For any input string x, the algorithm A′ simulates cn^k steps of A. After simulating cn^k steps, algorithm A′ inspects the behavior of A. If A has accepted x, then A′
accepts x by outputting a 1. If A has not accepted x, then A′ rejects x by outputting a 0. The overhead of A′ simulating A does not increase the
running time by more than a polynomial factor, and thus A′ is a
polynomial-time algorithm that decides L.
▪
The proof of Theorem 34.2 is nonconstructive. For a given language
L ∈ P, we may not actually know a bound on the running time for the
algorithm A that accepts L. Nevertheless, we know that such a bound exists, and therefore, that an algorithm A′ exists that can check the bound, even though we may not be able to find the algorithm A′ easily.
Exercises
34.1-1
Define the optimization problem LONGEST-PATH-LENGTH as the
relation that associates each instance of an undirected graph and two
vertices with the number of edges in a longest simple path between the
two vertices. Define the decision problem LONGEST-PATH = {〈G, u, v, k〉 : G = (V, E) is an undirected graph, u, v ∈ V, k ≥ 0 is an integer, and there exists a simple path from u to v in G consisting of at least k edges}.
Show that the optimization problem LONGEST-PATH-LENGTH can
be solved in polynomial time if and only if LONGEST-PATH ∈ P.
34.1-2
Give a formal definition for the problem of finding the longest simple