How Learning Differs from Pure Optimization

  • Optimization algorithms used for training deep models differ from traditional optimization algorithms in several ways.

  • Machine learning usually acts indirectly. In most machine learning scenarios, we care about some performance measure $P$ that is defined with respect to the test set and may also be intractable. We therefore optimize $P$ only indirectly: we reduce a different cost function $J(\boldsymbol{\theta})$ in the hope that doing so will improve $P$. This is in contrast to pure optimization, where minimizing $J$ is a goal in and of itself. Optimization algorithms for training deep models also typically include some specialization on the specific structure of machine learning objective functions.

  • Typically, the cost function can be written as an average over the training set, such as
    $$J(\boldsymbol{\theta}) = \mathbb{E}_{(\boldsymbol{x}, \mathrm{y}) \sim \hat{p}_{\text{data}}} L(f(\boldsymbol{x}; \boldsymbol{\theta}), y)$$
    where $L$ is the per-example loss function, $f(\boldsymbol{x}; \boldsymbol{\theta})$ is the predicted output when the input is $\boldsymbol{x}$, and $\hat{p}_{\text{data}}$ is the empirical distribution. In the supervised learning case, $y$ is the target output.
    Throughout this chapter, we develop the unregularized supervised case, where the arguments to $L$ are $f(\boldsymbol{x}; \boldsymbol{\theta})$ and $y$.
    However, it is trivial to extend this development, for example, to include $\boldsymbol{\theta}$ or $\boldsymbol{x}$ as arguments, or to exclude $y$ as an argument, in order to develop various forms of regularization or unsupervised learning.

  • The equation above defines an objective function with respect to the training set. We would usually prefer to minimize the corresponding objective function where the expectation is taken across the data-generating distribution $p_{\text{data}}$ rather than just over the finite training set:
    $$J^{*}(\boldsymbol{\theta}) = \mathbb{E}_{(\boldsymbol{x}, \mathrm{y}) \sim p_{\text{data}}} L(f(\boldsymbol{x}; \boldsymbol{\theta}), y)$$

Empirical Risk Minimization

  • The goal of a machine learning algorithm is to reduce the expected generalization error given by the equation above. This quantity is known as the risk. We emphasize here that the expectation is taken over the true underlying distribution $p_{\text{data}}$.
    If we knew the true distribution $p_{\text{data}}(\boldsymbol{x}, y)$, risk minimization would be an optimization task solvable by an optimization algorithm.
    However, when we do not know $p_{\text{data}}(\boldsymbol{x}, y)$ but only have a training set of samples, we have a machine learning problem.
    The simplest way to convert a machine learning problem back into an optimization problem is to minimize the expected loss on the training set. This means replacing the true distribution $p(\boldsymbol{x}, y)$ with the empirical distribution $\hat{p}(\boldsymbol{x}, y)$ defined by the training set. We now minimize the empirical risk
    $$\mathbb{E}_{(\boldsymbol{x}, \mathrm{y}) \sim \hat{p}_{\text{data}}(\boldsymbol{x}, y)}\left[L(f(\boldsymbol{x}; \boldsymbol{\theta}), y)\right] = \frac{1}{m} \sum_{i=1}^{m} L\left(f\left(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}\right), y^{(i)}\right)$$
    where $m$ is the number of training examples. The training process based on minimizing this average training error is known as empirical risk minimization. In this setting, machine learning is still very similar to straightforward optimization.
    Rather than optimizing the risk directly, we optimize the empirical risk and hope that the risk decreases significantly as well. A variety of theoretical results establish conditions under which the true risk can be expected to decrease by various amounts.
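    As a concrete illustration, here is a minimal NumPy sketch of the empirical risk for a hypothetical linear model with squared-error loss (all names, such as `empirical_risk`, are illustrative and not from the text):

```python
import numpy as np

def squared_error(y_pred, y):
    """Per-example loss L(f(x; theta), y)."""
    return 0.5 * (y_pred - y) ** 2

def empirical_risk(theta, X, y):
    """Average of the per-example loss over the m training examples."""
    y_pred = X @ theta            # f(x; theta) for a toy linear model
    return np.mean(squared_error(y_pred, y))

# Toy data: m = 5 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
theta = rng.normal(size=3)
y = rng.normal(size=5)
print(empirical_risk(theta, X, y))
```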

  • However, empirical risk minimization is prone to overfitting. Models with high capacity can simply memorize the training set. In many cases, empirical risk minimization is not really feasible.
    The most effective modern optimization algorithms are based on gradient descent, but many useful loss functions, such as 0-1 loss, have no useful derivatives (the derivative is either zero or undefined everywhere).
    These two problems mean that, in the context of deep learning, we rarely use empirical risk minimization. Instead, we must use a slightly different approach, in which the quantity that we actually optimize is even more different from the quantity that we truly want to optimize.

Surrogate Loss Functions and Early Stopping

  • Sometimes, the loss function we actually care about (say classification error) is not one that can be optimized efficiently. For example, exactly minimizing expected 0-1 loss is typically intractable (exponential in the input dimension), even for a linear classifier (Marcotte and Savard, 1992). In such situations, one typically optimizes a surrogate loss function instead, which acts as a proxy but has advantages.
    For example, the negative log-likelihood of the correct class is typically used as a surrogate for the 0-1 loss. The negative log-likelihood allows the model to estimate the conditional probability of the classes, given the input, and if the model can do that well, then it can pick the classes that yield the least classification error in expectation.
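    As a concrete illustration, a minimal NumPy sketch (hypothetical helper names; a logistic model is assumed purely for illustration) contrasting the piecewise-constant 0-1 loss with its smooth negative log-likelihood surrogate:

```python
import numpy as np

def zero_one_loss(scores, y):
    """0-1 loss: piecewise constant, so its derivative is zero or undefined everywhere."""
    return np.mean((scores > 0).astype(int) != y)

def nll_surrogate(scores, y):
    """Negative log-likelihood of the correct class under a logistic model.
    Smooth in the scores, so it provides useful gradients."""
    p = 1.0 / (1.0 + np.exp(-scores))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

scores = np.array([2.0, -0.5, 0.3])   # model outputs f(x; theta)
y = np.array([1, 0, 0])
print(zero_one_loss(scores, y), nll_surrogate(scores, y))
```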

  • In some cases, a surrogate loss function actually results in being able to learn more. For example, the test set 0-1 loss often continues to decrease for a long time after the training set 0-1 loss has reached zero, when training using the log-likelihood surrogate. This is because even when the expected 0-1 loss is zero, one can improve the robustness of the classifier by further pushing the classes apart from each other, obtaining a more confident and reliable classifier, thus extracting more information from the training data than would have been possible by simply minimizing the average 0-1 loss on the training set.

  • A very important difference between optimization in general and optimization as we use it for training algorithms is that training algorithms do not usually halt at a local minimum.
    Instead, a machine learning algorithm usually minimizes a surrogate loss function but halts when a convergence criterion based on early stopping (section 7.8) is satisfied. Typically the early stopping criterion is based on the true underlying loss function, such as 0-1 loss measured on a validation set, and is designed to cause the algorithm to halt whenever overfitting begins to occur. Training often halts while the surrogate loss function still has large derivatives, which is very different from the pure optimization setting, where an optimization algorithm is considered to have converged when the gradient becomes very small.
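    A minimal sketch of this halting logic, assuming hypothetical `train_one_epoch` and `validation_error` callables and a `model` object that supports `.copy()`: train on the surrogate loss, but stop when the validation metric (e.g. 0-1 loss) stops improving.

```python
def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=200, patience=10):
    """Stop when the validation error has not improved for `patience` epochs,
    even if the surrogate training loss still has large gradients."""
    best_error, best_state, epochs_since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)              # minimizes the surrogate loss
        err = validation_error(model)       # true loss of interest (e.g. 0-1 loss)
        if err < best_error:
            best_error, best_state, epochs_since_best = err, model.copy(), 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:   # overfitting appears to begin
            break
    return best_state if best_state is not None else model
```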

Batch and Minibatch Algorithms

  • One aspect of machine learning algorithms that separates them from general optimization algorithms is that the objective function usually decomposes as a sum over the training examples. Optimization algorithms for machine learning typically compute each update to the parameters based on an expected value of the cost function estimated using only a subset of the terms of the full cost function. For example, maximum likelihood estimation problems, when viewed in log space, decompose into a sum over each example:
    $$\boldsymbol{\theta}_{\mathrm{ML}} = \underset{\boldsymbol{\theta}}{\arg\max} \sum_{i=1}^{m} \log p_{\text{model}}\left(\boldsymbol{x}^{(i)}, y^{(i)}; \boldsymbol{\theta}\right)$$

  • Maximizing this sum is equivalent to maximizing the expectation over the empirical distribution defined by the training set:
    $$J(\boldsymbol{\theta}) = \mathbb{E}_{(\boldsymbol{x}, \mathrm{y}) \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(\boldsymbol{x}, y; \boldsymbol{\theta})$$

  • Most of the properties of the objective function $J$ used by most of our optimization algorithms are also expectations over the training set. For example, the most commonly used property is the gradient:
    $$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \mathbb{E}_{(\boldsymbol{x}, \mathrm{y}) \sim \hat{p}_{\text{data}}} \nabla_{\boldsymbol{\theta}} \log p_{\text{model}}(\boldsymbol{x}, y; \boldsymbol{\theta})$$

  • Computing this expectation exactly is very expensive because it requires evaluating the model on every example in the entire dataset.
    In practice, we can compute these expectations by randomly sampling a small number of examples from the dataset, then taking the average over only those examples.

  • Recall that the standard error of the mean estimated from $n$ samples is given by
    $$\operatorname{SE}(\hat{\mu}_{n}) = \sqrt{\operatorname{Var}\left[\frac{1}{n} \sum_{i=1}^{n} x^{(i)}\right]} = \frac{\sigma}{\sqrt{n}}$$
    where $\sigma$ is the true standard deviation of the value of the samples.
    The denominator of $\sqrt{n}$ shows that there are less than linear returns to using more examples to estimate the gradient.
    Compare two hypothetical estimates of the gradient, one based on 100 examples and another based on 10,000 examples. The latter requires 100 times more computation than the former, but reduces the standard error of the mean only by a factor of 10. Most optimization algorithms converge much faster (in terms of total computation, not in terms of number of updates) if they are allowed to rapidly compute approximate estimates of the gradient rather than slowly computing the exact gradient.
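    A quick numeric check of the $\sigma/\sqrt{n}$ scaling, comparing the noise in a mean estimated from 100 versus 10,000 samples (one synthetic scalar "per-example gradient" per example, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
per_example_grads = rng.normal(loc=1.0, scale=sigma, size=1_000_000)

for n in (100, 10_000):
    # Standard deviation of the mean of n samples, estimated over many trials.
    trials = np.array([per_example_grads[rng.integers(0, per_example_grads.size, n)].mean()
                       for _ in range(2000)])
    print(n, trials.std(), sigma / np.sqrt(n))
# 100x more computation buys only ~10x less standard error of the mean.
```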

  • Another consideration motivating statistical estimation of the gradient from a small number of samples is redundancy in the training set. In the worst case, all $m$ samples in the training set could be identical copies of each other. A sampling-based estimate of the gradient could compute the correct gradient with a single sample, using $m$ times less computation than the naive approach. In practice, we are unlikely to truly encounter this worst-case situation, but we may find large numbers of examples that all make very similar contributions to the gradient.

  • Optimization algorithms that use the entire training set are called batch or deterministic gradient methods, because they process all of the training examples simultaneously in a large batch. This terminology can be somewhat confusing because the word “batch” is also often used to describe the minibatch used by minibatch stochastic gradient descent. Typically the term “batch gradient descent” implies the use of the full training set, while the use of the term “batch” to describe a group of examples does not. For example, it is very common to use the term “batch size” to describe the size of a minibatch.

  • Optimization algorithms that use only a single example at a time are sometimes called stochastic or sometimes online methods. The term online is usually reserved for the case where the examples are drawn from a stream of continually created examples rather than from a fixed-size training set over which several passes are made.

  • Most algorithms used for deep learning fall somewhere in between, using more than one but fewer than all of the training examples. These were traditionally called minibatch or minibatch stochastic methods, and it is now common to simply call them stochastic methods.

  • The canonical example of a stochastic method is stochastic gradient descent, presented in detail in section 8.3.1. Minibatch sizes are generally driven by the following factors:

  1. Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
  2. Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.
  3. If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
  4. Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
  5. Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability due to the high variance in the estimate of the gradient. The total runtime can be very high due to the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.
  • Different kinds of algorithms use different kinds of information from the minibatch in different ways.
    Some algorithms are more sensitive to sampling error than others, either because they use information that is difficult to estimate accurately with few samples, or because they use information in ways that amplify sampling errors more.
    Methods that compute updates based only on the gradient $\boldsymbol{g}$ are usually relatively robust and can handle smaller batch sizes, like 100.
    Second-order methods, which also use the Hessian matrix $\boldsymbol{H}$ and compute updates such as $\boldsymbol{H}^{-1}\boldsymbol{g}$, typically require much larger batch sizes, like 10,000. These large batch sizes are required to minimize fluctuations in the estimates of $\boldsymbol{H}^{-1}\boldsymbol{g}$. Suppose that $\boldsymbol{H}$ is estimated perfectly but has a poor condition number. Multiplication by $\boldsymbol{H}$ or its inverse amplifies pre-existing errors, in this case, estimation errors in $\boldsymbol{g}$. Very small changes in the estimate of $\boldsymbol{g}$ can thus cause large changes in the update $\boldsymbol{H}^{-1}\boldsymbol{g}$, even if $\boldsymbol{H}$ were estimated perfectly. Of course, $\boldsymbol{H}$ will be estimated only approximately, so the update $\boldsymbol{H}^{-1}\boldsymbol{g}$ will contain even more error than we would predict from applying a poorly conditioned operation to the estimate of $\boldsymbol{g}$.
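    A small sketch of this amplification effect with toy matrices (values chosen only for illustration): a poorly conditioned $\boldsymbol{H}$ turns a tiny perturbation of $\boldsymbol{g}$, standing in for minibatch sampling error, into a large change in $\boldsymbol{H}^{-1}\boldsymbol{g}$.

```python
import numpy as np

H = np.diag([1.0, 1e-4])               # poorly conditioned Hessian (condition number 1e4)
g = np.array([1.0, 1.0])               # "true" gradient
g_noisy = g + np.array([0.0, 0.01])    # small sampling error in the gradient estimate

update_true = np.linalg.solve(H, g)    # H^{-1} g
update_noisy = np.linalg.solve(H, g_noisy)
print(np.linalg.norm(g_noisy - g))               # gradient error: 0.01
print(np.linalg.norm(update_noisy - update_true))  # update error: ~100, amplified by ~1e4
```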

  • It is also crucial that the minibatches be selected randomly. Computing an unbiased estimate of the expected gradient from a set of samples requires that those samples be independent. We also wish for two subsequent gradient estimates to be independent from each other, so two subsequent minibatches of examples should also be independent from each other. Many datasets are most naturally arranged in a way where successive examples are highly correlated.
    For very large datasets, for example datasets containing billions of examples in a data center, it can be impractical to sample examples truly uniformly at random every time we want to construct a minibatch. Fortunately, in practice it is usually sufficient to shuffle the order of the dataset once and then store it in shuffled fashion. This will impose a fixed set of possible minibatches of consecutive examples that all models trained thereafter will use, and each individual model will be forced to reuse this ordering every time it passes through the training data. However, this deviation from true random selection does not seem to have a significant detrimental effect. Failing to ever shuffle the examples in any way can seriously reduce the effectiveness of the algorithm.
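    A minimal sketch of this practical compromise (toy arrays; names are illustrative): shuffle the dataset once, store it in that order, and then repeatedly iterate over the same consecutive minibatches.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.integers(0, 2, size=1000)

# Shuffle once and keep the dataset stored in this fixed shuffled order.
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]

batch_size = 128
for epoch in range(5):
    # Every pass reuses the same fixed set of consecutive minibatches.
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        # compute the minibatch gradient on (xb, yb) and update the parameters here
```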

  • Many optimization problems in machine learning decompose over examples well enough that we can compute entire separate updates over different examples in parallel. In other words, we can compute the update that minimizes $J(\boldsymbol{X})$ for one minibatch of examples $\boldsymbol{X}$ at the same time that we compute the update for several other minibatches. Such asynchronous parallel distributed approaches are discussed further in section 12.1.3.
    An interesting motivation for minibatch stochastic gradient descent is that it follows the gradient of the true generalization error
    $$J^{*}(\boldsymbol{\theta}) = \mathbb{E}_{(\boldsymbol{x}, \mathrm{y}) \sim p_{\text{data}}} L(f(\boldsymbol{x}; \boldsymbol{\theta}), y)$$

    so long as no examples are repeated.
    Most implementations of minibatch stochastic gradient descent shuffle the dataset once and then pass through it multiple times.
    On the first pass, each minibatch is used to compute an unbiased estimate of the true generalization error.
    On the second pass, the estimate becomes biased because it is formed by re-sampling values that have already been used, rather than obtaining new fair samples from the data generating distribution.

  • The fact that stochastic gradient descent minimizes generalization error is easiest to see in the online learning case, where examples or minibatches are drawn from a stream of data. In other words, instead of receiving a fixed-size training set, the learner is similar to a living being who sees a new example at each instant, with every example $(\boldsymbol{x}, y)$ coming from the data-generating distribution $p_{\text{data}}(\boldsymbol{x}, y)$. In this scenario, examples are never repeated; every experience is a fair sample from $p_{\text{data}}$.

  • The equivalence is easiest to derive when both $\boldsymbol{x}$ and $y$ are discrete. In this case, the generalization error can be written as a sum
    $$J^{*}(\boldsymbol{\theta}) = \sum_{\boldsymbol{x}} \sum_{y} p_{\text{data}}(\boldsymbol{x}, y)\, L(f(\boldsymbol{x}; \boldsymbol{\theta}), y)$$
    with the exact gradient
    $$\boldsymbol{g} = \nabla_{\boldsymbol{\theta}} J^{*}(\boldsymbol{\theta}) = \sum_{\boldsymbol{x}} \sum_{y} p_{\text{data}}(\boldsymbol{x}, y)\, \nabla_{\boldsymbol{\theta}} L(f(\boldsymbol{x}; \boldsymbol{\theta}), y)$$

  • We observe now that this holds for other functions $L$ besides the likelihood. A similar result can be derived when $\boldsymbol{x}$ and $y$ are continuous, under mild assumptions regarding $p_{\text{data}}$ and $L$.

  • Hence, we can obtain an unbiased estimator of the exact gradient of the generalization error by sampling a minibatch of examples $\{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(m)}\}$ with corresponding targets $y^{(i)}$ from the data-generating distribution $p_{\text{data}}$, and computing the gradient of the loss with respect to the parameters for that minibatch:
    $$\hat{\boldsymbol{g}} = \frac{1}{m} \nabla_{\boldsymbol{\theta}} \sum_{i} L\left(f\left(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}\right), y^{(i)}\right)$$

  • Updating $\boldsymbol{\theta}$ in the direction of $\hat{\boldsymbol{g}}$ performs SGD on the generalization error. Of course, this interpretation only applies when examples are not reused. Nonetheless, it is usually best to make several passes through the training set, unless the training set is extremely large.
    When multiple such epochs are used, only the first epoch follows the unbiased gradient of the generalization error, but of course, the additional epochs usually provide enough benefit due to decreased training error to offset the harm they cause by increasing the gap between training error and test error.

  • With some datasets growing rapidly in size, faster than computing power, it is becoming more common for machine learning applications to use each training example only once or even to make an incomplete pass through the training set. When using an extremely large training set, overfitting is not an issue, so underfitting and computational efficiency become the predominant concerns. See also Bottou and Bousquet (2008) for a discussion of the effect of computational bottlenecks on generalization error, as the number of training examples grows.

Challenges in Neural Network Optimization

  • Traditionally, machine learning has avoided the difficulty of general optimization by carefully designing the objective function and constraints to ensure that the optimization problem is convex.
    When training neural networks, we must confront the general non-convex case. Even convex optimization is not without its complications. In this section, we summarize several of the most prominent challenges involved in optimization for training deep models.

Ill-Conditioning

  • Some challenges arise even when optimizing convex functions. Of these, the most prominent is ill-conditioning of the Hessian matrix $\boldsymbol{H}$. This is a very general problem in most numerical optimization, convex or otherwise, and is described in more detail in section 4.3.1.

  • The ill-conditioning problem is generally believed to be present in neural network training problems. Ill-conditioning can manifest by causing SGD to get “stuck” in the sense that even very small steps increase the cost function.

  • Recall from the second-order Taylor series expansion of the cost function,
    $$f\left(\boldsymbol{x}^{(0)} - \epsilon \boldsymbol{g}\right) \approx f\left(\boldsymbol{x}^{(0)}\right) - \epsilon \boldsymbol{g}^{\top} \boldsymbol{g} + \frac{1}{2} \epsilon^{2} \boldsymbol{g}^{\top} \boldsymbol{H} \boldsymbol{g},$$
    that a gradient descent step of $-\epsilon \boldsymbol{g}$ will add
    $$\frac{1}{2} \epsilon^{2} \boldsymbol{g}^{\top} \boldsymbol{H} \boldsymbol{g} - \epsilon \boldsymbol{g}^{\top} \boldsymbol{g}$$
    to the cost.
    Ill-conditioning of the gradient becomes a problem when $\frac{1}{2} \epsilon^{2} \boldsymbol{g}^{\top} \boldsymbol{H} \boldsymbol{g}$ exceeds $\epsilon \boldsymbol{g}^{\top} \boldsymbol{g}$.
    To determine whether ill-conditioning is detrimental to a neural network training task, one can monitor the squared gradient norm $\boldsymbol{g}^{\top} \boldsymbol{g}$ and the $\boldsymbol{g}^{\top} \boldsymbol{H} \boldsymbol{g}$ term, as in the small monitoring sketch below.
    In many cases, the gradient norm does not shrink significantly throughout learning, but the $\boldsymbol{g}^{\top} \boldsymbol{H} \boldsymbol{g}$ term grows by more than an order of magnitude. The result is that learning becomes very slow despite the presence of a strong gradient, because the learning rate must be shrunk to compensate for even stronger curvature. Figure 8.1 shows an example of the gradient increasing significantly during the successful training of a neural network.
    [Figure 8.1: the gradient norm can increase significantly during the successful training of a neural network.]
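    The monitoring sketch referenced above: track $\boldsymbol{g}^{\top}\boldsymbol{g}$ and $\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}$ during training using a finite-difference Hessian-vector product, so the full Hessian never has to be formed. The `grad_fn` below and the toy quadratic cost are hypothetical stand-ins for a real network's gradient function.

```python
import numpy as np

def hvp(grad_fn, theta, v, eps=1e-5):
    """Finite-difference Hessian-vector product: H v ~ (g(theta + eps*v) - g(theta - eps*v)) / (2*eps)."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)

# Toy quadratic cost with a poorly conditioned Hessian, standing in for a network's loss.
A = np.diag([100.0, 1.0])
grad_fn = lambda theta: A @ theta

theta = np.array([1.0, 1.0])
g = grad_fn(theta)
gTg = g @ g                          # squared gradient norm
gTHg = g @ hvp(grad_fn, theta, g)    # curvature term from the Taylor expansion
print(gTg, gTHg)
# Ill-conditioning hurts when 0.5 * eps**2 * gTHg exceeds eps * gTg for usable step sizes eps.
```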

  • Though ill-conditioning is present in other settings besides neural network training, some of the techniques used to combat it in other contexts are less applicable to neural networks.
    For example, Newton’s method is an excellent tool for minimizing convex functions with poorly conditioned Hessian matrices, but in the subsequent sections we will argue that Newton’s method requires significant modification before it can be applied to neural networks.

Local Minima

  • One of the most prominent features of a convex optimization problem is that it can be reduced to the problem of finding a local minimum. Any local minimum is guaranteed to be a global minimum. Some convex functions have a flat region at the bottom rather than a single global minimum point, but any point within such a flat region is an acceptable solution.
    When optimizing a convex function, we know that we have reached a good solution if we find a critical point of any kind.
    With non-convex functions, such as neural nets, it is possible to have many local minima. Indeed, nearly any deep model is essentially guaranteed to have an extremely large number of local minima. However, as we will see, this is not necessarily a major problem.

  • Neural networks and any models with multiple equivalently parametrized latent variables all have multiple local minima because of the model identifiability problem.
    A model is said to be identifiable if a sufficiently large training set can rule out all but one setting of the model’s parameters.
    Models with latent variables are often not identifiable because we can obtain equivalent models by exchanging latent variables with each other. For example, we could take a neural network and modify layer 1 by swapping the incoming weight vector for unit $i$ with the incoming weight vector for unit $j$, then doing the same for the outgoing weight vectors. If we have $m$ layers with $n$ units each, then there are $n!^{m}$ ways of arranging the hidden units. This kind of non-identifiability is known as weight space symmetry.
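    A tiny demonstration of weight space symmetry with a toy two-layer network (all names are illustrative): swapping two hidden units' incoming weight vectors together with the matching outgoing weights leaves the network's output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # layer 1: 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)   # layer 2: 4 hidden units -> 1 output

def forward(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2

# Swap hidden units i=0 and j=2: rows of W1 and b1, plus the matching columns of W2.
perm = np.array([2, 1, 0, 3])
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=3)
print(forward(x, W1, b1, W2, b2), forward(x, W1p, b1p, W2p, b2))  # identical outputs
```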

  • These model identifiability issues mean that there can be an extremely large or even uncountably infinite number of local minima in a neural network cost function. However, all of these local minima arising from non-identifiability are equivalent to each other in cost function value. As a result, these local minima are not a problematic form of non-convexity.

  • Local minima can be problematic if they have high cost in comparison to the global minimum. One can construct small neural networks, even without hidden units, that have local minima with higher cost than the global minimum (Sontag and Sussmann, 1989; Brady et al., 1989; Gori and Tesi, 1992). If local minima with high cost are common, this could pose a serious problem for gradient-based optimization algorithms.

  • It remains an open question whether there are many local minima of high cost for networks of practical interest and whether optimization algorithms encounter them. For many years, most practitioners believed that local minima were a common problem plaguing neural network optimization. Today, that does not appear to be the case. The problem remains an active area of research, but experts now suspect that, for sufficiently large neural networks, most local minima have a low cost function value, and that it is not important to find a true global minimum rather than to find a point in parameter space that has low but not minimal cost (Saxe et al., 2013; Dauphin et al., 2014; Goodfellow et al., 2015; Choromanska et al., 2014).

  • Many practitioners attribute nearly all difficulty with neural network optimization to local minima. We encourage practitioners to carefully test for specific problems.
    A test that can rule out local minima as the problem is to plot the norm of the gradient over time.
    If the norm of the gradient does not shrink to insignificant size, the problem is neither local minima nor any other kind of critical point. This kind of negative test can rule out local minima. In high dimensional spaces, it can be very difficult to positively establish that local minima are the problem. Many structures other than local minima also have small gradients.

Plateaus, Saddle Points and Other Flat Regions

  • For many high-dimensional non-convex functions, local minima (and maxima) are in fact rare compared to another kind of point with zero gradient: a saddle point.
    Some points around a saddle point have greater cost than the saddle point, while others have a lower cost. At a saddle point, the Hessian matrix has both positive and negative eigenvalues. Points lying along eigenvectors associated with positive eigenvalues have greater cost than the saddle point, while points lying along eigenvectors associated with negative eigenvalues have lower cost. We can think of a saddle point as being a local minimum along one cross-section of the cost function and a local maximum along another cross-section. See the figure below for an illustration.
    [Figure: a saddle point, which is a local minimum along one cross-section and a local maximum along another.]

  • Many classes of random functions exhibit the following behavior: in low dimensional spaces, local minima are common. In higher dimensional spaces, local minima are rare and saddle points are more common.
    For a function $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$ of this type, the expected ratio of the number of saddle points to local minima grows exponentially with $n$.
    To understand the intuition behind this behavior, observe that the Hessian matrix at a local minimum has only positive eigenvalues.
    The Hessian matrix at a saddle point has a mixture of positive and negative eigenvalues.
    Imagine that the sign of each eigenvalue is generated by flipping a coin. In a single dimension, it is easy to obtain a local minimum by tossing a coin and getting heads once. In $n$-dimensional space, it is exponentially unlikely that all $n$ coin tosses will be heads (see the small demo after this list). See Dauphin et al. (2014) for a review of the relevant theoretical work. An amazing property of many random functions is that the eigenvalues of the Hessian become more likely to be positive as we reach regions of lower cost. In our coin tossing analogy, this means we are more likely to have our coin come up heads $n$ times if we are at a critical point with low cost.
    This means that:
    local minima are much more likely to have low cost than high cost.
    Critical points with high cost are far more likely to be saddle points.
    Critical points with extremely high cost are more likely to be local maxima.
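    The small demo referenced above: if each of the $n$ eigenvalue signs at a random critical point behaved like an independent coin flip, the chance that all $n$ are positive (a local minimum rather than a saddle point) falls off as $2^{-n}$. Purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (1, 2, 5, 10, 20):
    # Each trial draws n random eigenvalue signs; count how often all are positive.
    signs = rng.choice([-1, 1], size=(100_000, n))
    frac_all_positive = np.mean(np.all(signs > 0, axis=1))
    print(n, frac_all_positive, 2.0 ** -n)   # empirical fraction vs. 2^{-n}
```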

  • What are the implications of the proliferation of saddle points for training algorithms? For first-order optimization algorithms that use only gradient information, the situation is unclear. The gradient can often become very small near a saddle point.
    On the other hand, gradient descent empirically seems to be able to escape saddle points in many cases. Goodfellow et al. (2015) provided visualizations of several learning trajectories of state-of-the-art neural networks, with an example given in figure 8.2.
    [Figure 8.2: visualization of a learning trajectory from Goodfellow et al. (2015).]

  • For Newton’s method, it is clear that saddle points constitute a problem. Gradient descent is designed to move “downhill” and is not explicitly designed to seek a critical point. Newton’s method, however, is designed to solve for a point where the gradient is zero. Without appropriate modification, it can jump to a saddle point. The proliferation of saddle points in high dimensional spaces presumably explains why second-order methods have not succeeded in replacing gradient descent for neural network training. Dauphin et al. (2014) introduced a saddle-free Newton method for second-order optimization and showed that it improves significantly over the traditional version. Second-order methods remain difficult to scale to large neural networks, but this saddle-free approach holds promise if it could be scaled.

  • There may also be wide, flat regions of constant value. In these locations, the gradient and also the Hessian are all zero. Such degenerate locations pose major problems for all numerical optimization algorithms. In a convex problem, a wide, flat region must consist entirely of global minima, but in a general optimization problem, such a region could correspond to a high value of the objective function.

Cliffs and Exploding Gradients

  • Neural networks with many layers often have extremely steep regions resembling cliffs, as illustrated in the figure below. These result from the multiplication of several large weights together.

  • On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off of the cliff structure altogether.
    [Figure: a cost function with a cliff structure, where a single gradient step can move the parameters extremely far.]

  • The cliff can be dangerous whether we approach it from above or from below, but fortunately its most serious consequences can be avoided using the gradient clipping heuristic described in section 10.11.1.
    The basic idea is to recall that the gradient does not specify the optimal step size, but only the optimal direction within an infinitesimal region. When the traditional gradient descent algorithm proposes to make a very large step, the gradient clipping heuristic intervenes to reduce the step size to be small enough that it is less likely to go outside the region where the gradient indicates the direction of approximately steepest descent. Cliff structures are most common in the cost functions for recurrent neural networks, because such models involve a multiplication of many factors, with one factor for each time step. Long temporal sequences thus incur an extreme amount of multiplication.
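    A minimal sketch of norm-based gradient clipping in this spirit (the threshold value is illustrative, not prescribed by the text): keep the gradient's direction but cap the size of the proposed step.

```python
import numpy as np

def clip_gradient_by_norm(g, threshold=1.0):
    """If ||g|| exceeds the threshold, rescale g so its norm equals the threshold.
    The direction is preserved; only the step size is limited."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

g = np.array([30.0, -40.0])         # a huge gradient at the face of a cliff
print(clip_gradient_by_norm(g))     # same direction, norm reduced to 1.0
```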

Long-Term Dependencies

  • Another difficulty that neural network optimization algorithms must overcome arises when the computational graph becomes extremely deep. Feedforward networks with many layers have such deep computational graphs. So do recurrent networks, described in chapter 10, which construct very deep computational graphs by repeatedly applying the same operation at each time step of a long temporal sequence. Repeated application of the same parameters gives rise to especially pronounced difficulties.

  • For example, suppose that a computational graph contains a path that consists of repeatedly multiplying by a matrix $\boldsymbol{W}$. After $t$ steps, this is equivalent to multiplying by $\boldsymbol{W}^{t}$. Suppose that $\boldsymbol{W}$ has an eigendecomposition $\boldsymbol{W} = \boldsymbol{V} \operatorname{diag}(\boldsymbol{\lambda}) \boldsymbol{V}^{-1}$.
    In this simple case, it is straightforward to see that
    $$\boldsymbol{W}^{t} = \left(\boldsymbol{V} \operatorname{diag}(\boldsymbol{\lambda}) \boldsymbol{V}^{-1}\right)^{t} = \boldsymbol{V} \operatorname{diag}(\boldsymbol{\lambda})^{t} \boldsymbol{V}^{-1}$$

  • Any eigenvalues $\lambda_{i}$ that are not near an absolute value of 1 will either explode if they are greater than 1 in magnitude or vanish if they are less than 1 in magnitude.
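    A quick NumPy check of this eigenvalue argument with a toy diagonalizable matrix (values chosen only for illustration): repeatedly multiplying by $\boldsymbol{W}$ makes components along eigenvalues above 1 in magnitude explode and those below 1 vanish.

```python
import numpy as np

lam = np.array([1.1, 0.9])                    # eigenvalues of W
V = np.array([[1.0, 1.0], [0.0, 1.0]])        # an invertible eigenvector matrix
W = V @ np.diag(lam) @ np.linalg.inv(V)

h = np.array([1.0, 1.0])
for t in (10, 50, 100):
    print(t, np.linalg.matrix_power(W, t) @ h)
# The component along lambda = 1.1 grows like 1.1**t, the one along 0.9 decays like 0.9**t.
```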

  • The vanishing and exploding gradient problem refers to the fact that gradients through such a graph are also scaled according to $\operatorname{diag}(\boldsymbol{\lambda})^{t}$.
    Vanishing gradients make it difficult to know which direction the parameters should move to improve the cost function, while exploding gradients can make learning unstable.
    The cliff structures described earlier that motivate gradient clipping are an example of the exploding gradient phenomenon.

  • Recurrent networks use the same matrix W \boldsymbol{W} W at each time step, but feedforward networks do not, so even very deep feedforward networks can largely avoid the vanishing and exploding gradient problem (Sussillo, 2014).

  • We defer a further discussion of the challenges of training recurrent networks until section 10.7, after recurrent networks have been described in more detail.

Inexact Gradients

  • Most optimization algorithms are designed with the assumption that we have access to the exact gradient or Hessian matrix. In practice, we usually only have a noisy or even biased estimate of these quantities. Nearly every deep learning algorithm relies on sampling-based estimates, at least insofar as using a minibatch of training examples to compute the gradient.

  • In other cases, the objective function we want to minimize is actually intractable. When the objective function is intractable, typically its gradient is intractable as well. In such cases we can only approximate the gradient. These issues mostly arise with the more advanced models in part III. For example, contrastive divergence gives a technique for approximating the gradient of the intractable log-likelihood of a Boltzmann machine.

  • Various neural network optimization algorithms are designed to account for imperfections in the gradient estimate. One can also avoid the problem by choosing a surrogate loss function that is easier to approximate than the true loss.

Poor Correspondence between Local and Global Structure

  • Many of the problems we have discussed so far correspond to properties of the loss function at a single point: it can be difficult to make a single step if $J(\boldsymbol{\theta})$ is poorly conditioned at the current point $\boldsymbol{\theta}$, or if $\boldsymbol{\theta}$ lies on a cliff, or if $\boldsymbol{\theta}$ is a saddle point hiding the opportunity to make progress downhill from the gradient.

  • It is possible to overcome all of these problems at a single point and still perform poorly if the direction that results in the most improvement locally does not point toward distant regions of much lower cost.

  • Goodfellow et al. (2015) argue that much of the runtime of training is due to the length of the trajectory needed to arrive at the solution. The figure below shows that the learning trajectory spends most of its time tracing out a wide arc around a mountain-shaped structure.

  • Much of the research into the difficulties of optimization has focused on whether training arrives at a global minimum, a local minimum, or a saddle point, but in practice neural networks do not arrive at a critical point of any kind. Figure 8.1 shows that neural networks often do not arrive at a region of small gradient. Indeed, such critical points do not even necessarily exist.
    For example, the loss function $-\log p(y \mid \boldsymbol{x}; \boldsymbol{\theta})$ can lack a global minimum point and instead asymptotically approach some value as the model becomes more confident.
    For a classifier with discrete $y$ and $p(y \mid \boldsymbol{x})$ provided by a softmax, the negative log-likelihood can become arbitrarily close to zero if the model is able to correctly classify every example in the training set, but it is impossible to actually reach the value of zero (see the small demonstration below).
    Likewise, a model of real values $p(y \mid \boldsymbol{x}) = \mathcal{N}\left(y; f(\boldsymbol{\theta}), \beta^{-1}\right)$ can have negative log-likelihood that asymptotes to negative infinity: if $f(\boldsymbol{\theta})$ is able to correctly predict the value of all training set $y$ targets, the learning algorithm will increase $\beta$ without bound.
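    The small demonstration referenced above for the softmax case (toy logits, illustrative only): as the correct-class logit grows, the negative log-likelihood keeps shrinking toward zero but never reaches it, so there is no finite minimum point for gradient descent to arrive at.

```python
import numpy as np

def softmax_nll(logits, correct_class):
    z = logits - logits.max()                  # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[correct_class]

for scale in (1.0, 5.0, 10.0, 20.0):
    logits = np.array([scale, 0.0, 0.0])       # push the correct class further up
    print(scale, softmax_nll(logits, correct_class=0))
# The loss approaches 0 asymptotically but never reaches it.
```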

  • Many existing research directions are aimed at finding good initial points for problems that have difficult global structure, rather than developing algorithms that use non-local moves.

  • Gradient descent and essentially all learning algorithms that are effective for training neural networks are based on making small, local moves.
    The previous sections have primarily focused on how the correct direction of these local moves can be difficult to compute. We may be able to compute some properties of the objective function, such as its gradient, only approximately, with bias or variance in our estimate of the correct direction. In these cases, local descent may or may not define a reasonably short path to a valid solution, but we are not actually able to follow the local descent path.
    The objective function may have issues such as poor conditioning or discontinuous gradients, causing the region where the gradient provides a good model of the objective function to be very small.
    In these cases, local descent with steps of size $\epsilon$ may define a reasonably short path to the solution, but we are only able to compute the local descent direction with steps of size $\delta \ll \epsilon$.
    In these cases, local descent may or may not define a path to the solution, but the path contains many steps, so following the path incurs a high computational cost. Sometimes local information provides us no guide, when the function has a wide flat region, or if we manage to land exactly on a critical point (usually this latter scenario only happens to methods that solve explicitly for critical points, such as Newton’s method).
    In these cases, local descent does not define a path to a solution at all.
    In other cases, local moves can be too greedy and lead us along a path that moves downhill but away from any solution, as in figure 8.4, or along an unnecessarily long trajectory to the solution, as in figure 8.2.
    [Figure 8.4: a local descent path that moves downhill but away from any low-cost solution.]

  • Currently, we do not understand which of these problems are most relevant to making neural network optimization difficult, and this is an active area of research.

  • Regardless of which of these problems are most significant, all of them might be avoided if there exists a region of space connected reasonably directly to a solution by a path that local descent can follow, and if we are able to initialize learning within that well-behaved region. This last view suggests research into choosing good initial points for traditional optimization algorithms to use.

Theoretical Limits of Optimization

  • Several theoretical results show that there are limits on the performance of any optimization algorithm we might design for neural networks (Blum and Rivest, 1992; Judd, 1989; Wolpert and MacReady, 1997). Typically these results have little bearing on the use of neural networks in practice.

  • Some theoretical results apply only to the case where the units of a neural network output discrete values. However, most neural network units output smoothly increasing values that make optimization via local search feasible.
    Some theoretical results show that there exist problem classes that are intractable, but it can be difficult to tell whether a particular problem falls into that class.
    Other results show that finding a solution for a network of a given size is intractable, but in practice we can find a solution easily by using a larger network for which many more parameter settings correspond to an acceptable solution.
    Moreover, in the context of neural network training, we usually do not care about finding the exact minimum of a function, but seek only to reduce its value sufficiently to obtain good generalization error.
    Theoretical analysis of whether an optimization algorithm can accomplish this goal is extremely difficult. Developing more realistic bounds on the performance of optimization algorithms therefore remains an important goal for machine learning research.