
Basic Algorithms

  • We have previously introduced the gradient descent algorithm (section 4.3) that follows the gradient of an entire training set downhill. This may be accelerated considerably by using stochastic gradient descent to follow the gradient of randomly selected minibatches downhill, as discussed in section 5.9 and section 8.1.3.

Stochastic Gradient Descent



Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms for machine learning in general and for deep learning in particular. As discussed in section 8.1.3, it is possible to obtain an unbiased estimate of the gradient by taking the average gradient on a minibatch of $m$ examples drawn i.i.d. from the data-generating distribution.



Algorithm 8.1 shows how to follow this estimate of the gradient downhill.
[Algorithm 8.1: the stochastic gradient descent (SGD) update]
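As a rough illustration, the following sketch implements the SGD loop in plain NumPy. The helper names (`grad_minibatch`, `sample_minibatch`, `lr_schedule`) and the toy least-squares usage are assumptions for the example, not part of the original algorithm statement.

```python
import numpy as np

def sgd(theta, grad_minibatch, sample_minibatch, lr_schedule, num_iters):
    """Minimal SGD sketch (hypothetical helper signatures, not from the text)."""
    for k in range(num_iters):
        batch = sample_minibatch()           # m examples drawn i.i.d. from the training set
        g = grad_minibatch(theta, batch)     # unbiased gradient estimate on the minibatch
        theta = theta - lr_schedule(k) * g   # take a step of size epsilon_k downhill
    return theta

# Toy usage: least squares on synthetic data, fixed learning rate.
rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(1000, 5)), np.arange(5.0)
y = X @ true_w + 0.1 * rng.normal(size=1000)
theta = sgd(
    theta=np.zeros(5),
    grad_minibatch=lambda w, idx: 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx),
    sample_minibatch=lambda: rng.integers(0, 1000, size=32),
    lr_schedule=lambda k: 0.01,
    num_iters=2000,
)
```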



A crucial parameter for the SGD algorithm is the learning rate. Previously, we have described SGD as using a fixed learning rate $\epsilon$. In practice, it is necessary to gradually decrease the learning rate over time, so we now denote the learning rate at iteration $k$ as $\epsilon_k$.



This is because the SGD gradient estimator introduces a source of noise (the random sampling of $m$ training examples) that does not vanish even when we arrive at a minimum.
By comparison, the true gradient of the total cost function becomes small and then $\mathbf{0}$ as we approach and reach a minimum using batch gradient descent, so batch gradient descent can use a fixed learning rate.
A sufficient condition to guarantee convergence of SGD is that
$$\sum_{k=1}^{\infty} \epsilon_k = \infty, \quad \text{and} \quad \sum_{k=1}^{\infty} \epsilon_k^2 < \infty.$$
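As a quick check of these conditions, a decay of the form $\epsilon_k = \epsilon_0 / k$ satisfies both: the harmonic series diverges, while the sum of its squares converges.
$$\sum_{k=1}^{\infty} \frac{\epsilon_0}{k} = \infty, \qquad \sum_{k=1}^{\infty} \frac{\epsilon_0^2}{k^2} = \frac{\pi^2}{6}\,\epsilon_0^2 < \infty.$$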



In practice, it is common to decay the learning rate linearly until iteration $\tau$:
$$\epsilon_k = (1-\alpha)\,\epsilon_0 + \alpha\,\epsilon_\tau$$
with $\alpha = \frac{k}{\tau}$. After iteration $\tau$, it is common to leave $\epsilon$ constant. The learning rate may be chosen by trial and error, but it is usually best to choose it by monitoring learning curves that plot the objective function as a function of time.
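A minimal sketch of this linear schedule; the specific hyperparameter values below are illustrative, not prescribed by the text.

```python
def linear_lr_schedule(k, eps_0=0.1, eps_tau=0.001, tau=1000):
    """Decay the learning rate linearly from eps_0 to eps_tau over tau iterations,
    then hold it constant at eps_tau."""
    if k >= tau:
        return eps_tau
    alpha = k / tau
    return (1 - alpha) * eps_0 + alpha * eps_tau
```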



This is more of an art than a science, and most guidance on this subject should be regarded with some skepticism. When using the linear schedule, the parameters to choose are $\epsilon_0$, $\epsilon_\tau$, and $\tau$.
Usually $\tau$ may be set to the number of iterations required to make a few hundred passes through the training set.
Usually $\epsilon_\tau$ should be set to roughly 1% of the value of $\epsilon_0$.



The main question is how to set $\epsilon_0$.
If it is too large, the learning curve will show violent oscillations, with the cost function often increasing significantly. Gentle oscillations are fine, especially when training with a stochastic cost function such as the cost function arising from the use of dropout.
If the learning rate is too low, learning proceeds slowly, and if the initial learning rate is too low, learning may become stuck with a high cost value.



Typically, the optimal initial learning rate, in terms of total training time and the final cost value, is higher than the learning rate that yields the best performance after the first 100 iterations or so. Therefore, it is usually best to monitor the first several iterations and use a learning rate that is higher than the best-performing learning rate at this time, but not so high that it causes severe instability.



The most important property of SGD and related minibatch or online gradient-based optimization is that computation time per update does not grow with the number of training examples. This allows convergence even when the number of training examples becomes very large. For a large enough dataset, SGD may converge to within some fixed tolerance of its final test set error before it has processed the entire training set.



To study the convergence rate of an optimization algorithm, it is common to measure the excess error $J(\boldsymbol{\theta}) - \min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$, which is the amount by which the current cost function exceeds the minimum possible cost.
When SGD is applied to a convex problem, the excess error is $O\left(\frac{1}{\sqrt{k}}\right)$ after $k$ iterations, while in the strongly convex case it is $O\left(\frac{1}{k}\right)$.
These bounds cannot be improved unless extra conditions are assumed. Batch gradient descent enjoys better convergence rates than stochastic gradient descent in theory.
However, the Cramér-Rao bound (Cramér, 1946; Rao, 1945) states that generalization error cannot decrease faster than $O\left(\frac{1}{k}\right)$.
Bottou and Bousquet (2008) argue that it therefore may not be worthwhile to pursue an optimization algorithm that converges faster than $O\left(\frac{1}{k}\right)$ for machine learning tasks: faster convergence would presumably correspond to overfitting.
Moreover, the asymptotic analysis obscures many advantages that stochastic gradient descent has after a small number of steps. With large datasets, the ability of SGD to make rapid initial progress while evaluating the gradient for only very few examples outweighs its slow asymptotic convergence.



Most of the algorithms described in the remainder of this chapter achieve benefits that matter in practice but are lost in the constant factors obscured by the $O\left(\frac{1}{k}\right)$ asymptotic analysis. One can also trade off the benefits of both batch and stochastic gradient descent by gradually increasing the minibatch size during the course of learning.



Momentum



While stochastic gradient descent remains a very popular optimization strategy, learning with it can sometimes be slow. The method of momentum (Polyak, 1964) is designed to accelerate learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients.



The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction. The effect of momentum is illustrated in the figure below.
[Figure: the effect of momentum on an optimization trajectory]



Formally, the momentum algorithm introduces a variable $\boldsymbol{v}$ that plays the role of velocity: it is the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient. The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle through parameter space, according to Newton's laws of motion.



Momentum in physics is mass times velocity. In the momentum learning algorithm, we assume unit mass, so the velocity vector $\boldsymbol{v}$ may also be regarded as the momentum of the particle. A hyperparameter $\alpha \in [0, 1)$ determines how quickly the contributions of previous gradients exponentially decay. The update rule is given by:
$$\boldsymbol{v} \leftarrow \alpha \boldsymbol{v} - \epsilon \nabla_{\boldsymbol{\theta}}\left(\frac{1}{m} \sum_{i=1}^{m} L\left(\boldsymbol{f}\left(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}\right), \boldsymbol{y}^{(i)}\right)\right),$$
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \boldsymbol{v}.$$
The velocity $\boldsymbol{v}$ accumulates the gradient elements $\nabla_{\boldsymbol{\theta}}\left(\frac{1}{m} \sum_{i=1}^{m} L\left(\boldsymbol{f}\left(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}\right), \boldsymbol{y}^{(i)}\right)\right)$.



The larger $\alpha$ is relative to $\epsilon$, the more previous gradients affect the current direction. The SGD algorithm with momentum is given in algorithm 8.2.
[Algorithm 8.2: SGD with momentum]
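A minimal NumPy sketch of the momentum update, under the same hypothetical helper signatures as the SGD sketch above:

```python
import numpy as np

def sgd_momentum(theta, grad_minibatch, sample_minibatch, eps=0.01, alpha=0.9,
                 num_iters=1000):
    """SGD with momentum: v accumulates an exponentially decaying average of
    past (negative) gradients; theta moves along v."""
    v = np.zeros_like(theta)                 # velocity starts at zero
    for _ in range(num_iters):
        g = grad_minibatch(theta, sample_minibatch())
        v = alpha * v - eps * g              # decay old velocity, add new gradient step
        theta = theta + v                    # move in the direction of the velocity
    return theta
```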



Previously, the size of the step was simply the norm of the gradient multiplied by the learning rate.
Now, the size of the step depends on how large and how aligned a sequence of gradients are.



The step size is largest when many successive gradients point in exactly the same direction. If the momentum algorithm always observes gradient $\boldsymbol{g}$, then it will accelerate in the direction of $-\boldsymbol{g}$, until reaching a terminal velocity where the size of each step is
$$\frac{\epsilon \|\boldsymbol{g}\|}{1-\alpha}.$$
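A quick way to see where this terminal velocity comes from: with a constant gradient $\boldsymbol{g}$, the velocity update forms a geometric series.
$$\boldsymbol{v}_{\infty} = -\epsilon \boldsymbol{g}\left(1 + \alpha + \alpha^{2} + \cdots\right) = -\frac{\epsilon \boldsymbol{g}}{1-\alpha}, \qquad \|\boldsymbol{v}_{\infty}\| = \frac{\epsilon \|\boldsymbol{g}\|}{1-\alpha}.$$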



It is thus helpful to think of the momentum hyperparameter in terms of $\frac{1}{1-\alpha}$. For example, $\alpha = 0.9$ corresponds to multiplying the maximum speed by 10 relative to the gradient descent algorithm.



Common values of $\alpha$ used in practice include 0.5, 0.9, and 0.99. Like the learning rate, $\alpha$ may also be adapted over time. Typically it begins with a small value and is later raised. It is less important to adapt $\alpha$ over time than to shrink $\epsilon$ over time.



Returning to the physical analogy, the position of the particle at any point in time is given by $\boldsymbol{\theta}(t)$. The particle experiences a net force $\boldsymbol{f}(t)$. This force causes the particle to accelerate:
$$\boldsymbol{f}(t) = \frac{\partial^{2}}{\partial t^{2}} \boldsymbol{\theta}(t).$$



Rather than viewing this as a second-order differential equation of the position, we can introduce the variable $\boldsymbol{v}(t)$ representing the velocity of the particle at time $t$ and rewrite the Newtonian dynamics as a first-order differential equation:
$$\boldsymbol{v}(t) = \frac{\partial}{\partial t} \boldsymbol{\theta}(t), \qquad \boldsymbol{f}(t) = \frac{\partial}{\partial t} \boldsymbol{v}(t).$$



The momentum algorithm then consists of solving the differential equations via numerical simulation. A simple numerical method for solving differential equations is Euler's method, which simply consists of simulating the dynamics defined by the equation by taking small, finite steps in the direction of each gradient.



This explains the basic form of the momentum update, but what specifically are the forces?



One force is proportional to the negative gradient of the cost function: $-\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$.
This force pushes the particle downhill along the cost function surface. The gradient descent algorithm would simply take a single step based on each gradient, but the Newtonian scenario used by the momentum algorithm instead uses this force to alter the velocity of the particle. We can think of the particle as being like a hockey puck sliding down an icy surface. Whenever it descends a steep part of the surface, it gathers speed and continues sliding in that direction until it begins to go uphill again.



One other force is necessary. If the only force is the gradient of the cost function, then the particle might never come to rest. Imagine a hockey puck sliding down one side of a valley and straight up the other side, oscillating back and forth forever, assuming the ice is perfectly frictionless. To resolve this problem, we add one other force, proportional to $-\boldsymbol{v}(t)$. In physics terminology, this force corresponds to viscous drag, as if the particle must push through a resistant medium such as syrup. This causes the particle to gradually lose energy over time and eventually converge to a local minimum.



Why do we use $-\boldsymbol{v}(t)$ and viscous drag in particular? Part of the reason to use $-\boldsymbol{v}(t)$ is mathematical convenience: an integer power of the velocity is easy to work with. However, other physical systems have other kinds of drag based on other integer powers of the velocity. For example, a particle traveling through the air experiences turbulent drag, with force proportional to the square of the velocity, while a particle moving along the ground experiences dry friction, with a force of constant magnitude. We can reject each of these options. Turbulent drag, proportional to the square of the velocity, becomes very weak when the velocity is small. It is not powerful enough to force the particle to come to rest. A particle with a non-zero initial velocity that experiences only the force of turbulent drag will move away from its initial position forever, with the distance from the starting point growing like $O(\log t)$. We must therefore use a lower power of the velocity. If we use a power of zero, representing dry friction, then the force is too strong. When the force due to the gradient of the cost function is small but non-zero, the constant force due to friction can cause the particle to come to rest before reaching a local minimum. Viscous drag avoids both of these problems: it is weak enough that the gradient can continue to cause motion until a minimum is reached, but strong enough to prevent motion if the gradient does not justify moving.



Nesterov Momentum



Sutskever et al. (2013) introduced a variant of the momentum algorithm that was inspired by Nesterov's accelerated gradient method (Nesterov, 1983, 2004). The update rules in this case are given by:
$$\boldsymbol{v} \leftarrow \alpha \boldsymbol{v} - \epsilon \nabla_{\boldsymbol{\theta}}\left[\frac{1}{m} \sum_{i=1}^{m} L\left(\boldsymbol{f}\left(\boldsymbol{x}^{(i)}; \boldsymbol{\theta} + \alpha \boldsymbol{v}\right), \boldsymbol{y}^{(i)}\right)\right],$$
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \boldsymbol{v},$$
where the parameters $\alpha$ and $\epsilon$ play a similar role as in the standard momentum method. The difference between Nesterov momentum and standard momentum is where the gradient is evaluated. With Nesterov momentum, the gradient is evaluated after the current velocity is applied. Thus one can interpret Nesterov momentum as attempting to add a correction factor to the standard method of momentum. The complete Nesterov momentum algorithm is presented in algorithm 8.3.



In the convex batch gradient case, Nesterov momentum brings the rate of convergence of the excess error from $O(1/k)$ (after $k$ steps) to $O(1/k^{2})$, as shown by Nesterov (1983). Unfortunately, in the stochastic gradient case, Nesterov momentum does not improve the rate of convergence.
[Algorithm 8.3: SGD with Nesterov momentum]
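A minimal NumPy sketch of the Nesterov variant, using the same hypothetical helpers as before; the only change from standard momentum is that the gradient is evaluated at the interim point theta + alpha * v:

```python
import numpy as np

def sgd_nesterov(theta, grad_minibatch, sample_minibatch, eps=0.01, alpha=0.9,
                 num_iters=1000):
    """SGD with Nesterov momentum: evaluate the gradient after applying the
    current velocity, then update the velocity and the parameters."""
    v = np.zeros_like(theta)
    for _ in range(num_iters):
        interim = theta + alpha * v                        # look ahead along the velocity
        g = grad_minibatch(interim, sample_minibatch())    # gradient at the interim point
        v = alpha * v - eps * g
        theta = theta + v
    return theta
```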



Parameter Initialization Strategies



Some optimization algorithms are not iterative by nature and simply solve for a solution point.
Other optimization algorithms are iterative by nature but, when applied to the right class of optimization problems, converge to acceptable solutions in an acceptable amount of time regardless of initialization.
Deep learning training algorithms usually do not have either of these luxuries.



Training algorithms for deep learning models are usually iterative in nature and thus require the user to specify some initial point from which to begin the iterations. Moreover, training deep models is a sufficiently difficult task that most algorithms are strongly affected by the choice of initialization.



The initial point can determine whether the algorithm converges at all, with some initial points being so unstable that the algorithm encounters numerical difficulties and fails altogether.
When learning does converge, the initial point can determine how quickly learning converges and whether it converges to a point with high or low cost.
Also, points of comparable cost can have wildly varying generalization error, and the initial point can affect the generalization as well.



Modern initialization strategies are simple and heuristic. Designing improved initialization strategies is a difficult task because neural network optimization is not yet well understood. Most initialization strategies are based on achieving some nice properties when the network is initialized. However, we do not have a good understanding of which of these properties are preserved under which circumstances after learning begins to proceed.



Perhaps the only property known with complete certainty is that the initial parameters need to “break symmetry” between different units.
If two hidden units with the same activation function are connected to the same inputs, then these units must have different initial parameters. If they have the same initial parameters, then a deterministic learning algorithm applied to a deterministic cost and model will constantly update both of these units in the same way. Even if the model or training algorithm is capable of using stochasticity to compute different updates for different units (for example, if one trains with dropout), it is usually best to initialize each unit to compute a different function from all of the other units.
This may help to make sure that no input patterns are lost in the null space of forward propagation and no gradient patterns are lost in the null space of back-propagation.



Typically, we set the biases for each unit to heuristically chosen constants, and initialize only the weights randomly. Extra parameters, such as parameters encoding the conditional variance of a prediction, are usually set to heuristically chosen constants much like the biases are.



We almost always initialize all the weights in the model to values drawn randomly from a Gaussian or uniform distribution. The choice of Gaussian or uniform distribution does not seem to matter very much, but has not been exhaustively studied.



Larger initial weights will yield a stronger symmetry-breaking effect, helping to avoid redundant units.
They also help to avoid losing signal during forward or back-propagation through the linear component of each layer: larger values in the matrix result in larger outputs of matrix multiplication.
Initial weights that are too large may, however, result in exploding values during forward propagation or back-propagation. In recurrent networks, large weights can also result in chaos (such extreme sensitivity to small perturbations of the input that the behavior of the deterministic forward propagation procedure appears random).
To some extent, the exploding gradient problem can be mitigated by gradient clipping (thresholding the values of the gradients before performing a gradient descent step). Large weights may also result in extreme values that cause the activation function to saturate, causing complete loss of gradient through saturated units. These competing factors determine the ideal initial scale of the weights.
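As an aside, gradient clipping as mentioned above can be sketched in a couple of lines. Clipping by global norm (shown here) is one common variant, and the threshold value is illustrative:

```python
import numpy as np

def clip_gradient_by_norm(g, threshold=1.0):
    """Rescale the gradient if its norm exceeds the threshold,
    so a single exploding step cannot move the parameters too far."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g
```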



The perspectives of regularization and optimization can give very different insights into how we should initialize a network.
The optimization perspective suggests that the weights should be large enough to propagate information successfully, but some regularization concerns encourage making them smaller.



The use of an optimization algorithm such as stochastic gradient descent that makes small incremental changes to the weights and tends to halt in areas that are nearer to the initial parameters (whether due to getting stuck in a region of low gradient, or due to triggering some early stopping criterion based on overfitting) expresses a prior that the final parameters should be close to the initial parameters.



Recall from section 7.8 that gradient descent with early stopping is equivalent to weight decay for some models. In the general case, gradient descent with early stopping is not the same as weight decay, but it does provide a loose analogy for thinking about the effect of initialization.
We can think of initializing the parameters $\boldsymbol{\theta}$ to $\boldsymbol{\theta}_0$ as being similar to imposing a Gaussian prior $p(\boldsymbol{\theta})$ with mean $\boldsymbol{\theta}_0$. From this point of view, it makes sense to choose $\boldsymbol{\theta}_0$ to be near 0. This prior says that it is more likely that units do not interact with each other than that they do interact. Units interact only if the likelihood term of the objective function expresses a strong preference for them to interact.
On the other hand, if we initialize $\boldsymbol{\theta}_0$ to large values, then our prior specifies which units should interact with each other, and how they should interact.



Some heuristics are available for choosing the initial scale of the weights.



One heuristic is to initialize the weights of a fully connected layer with $m$ inputs and $n$ outputs by sampling each weight from $U\left(-\frac{1}{\sqrt{m}}, \frac{1}{\sqrt{m}}\right)$, while Glorot and Bengio (2010) suggest using the normalized initialization
$$W_{i,j} \sim U\left(-\sqrt{\frac{6}{m+n}}, \sqrt{\frac{6}{m+n}}\right).$$
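A minimal NumPy sketch of both heuristics for a fully connected layer with m inputs and n outputs:

```python
import numpy as np

def uniform_init(m, n, rng=np.random.default_rng()):
    """Sample each weight from U(-1/sqrt(m), 1/sqrt(m))."""
    limit = 1.0 / np.sqrt(m)
    return rng.uniform(-limit, limit, size=(m, n))

def normalized_init(m, n, rng=np.random.default_rng()):
    """Normalized (Glorot-style) initialization: U(-sqrt(6/(m+n)), sqrt(6/(m+n)))."""
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))
```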



This latter heuristic is designed to compromise between the goal of initializing all layers to have the same activation variance and the goal of initializing all layers to have the same gradient variance. The formula is derived using the assumption that the network consists only of a chain of matrix multiplications, with no nonlinearities.



Real neural networks obviously violate this assumption, but many strategies designed for the linear model perform reasonably well on its nonlinear counterparts.



Saxe et al. (2013) recommend initializing to random orthogonal matrices, with a carefully chosen scaling or gain factor $g$ that accounts for the nonlinearity applied at each layer. They derive specific values of the scaling factor for different types of nonlinear activation functions. This initialization scheme is also motivated by a model of a deep network as a sequence of matrix multiplies without nonlinearities. Under such a model, this initialization scheme guarantees that the total number of training iterations required to reach convergence is independent of depth.
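A rough sketch of an orthogonal initializer via a QR decomposition of a random Gaussian matrix; the gain factor here is a free parameter, not one of the activation-specific values derived by Saxe et al.:

```python
import numpy as np

def orthogonal_init(m, n, gain=1.0, rng=np.random.default_rng()):
    """Initialize an m-by-n weight matrix whose rows or columns are orthonormal,
    scaled by a gain factor."""
    a = rng.normal(size=(max(m, n), min(m, n)))
    q, r = np.linalg.qr(a)                     # q has orthonormal columns
    q = q * np.sign(np.diag(r))                # fix the sign ambiguity of QR
    w = q if m >= n else q.T
    return gain * w
```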



One drawback to scaling rules that set all of the initial weights to have the same standard deviation, such as $\frac{1}{\sqrt{m}}$, is that every individual weight becomes extremely small when the layers become large. Martens (2010) introduced an alternative initialization scheme called sparse initialization, in which each unit is initialized to have exactly $k$ non-zero weights. The idea is to keep the total amount of input to the unit independent of the number of inputs $m$ without making the magnitude of individual weight elements shrink with $m$. Sparse initialization helps to achieve more diversity among the units at initialization time. However, it also imposes a very strong prior on the weights that are chosen to have large Gaussian values. Because it takes a long time for gradient descent to shrink “incorrect” large values, this initialization scheme can cause problems for units such as maxout units that have several filters that must be carefully coordinated with each other.
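A rough sketch of sparse initialization; the choice of k and the unit Gaussian for the non-zero entries are assumptions for the example, since the text only specifies "exactly k non-zero weights":

```python
import numpy as np

def sparse_init(m, n, k=15, rng=np.random.default_rng()):
    """Give each of the n output units exactly k non-zero incoming weights,
    chosen at random among its m inputs and drawn from a unit Gaussian."""
    W = np.zeros((m, n))
    for j in range(n):
        nonzero_rows = rng.choice(m, size=k, replace=False)
        W[nonzero_rows, j] = rng.normal(size=k)
    return W
```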



So far we have focused on the initialization of the weights. Fortunately, initialization of other parameters is typically easier.



The approach for setting the biases must be coordinated with the approach for setting the weights.
Setting the biases to zero is compatible with most weight initialization schemes.
There are a few situations where we may set some biases to non-zero values:





If a bias is for an output unit, then it is often beneficial to initialize the bias to obtain the right marginal statistics of the output. To do this, we assume that the initial weights are small enough that the output of the unit is determined only by the bias. This justifies setting the bias to the inverse of the activation function applied to the marginal statistics of the output in the training set.
For example, if the output is a distribution over classes and this distribution is highly skewed, with the marginal probability of class $i$ given by element $c_i$ of some vector $\boldsymbol{c}$, then we can set the bias vector $\boldsymbol{b}$ by solving the equation $\operatorname{softmax}(\boldsymbol{b}) = \boldsymbol{c}$.
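A short sketch of solving this equation: because the softmax is invariant to adding a constant to every entry of b, taking b = log(c) elementwise gives softmax(b) = c. The class probabilities below are illustrative.

```python
import numpy as np

c = np.array([0.7, 0.2, 0.1])                  # marginal class probabilities from the training set
b = np.log(c)                                  # any additive shift of b gives the same softmax
softmax = np.exp(b) / np.exp(b).sum()
assert np.allclose(softmax, c)                 # softmax(b) reproduces the marginals
```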



Sometimes we may want to choose the bias to avoid causing too much saturation at initialization.
For example, we may set the bias of a ReLU hidden unit to 0.1 rather than 0 to avoid saturating the ReLU at initialization. However, this approach is not compatible with weight initialization schemes that do not expect strong input from the biases.
For example, it is not recommended for use with random walk initialization (Sussillo, 2014).



Sometimes a unit controls whether other units are able to participate in a function. In such situations, we have a unit with output $u$ and another unit $h \in [0, 1]$, and they are multiplied together to produce an output $uh$. We can view $h$ as a gate that determines whether $uh \approx u$ or $uh \approx 0$. In these situations, we want to set the bias for $h$ so that $h \approx 1$ most of the time at initialization. Otherwise $u$ does not have a chance to learn. For example, Jozefowicz et al. (2015) advocate setting the bias to 1 for the forget gate of the LSTM model, described in section 10.10.




  • Another common type of parameter is a variance or precision parameter. For example, we can perform linear regression with a conditional variance estimate using the model
    $$p(y \mid \boldsymbol{x}) = \mathcal{N}\left(y \mid \boldsymbol{w}^{T}\boldsymbol{x} + b, 1/\beta\right)$$
    where $\beta$ is a precision parameter. We can usually initialize variance or precision parameters to 1 safely.
  • Another approach is to assume the initial weights are close enough to zero that the biases may be set while ignoring the effect of the weights, then set the biases to produce the correct marginal mean of the output, and set the variance parameters to the marginal variance of the output in the training set.
  • Besides these simple constant or random methods of initializing model parameters, it is possible to initialize model parameters using machine learning. A common strategy discussed in part III of this book is to initialize a supervised model with the parameters learned by an unsupervised model trained on the same inputs. One can also perform supervised training on a related task. Even performing supervised training on an unrelated task can sometimes yield an initialization that offers faster convergence than a random initialization.

Algorithms with Adaptive Learning Rates



Neural network researchers have long realized that the learning rate is reliably one of the most difficult hyperparameters to set, because it significantly affects model performance.



The delta-bar-delta algorithm (Jacobs, 1988) is an early heuristic approach to adapting individual learning rates for model parameters during training. The approach is based on a simple idea:
If the partial derivative of the loss, with respect to a given model parameter, remains the same sign, then the learning rate should increase.
If the partial derivative with respect to that parameter changes sign, then the learning rate should decrease.
Of course, this kind of rule can only be applied to full batch optimization.
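A simplified NumPy sketch of the sign-based idea (not the exact delta-bar-delta rule, which compares the gradient against a running average of past gradients); the adjustment constants kappa and phi are illustrative:

```python
import numpy as np

def sign_adaptive_step(theta, grad, lr, prev_grad, kappa=0.01, phi=0.9):
    """Full-batch heuristic: grow a parameter's learning rate additively while its
    gradient keeps the same sign, shrink it multiplicatively when the sign flips."""
    same_sign = np.sign(grad) == np.sign(prev_grad)
    lr = np.where(same_sign, lr + kappa, lr * phi)   # per-parameter learning rates
    theta = theta - lr * grad
    return theta, lr
```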



More recently, a number of incremental (or mini-batch-based) methods have been introduced that adapt the learning rates of model parameters. This section will briefly review a few of these algorithms.



AdaGrad



The AdaGrad algorithm, shown in algorithm 8.4, individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all of their historical squared values (Duchi et al., 2011).
The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate,
while parameters with small partial derivatives have a relatively small decrease in their learning rate.
The net effect is greater progress in the more gently sloped directions of parameter space.
[Algorithm 8.4: the AdaGrad algorithm]
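A minimal NumPy sketch of the AdaGrad update, reusing the hypothetical helpers from the earlier sketches; the small constant delta is included for numerical stability, and its value is a common choice rather than one dictated by the text:

```python
import numpy as np

def adagrad(theta, grad_minibatch, sample_minibatch, eps=0.01, delta=1e-7,
            num_iters=1000):
    """AdaGrad: accumulate squared gradients and scale each parameter's step
    by the inverse square root of its accumulated value."""
    r = np.zeros_like(theta)                       # accumulated squared gradients
    for _ in range(num_iters):
        g = grad_minibatch(theta, sample_minibatch())
        r = r + g * g
        theta = theta - eps * g / (delta + np.sqrt(r))
    return theta
```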



In the context of convex optimization, the AdaGrad algorithm enjoys some desirable theoretical properties. However, empirically it has been found that, for training deep neural network models, the accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the effective learning rate. AdaGrad performs well for some but not all deep learning models.



RMSProp



The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average.



AdaGrad is designed to converge rapidly when applied to a convex function. When applied to a non-convex function to train a neural network, the learning trajectory may pass through many different structures and eventually arrive at a region that is a locally convex bowl. AdaGrad shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at such a convex structure.



RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.



RMSProp is shown in its standard form in algorithm 8.5 and combined with Nesterov momentum in algorithm 8.6. Compared to AdaGrad, the use of the moving average introduces a new hyperparameter, $\rho$, that controls the length scale of the moving average.
[Algorithm 8.5: the RMSProp algorithm, standard form]
[Algorithm 8.6: RMSProp combined with Nesterov momentum]
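A minimal NumPy sketch of standard RMSProp; the only change from the AdaGrad sketch above is the exponentially weighted accumulation controlled by rho (the default values shown are illustrative):

```python
import numpy as np

def rmsprop(theta, grad_minibatch, sample_minibatch, eps=0.001, rho=0.9,
            delta=1e-6, num_iters=1000):
    """RMSProp: replace AdaGrad's sum of squared gradients with an
    exponentially decaying average, so distant history is forgotten."""
    r = np.zeros_like(theta)
    for _ in range(num_iters):
        g = grad_minibatch(theta, sample_minibatch())
        r = rho * r + (1 - rho) * g * g            # decaying average of squared gradients
        theta = theta - eps * g / np.sqrt(delta + r)
    return theta
```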



Empirically, RMSProp has been shown to be an effective and practical optimization algorithm for deep neural networks. It is currently one of the go-to optimization methods being employed routinely by deep learning practitioners.



Adam


  • Adam (Kingma and Ba, 2014) is yet another adaptive learning rate optimization algorithm and is presented in algorithm 8.7. The name “Adam” derives from the phrase “adaptive moments.” In the context of the earlier algorithms, it is perhaps best seen as a variant on the combination of RMSProp and momentum with a few important distinctions.
  • First, in Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient. The most straightforward way to add momentum to RMSProp is to apply momentum to the rescaled gradients. The use of momentum in combination with rescaling does not have a clear theoretical motivation.
  • Second, Adam includes bias corrections to the estimates of both the first-order moments (the momentum term) and the (uncentered) second-order moments to account for their initialization at the origin (see algorithm 8.7 and the sketch below). RMSProp also incorporates an estimate of the (uncentered) second-order moment; however, it lacks the correction factor. Thus, unlike in Adam, the RMSProp second-order moment estimate may have high bias early in training. Adam is generally regarded as being fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default.
    [Algorithm 8.7: the Adam algorithm]
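A minimal NumPy sketch of the Adam update with bias-corrected first and second moment estimates, again using the hypothetical helpers from above; the default hyperparameter values follow the commonly suggested defaults (step size 0.001, rho1 = 0.9, rho2 = 0.999, delta = 1e-8):

```python
import numpy as np

def adam(theta, grad_minibatch, sample_minibatch, eps=0.001, rho1=0.9, rho2=0.999,
         delta=1e-8, num_iters=1000):
    """Adam: exponentially weighted first and second moment estimates of the
    gradient, each corrected for the bias caused by initializing them at zero."""
    s = np.zeros_like(theta)                       # first moment (momentum-like term)
    r = np.zeros_like(theta)                       # second moment (uncentered)
    for t in range(1, num_iters + 1):
        g = grad_minibatch(theta, sample_minibatch())
        s = rho1 * s + (1 - rho1) * g
        r = rho2 * r + (1 - rho2) * g * g
        s_hat = s / (1 - rho1 ** t)                # bias-corrected first moment
        r_hat = r / (1 - rho2 ** t)                # bias-corrected second moment
        theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta
```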

Choosing the Right Optimization Algorithm


  • Unfortunately, there is currently no consensus on which optimization algorithm to choose. Schaul et al. (2014) presented a valuable comparison of a large number of optimization algorithms across a wide range of learning tasks. While the results suggest that the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly, no single best algorithm has emerged.
  • Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta and Adam. The choice of which algorithm to use, at this point, seems to depend largely on the user’s familiarity with the algorithm (for ease of hyperparameter tuning).