
Encoder-Decoder Sequence-to-Sequence Architectures

  • We have seen in figure 10.5 how an RNN can map an input sequence to a fixed-size vector. We have seen in figure 10.9 how an RNN can map a fixed-size vector to a sequence. We have seen in figures 10.3, 10.4, 10.10 and 10.11 how an RNN can map an input sequence to an output sequence of the same length.

  • Here we discuss how an RNN can be trained to map an input sequence to an output sequence which is not necessarily of the same length. This comes up in many applications, such as speech recognition, machine translation or question answering, where the input and output sequences in the training set are generally not of the same length (although their lengths might be related).

  • We often call the input to the RNN the “context.” We want to produce a representation of this context, $C$. The context $C$ might be a vector or sequence of vectors that summarize the input sequence $\boldsymbol{X} = (\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(n_x)})$.

  • The simplest RNN architecture for mapping a variable-length sequence to another variable-length sequence was first proposed by Cho et al. (2014a) and shortly after by Sutskever et al. (2014), who independently developed that architecture and were the first to obtain state-of-the-art translation using this approach. The former system is based on scoring proposals generated by another machine translation system, while the latter uses a standalone recurrent network to generate the translations. These authors respectively called this architecture, illustrated in figure 10.12, the encoder-decoder or sequence-to-sequence architecture. The idea is very simple:
    (1) an encoder or reader or input RNN processes the input sequence. The encoder emits the context $C$, usually as a simple function of its final hidden state.
    (2) a decoder or writer or output RNN is conditioned on that fixed-length vector (just like in figure 10.9) to generate the output sequence $\boldsymbol{Y} = (\boldsymbol{y}^{(1)}, \ldots, \boldsymbol{y}^{(n_y)})$.

  • The innovation of this kind of architecture over those presented in earlier sections of this chapter is that the lengths $n_x$ and $n_y$ can vary from each other, while previous architectures constrained $n_x = n_y = \tau$. In a sequence-to-sequence architecture, the two RNNs are trained jointly to maximize the average of $\log P\left(\boldsymbol{y}^{(1)}, \ldots, \boldsymbol{y}^{(n_y)} \mid \boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(n_x)}\right)$ over all the pairs of $\boldsymbol{x}$ and $\boldsymbol{y}$ sequences in the training set. The last state $\boldsymbol{h}_{n_x}$ of the encoder RNN is typically used as a representation $C$ of the input sequence that is provided as input to the decoder RNN, as in the sketch below.
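
  • To make the two-step recipe concrete, here is a minimal NumPy sketch of the forward pass of such an encoder-decoder pair, kept deliberately close to the description above. All names, dimensions, and the absence of a softmax output are illustrative assumptions rather than the notation of any particular implementation; the encoder's final hidden state plays the role of the context $C$, which conditions the decoder through its initial state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from the text).
d_x, d_h, d_y = 5, 8, 4           # input, hidden, and output sizes
n_x, n_y = 7, 3                   # the two sequence lengths need not match

# Encoder parameters: input-to-hidden and hidden-to-hidden weights.
U_enc = rng.normal(scale=0.1, size=(d_h, d_x))
W_enc = rng.normal(scale=0.1, size=(d_h, d_h))

# Decoder parameters: hidden-to-hidden, output-feedback, and hidden-to-output weights.
W_dec = rng.normal(scale=0.1, size=(d_h, d_h))
V_dec = rng.normal(scale=0.1, size=(d_h, d_y))
O_dec = rng.normal(scale=0.1, size=(d_y, d_h))

def encode(X):
    """Read the whole input sequence; the final hidden state is the context C."""
    h = np.zeros(d_h)
    for x_t in X:
        h = np.tanh(U_enc @ x_t + W_enc @ h)
    return h                      # C = h^{(n_x)}

def decode(C, n_steps):
    """Generate n_steps outputs, conditioned on C via the initial hidden state."""
    h = C
    y = np.zeros(d_y)
    outputs = []
    for _ in range(n_steps):
        h = np.tanh(W_dec @ h + V_dec @ y)
        y = O_dec @ h             # a real model would apply, e.g., a softmax here
        outputs.append(y)
    return outputs

X = [rng.normal(size=d_x) for _ in range(n_x)]
Y_hat = decode(encode(X), n_y)    # n_y need not equal n_x
```

  • Training would back-propagate the conditional log-likelihood of the target sequence through both networks jointly; only the forward pass is sketched here.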

  • If the context $C$ is a vector, then the decoder RNN is simply a vector-to-sequence RNN as described in section 10.2.4. As we have seen, there are at least two ways for a vector-to-sequence RNN to receive input.
    The input can be provided as the initial state of the RNN, or the input can be connected to the hidden units at each time step. These two ways can also be combined.
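
  • The decoder sketched above uses the first option, providing $C$ as the initial state. A hedged sketch of the second option, continuing the hypothetical names from the previous snippet (so it is not self-contained on its own), adds an extra, purely illustrative weight matrix R_dec that connects $C$ to the hidden units at every time step:

```python
# Second option: connect C to the decoder hidden units at each time step.
# R_dec is an additional, purely illustrative parameter.
R_dec = rng.normal(scale=0.1, size=(d_h, d_h))

def decode_with_context_input(C, n_steps):
    h = np.zeros(d_h)             # C no longer supplies the initial state...
    y = np.zeros(d_y)
    outputs = []
    for _ in range(n_steps):
        # ...instead it enters every hidden-state update.
        h = np.tanh(W_dec @ h + V_dec @ y + R_dec @ C)
        y = O_dec @ h
        outputs.append(y)
    return outputs
```

  • Combining the two options simply means initializing h with C while keeping the R_dec @ C term.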

  • There is no constraint that the encoder must have the same size of hidden layer as the decoder.

  • One clear limitation of this architecture is when the context $C$ output by the encoder RNN has a dimension that is too small to properly summarize a long sequence. This phenomenon was observed by Bahdanau et al. (2015) in the context of machine translation. They proposed to make $C$ a variable-length sequence rather than a fixed-size vector. Additionally, they introduced an attention mechanism that learns to associate elements of the sequence $C$ to elements of the output sequence. See section 12.4.5.1 for more details.

Deep Recurrent Networks

  • The computation in most RNNs can be decomposed into three blocks of parameters and associated transformations:
  1. from the input to the hidden state,
  2. from the previous hidden state to the next hidden state, and
  3. from the hidden state to the output.
  • With the RNN architecture of figure 10.3, each of these three blocks is associated with a single weight matrix. In other words, when the network is unfolded, each of these corresponds to a shallow transformation. By a shallow transformation, we mean a transformation that would be represented by a single layer within a deep MLP.
    Typically this is a transformation represented by a learned affine transformation followed by a fixed nonlinearity.
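
  • In code, the three blocks are simply three weight matrices applied inside a single update. The following sketch of the shallow case uses the conventional (but here assumed) names $U$, $W$, and $V$ for the input-to-hidden, hidden-to-hidden, and hidden-to-output transformations:

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_h, d_y = 3, 6, 2                       # illustrative sizes

U = rng.normal(scale=0.1, size=(d_h, d_x))    # block 1: input -> hidden
W = rng.normal(scale=0.1, size=(d_h, d_h))    # block 2: previous hidden -> next hidden
V = rng.normal(scale=0.1, size=(d_y, d_h))    # block 3: hidden -> output
b, c = np.zeros(d_h), np.zeros(d_y)

def rnn_step(x_t, h_prev):
    # Each block is one learned affine transformation followed by a fixed nonlinearity
    # (or, for the output, just the affine part).
    h_t = np.tanh(U @ x_t + W @ h_prev + b)
    o_t = V @ h_t + c
    return h_t, o_t
```

  • The deep variants discussed next replace one or more of these single affine-plus-nonlinearity steps with a small MLP.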

  • Would it be advantageous to introduce depth in each of these operations? Experimental evidence (Graves et al., 2013; Pascanu et al., 2014a) strongly suggests so. The experimental evidence is in agreement with the idea that we need enough depth in order to perform the required mappings. See also Schmidhuber (1992), El Hihi and Bengio (1996), or Jaeger (2007a) for earlier work on deep RNNs.

  • Graves et al. (2013) were the first to show a significant benefit of decomposing the state of an RNN into multiple layers as in figure 10.13 (left). We can think of the lower layers in the hierarchy depicted in figure 10.13a as playing a role in transforming the raw input into a representation that is more appropriate, at the higher levels of the hidden state. Pascanu et al. (2014a) go a step further and propose to have a separate MLP (possibly deep) for each of the three blocks enumerated above, as illustrated in figure 10.13b. Considerations of representational capacity suggest to allocate enough capacity in each of these three steps, but doing so by adding depth may hurt learning by making optimization difficult.

  • In general, it is easier to optimize shallower architectures, and adding the extra depth of figure 10.13b makes the shortest path from a variable in time step $t$ to a variable in time step $t+1$ become longer. For example, if an MLP with a single hidden layer is used for the state-to-state transition, we have doubled the length of the shortest path between variables in any two different time steps, compared with the ordinary RNN of figure 10.3. However, as argued by Pascanu et al. (2014a), this can be mitigated by introducing skip connections in the hidden-to-hidden path, as illustrated in figure 10.13c.
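
  • A hedged sketch of what a deeper state-to-state transition with a skip connection might look like, continuing the names from the previous snippet: the single $W$ multiplication is replaced by a one-hidden-layer MLP, while an extra (assumed) matrix W_skip restores a length-one path from $h^{(t-1)}$ to $h^{(t)}$.

```python
# Deeper hidden-to-hidden transition plus a skip connection (all sizes illustrative).
d_m = 10                                         # intermediate MLP layer size
W1 = rng.normal(scale=0.1, size=(d_m, d_h))      # hidden -> intermediate
W2 = rng.normal(scale=0.1, size=(d_h, d_m))      # intermediate -> hidden
W_skip = rng.normal(scale=0.1, size=(d_h, d_h))  # direct path, keeping the shortest route short

def deep_rnn_step(x_t, h_prev):
    deep_part = W2 @ np.tanh(W1 @ h_prev)        # the extra depth that lengthens paths over time
    h_t = np.tanh(U @ x_t + deep_part + W_skip @ h_prev + b)
    o_t = V @ h_t + c
    return h_t, o_t
```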

Recursive Neural Networks


  • Recursive neural networks represent yet another generalization of recurrent networks, with a different kind of computational graph, which is structured as a deep tree, rather than the chain-like structure of RNNs. The typical computational graph for a recursive network is illustrated in figure 10.14. Recursive neural networks were introduced by Pollack (1990), and their potential use for learning to reason was described by Bottou (2011). Recursive networks have been successfully applied to processing data structures as input to neural nets (Frasconi et al., 1997, 1998), in natural language processing (Socher et al., 2011a,c, 2013a), as well as in computer vision (Socher et al., 2011b).

  • One clear advantage of recursive nets over recurrent nets is that for a sequence of the same length $\tau$, the depth (measured as the number of compositions of nonlinear operations) can be drastically reduced from $\tau$ to $O(\log \tau)$, which might help deal with long-term dependencies.
    An open question is how to best structure the tree. One option is to have a tree structure which does not depend on the data, such as a balanced binary tree. In some application domains, external methods can suggest the appropriate tree structure. For example, when processing natural language sentences, the tree structure for the recursive network can be fixed to the structure of the parse tree of the sentence provided by a natural language parser (Socher et al., 2011a, 2013a). Ideally, one would like the learner itself to discover and infer the tree structure that is appropriate for any given input, as suggested by Bottou (2011).
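
  • A minimal sketch of a recursive net over a fixed balanced binary tree, assuming the simplest composition rule (shared weights applied to the two children at every internal node); the names, sizes, and the balanced-tree choice are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6                                          # representation size (assumed)
W_L = rng.normal(scale=0.1, size=(d, d))       # weights applied to the left child
W_R = rng.normal(scale=0.1, size=(d, d))       # weights applied to the right child
b = np.zeros(d)

def compose(left, right):
    # One shared composition, reused at every internal node of the tree.
    return np.tanh(W_L @ left + W_R @ right + b)

def recursive_net(leaves):
    """Reduce a list of leaf vectors with a balanced binary tree,
    so the depth is O(log tau) rather than tau."""
    nodes = list(leaves)
    while len(nodes) > 1:
        paired = [compose(nodes[i], nodes[i + 1])
                  for i in range(0, len(nodes) - 1, 2)]
        if len(nodes) % 2 == 1:                # an odd leftover node is carried up unchanged
            paired.append(nodes[-1])
        nodes = paired
    return nodes[0]                            # root representation of the whole sequence

leaves = [rng.normal(size=d) for _ in range(7)]
root = recursive_net(leaves)                   # 7 leaves are combined in only 3 levels
```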

  • Many variants of the recursive net idea are possible. For example, Frasconi et al. (1997) and Frasconi et al. (1998) associate the data with a tree structure, and associate the inputs and targets with individual nodes of the tree. The computation performed by each node does not have to be the traditional artificial neuron computation (affine transformation of all inputs followed by a monotone nonlinearity).
    For example, Socher et al. (2013a) propose using tensor operations and bilinear forms, which have previously been found useful to model relationships between concepts (Weston et al., 2010; Bordes et al., 2012) when the concepts are represented by continuous vectors (embeddings).

The Challenge of Long-Term Dependencies

  • The mathematical challenge of learning long-term dependencies in recurrent networks was introduced in section 8.2.5. The basic problem is that gradients propagated over many stages tend to either vanish (most of the time) or explode (rarely, but with much damage to the optimization). Even if we assume that the parameters are such that the recurrent network is stable (can store memories, with gradients not exploding), the difficulty with long-term dependencies arises from the exponentially smaller weights given to long-term interactions (involving the multiplication of many Jacobians) compared to short-term ones. Many other sources provide a deeper treatment (Hochreiter, 1991; Doya, 1993; Bengio et al., 1994; Pascanu et al., 2013). In this section, we describe the problem in more detail. The remaining sections describe approaches to overcoming the problem.

  • Recurrent networks involve the composition of the same function multiple times, once per time step. These compositions can result in extremely nonlinear behavior, as illustrated in figure 10.15.

  • In particular, the function composition employed by recurrent neural networks somewhat resembles matrix multiplication. We can think of the recurrence relation
    $\boldsymbol{h}^{(t)} = \boldsymbol{W}^{\top} \boldsymbol{h}^{(t-1)}$
    as a very simple recurrent neural network lacking a nonlinear activation function, and lacking inputs $\boldsymbol{x}$. As described in section 8.2.5, this recurrence relation essentially describes the power method. It may be simplified to
    $\boldsymbol{h}^{(t)} = \left(\boldsymbol{W}^{t}\right)^{\top} \boldsymbol{h}^{(0)},$
    and if $\boldsymbol{W}$ admits an eigendecomposition of the form
    $\boldsymbol{W} = \boldsymbol{Q} \boldsymbol{\Lambda} \boldsymbol{Q}^{\top}$
    with orthogonal $\boldsymbol{Q}$, the recurrence may be simplified further to
    $\boldsymbol{h}^{(t)} = \boldsymbol{Q}^{\top} \boldsymbol{\Lambda}^{t} \boldsymbol{Q} \boldsymbol{h}^{(0)}.$

  • The eigenvalues are raised to the power of $t$, causing eigenvalues with magnitude less than one to decay to zero and eigenvalues with magnitude greater than one to explode.
    Any component of $\boldsymbol{h}^{(0)}$ that is not aligned with the largest eigenvector will eventually be discarded.
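
  • A short numerical check of this behavior, using an arbitrary symmetric $\boldsymbol{W}$ so that an eigendecomposition with orthogonal $\boldsymbol{Q}$ is guaranteed to exist; the matrix, its size, and the number of steps are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
W = (A + A.T) / 2                              # symmetric, so W = Q Lambda Q^T with orthogonal Q

eigvals, Q = np.linalg.eigh(W)
h0 = rng.normal(size=4)

t = 50
h_t = np.linalg.matrix_power(W, t).T @ h0      # h^(t) = (W^t)^T h^(0)

# In the eigenbasis, the i-th coefficient of h^(t) is lambda_i^t times that of h^(0):
# directions with |lambda| < 1 have all but vanished, those with |lambda| > 1 dominate.
coeffs0 = Q.T @ h0
print(np.abs(eigvals) ** t * np.abs(coeffs0))  # per-direction magnitudes after t steps
print(np.linalg.norm(h_t))                     # dominated by the largest |lambda|
```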

  • This problem is particular to recurrent networks. In the scalar case, imagine multiplying a weight $w$ by itself many times. The product $w^{t}$ will either vanish or explode depending on the magnitude of $w$.
    However, if we make a non-recurrent network that has a different weight $w^{(t)}$ at each time step, the situation is different. If the initial state is given by 1, then the state at time $t$ is given by $\prod_{t} w^{(t)}$. Suppose that the $w^{(t)}$ values are generated randomly, independently from one another, with zero mean and variance $v$. The variance of the product is $O(v^{n})$. To obtain some desired variance $v^{*}$ we may choose the individual weights with variance $v = \sqrt[n]{v^{*}}$. Very deep feedforward networks with carefully chosen scaling can thus avoid the vanishing and exploding gradient problem, as argued by Sussillo (2014).
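
  • A quick numerical check of this scaling argument (the depth, target variance, and Gaussian choice are arbitrary illustrations): with zero-mean independent factors, the variance of the product is exactly $v^{n}$, so choosing $v = \sqrt[n]{v^{*}}$ keeps it at $v^{*}$, whereas a fixed per-step variance makes it explode or vanish.

```python
import numpy as np

n = 50                                      # number of factors / "time steps" (illustrative)
v_star = 4.0                                # desired variance of the product (illustrative)

print(2.0 ** n)                             # fixed per-step variance 2.0: product variance ~1e15
print(0.5 ** n)                             # fixed per-step variance 0.5: product variance ~1e-15
print((v_star ** (1.0 / n)) ** n)           # v = n-th root of v*: product variance stays at 4.0

# Small Monte-Carlo sanity check on a short chain (short, because the product of
# many factors is so heavy-tailed that its empirical variance converges slowly):
rng = np.random.default_rng(4)
n_small = 5
v_small = v_star ** (1.0 / n_small)
w = rng.normal(scale=np.sqrt(v_small), size=(1_000_000, n_small))
print(w.prod(axis=1).var())                 # close to v_star = 4.0
```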

  • The vanishing and exploding gradient problem for RNNs was independently discovered by separate researchers (Hochreiter, 1991; Bengio et al., 1993, 1994). One may hope that the problem can be avoided simply by staying in a region of parameter space where the gradients do not vanish or explode. Unfortunately, in order to store memories in a way that is robust to small perturbations, the RNN must enter a region of parameter space where gradients vanish (Bengio et al., 1993, 1994). Specifically, whenever the model is able to represent long-term dependencies, the gradient of a long-term interaction has exponentially smaller magnitude than the gradient of a short-term interaction. This does not mean that it is impossible to learn, but that it might take a very long time to learn long-term dependencies, because the signal about these dependencies will tend to be hidden by the smallest fluctuations arising from short-term dependencies. In practice, the experiments in Bengio et al. (1994) show that as we increase the span of the dependencies that need to be captured, gradient-based optimization becomes increasingly difficult, with the probability of successful training of a traditional RNN via SGD rapidly reaching 0 for sequences of only length 10 or 20.

  • For a deeper treatment of recurrent networks as dynamical systems, see Doya (1993), Bengio et al. (1994), and Siegelmann and Sontag (1995), with a review in Pascanu et al. (2013). The remaining sections of this chapter discuss various approaches that have been proposed to reduce the difficulty of learning long-term dependencies (in some cases allowing an RNN to learn dependencies across hundreds of steps), but the problem of learning long-term dependencies remains one of the main challenges in deep learning.

Echo State Networks

  • The recurrent weights mapping from $\boldsymbol{h}^{(t-1)}$ to $\boldsymbol{h}^{(t)}$ and the input weights mapping from $\boldsymbol{x}^{(t)}$ to $\boldsymbol{h}^{(t)}$ are some of the most difficult parameters to learn in a recurrent network. One proposed approach (Jaeger, 2003; Maass et al., 2002; Jaeger and Haas, 2004; Jaeger, 2007b) to avoiding this difficulty is to set the recurrent weights such that the recurrent hidden units do a good job of capturing the history of past inputs, and learn only the output weights. This is the idea that was independently proposed for echo state networks or ESNs (Jaeger and Haas, 2004; Jaeger, 2007b) and liquid state machines (Maass et al., 2002). The latter is similar, except that it uses spiking neurons (with binary outputs) instead of the continuous-valued hidden units used for ESNs. Both ESNs and liquid state machines are termed reservoir computing (Lukoševičius and Jaeger, 2009) to denote the fact that the hidden units form a reservoir of temporal features which may capture different aspects of the history of inputs.

  • One way to think about these reservoir computing recurrent networks is that they are similar to kernel machines: they map an arbitrary-length sequence (the history of inputs up to time $t$) into a fixed-length vector (the recurrent state $\boldsymbol{h}^{(t)}$), on which a linear predictor (typically a linear regression) can be applied to solve the problem of interest.
    The training criterion may then be easily designed to be convex as a function of the output weights.
    For example, if the output consists of linear regression from the hidden units to the output targets, and the training criterion is mean squared error, then it is convex and may be solved reliably with simple learning algorithms (Jaeger, 2003).
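
  • The recipe is short enough to sketch end to end. In the snippet below the reservoir weights are random and fixed, rescaled to a chosen spectral radius (see the discussion that follows), and only a linear readout is fit, by ridge regression in closed form. The toy task, the sizes, the spectral radius of 1.1, and the ridge coefficient are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
d_in, d_res, T = 1, 200, 1000

# Fixed random input and recurrent weights; only W_out will be learned.
W_in = rng.normal(scale=0.5, size=(d_res, d_in))
W = rng.normal(size=(d_res, d_res))
W *= 1.1 / np.max(np.abs(np.linalg.eigvals(W)))   # rescale to spectral radius 1.1

def run_reservoir(u):
    h = np.zeros(d_res)
    states = []
    for u_t in u:
        h = np.tanh(W_in @ u_t + W @ h)
        states.append(h)
    return np.array(states)                       # shape (T, d_res)

# Toy task: reproduce the input delayed by 10 steps.
u = rng.uniform(-1.0, 1.0, size=(T, d_in))
target = np.roll(u[:, 0], 10)

H = run_reservoir(u)
lam = 1e-6                                        # ridge term keeps the solve well-conditioned
W_out = np.linalg.solve(H.T @ H + lam * np.eye(d_res), H.T @ target)
print(np.mean((H @ W_out - target) ** 2))         # training MSE of the convex readout problem
```

  • Because only W_out is learned and the loss is mean squared error, the training problem is convex and has the closed-form solution used above.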

  • The important question is therefore: how do we set the input and recurrent weights so that a rich set of histories can be represented in the recurrent neural network state? The answer proposed in the reservoir computing literature is to view the recurrent net as a dynamical system, and set the input and recurrent weights such that the dynamical system is near the edge of stability.

  • The original idea was to make the eigenvalues of the Jacobian of the state-to-state transition function be close to 1. As explained in section 8.2.5, an important characteristic of a recurrent network is the eigenvalue spectrum of the Jacobians $\boldsymbol{J}^{(t)} = \frac{\partial \boldsymbol{s}^{(t)}}{\partial \boldsymbol{s}^{(t-1)}}$. Of particular importance is the spectral radius of $\boldsymbol{J}^{(t)}$, defined to be the maximum of the absolute values of its eigenvalues.

  • To understand the effect of the spectral radius, consider the simple case of back-propagation with a Jacobian matrix $\boldsymbol{J}$ that does not change with $t$. This case happens, for example, when the network is purely linear. Suppose that $\boldsymbol{J}$ has an eigenvector $\boldsymbol{v}$ with corresponding eigenvalue $\lambda$. Consider what happens as we propagate a gradient vector backwards through time. If we begin with a gradient vector $\boldsymbol{g}$, then after one step of back-propagation, we will have $\boldsymbol{J}\boldsymbol{g}$, and after $n$ steps we will have $\boldsymbol{J}^{n}\boldsymbol{g}$. Now consider what happens if we instead back-propagate a perturbed version of $\boldsymbol{g}$. If we begin with $\boldsymbol{g} + \delta\boldsymbol{v}$, then after one step, we will have $\boldsymbol{J}(\boldsymbol{g} + \delta\boldsymbol{v})$. After $n$ steps, we will have $\boldsymbol{J}^{n}(\boldsymbol{g} + \delta\boldsymbol{v})$. From this we can see that back-propagation starting from $\boldsymbol{g}$ and back-propagation starting from $\boldsymbol{g} + \delta\boldsymbol{v}$ diverge by $\delta\boldsymbol{J}^{n}\boldsymbol{v}$ after $n$ steps of back-propagation. If $\boldsymbol{v}$ is chosen to be a unit eigenvector of $\boldsymbol{J}$ with eigenvalue $\lambda$, then multiplication by the Jacobian simply scales the difference at each step. The two executions of back-propagation are separated by a distance of $\delta|\lambda|^{n}$. When $\boldsymbol{v}$ corresponds to the largest value of $|\lambda|$, this perturbation achieves the widest possible separation of an initial perturbation of size $\delta$.

  • When $|\lambda| > 1$, the deviation size $\delta|\lambda|^{n}$ grows exponentially large. When $|\lambda| < 1$, the deviation size becomes exponentially small.
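
  • The $\delta|\lambda|^{n}$ separation is easy to verify numerically for a fixed Jacobian (the purely linear case); the matrix, the perturbation size, and the number of steps below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(scale=0.6, size=(5, 5))
J = (A + A.T) / 2                               # symmetric, so the eigenvalues are real

eigvals, eigvecs = np.linalg.eigh(J)
i = np.argmax(np.abs(eigvals))
lam, v = eigvals[i], eigvecs[:, i]              # largest-magnitude eigenpair (unit eigenvector)

g = rng.normal(size=5)
delta, n = 1e-3, 30

a, b = g.copy(), g + delta * v                  # unperturbed and perturbed gradients
for _ in range(n):                              # n steps of multiplication by the Jacobian
    a, b = J @ a, J @ b

print(np.linalg.norm(b - a))                    # equals delta * |lam|**n up to round-off
print(delta * abs(lam) ** n)
```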

  • Of course, this example assumed that the Jacobian was the same at every time step, corresponding to a recurrent network with no nonlinearity. When a nonlinearity is present, the derivative of the nonlinearity will approach zero on many time steps, and help to prevent the explosion resulting from a large spectral radius. Indeed, the most recent work on echo state networks advocates using a spectral radius much larger than unity (Yildiz et al., 2012; Jaeger, 2012).

  • Everything we have said about back-propagation via repeated matrix multiplication applies equally to forward propagation in a network with no nonlinearity, where the state evolves according to $\boldsymbol{h}^{(t+1)} = \boldsymbol{W}^{\top} \boldsymbol{h}^{(t)}$.

  • When a linear map $\boldsymbol{W}^{\top}$ always shrinks $\boldsymbol{h}$ as measured by the $L^{2}$ norm, then we say that the map is contractive. When the spectral radius is less than one, the mapping from $\boldsymbol{h}^{(t)}$ to $\boldsymbol{h}^{(t+1)}$ is contractive, so a small change becomes smaller after each time step. This necessarily makes the network forget information about the past when we use a finite level of precision (such as 32-bit integers) to store the state vector.

  • The Jacobian matrix tells us how a small change of $\boldsymbol{h}^{(t)}$ propagates one step forward, or equivalently, how the gradient on $\boldsymbol{h}^{(t+1)}$ propagates one step backward, during back-propagation. Note that neither $\boldsymbol{W}$ nor $\boldsymbol{J}$ need be symmetric (although they are square and real), so they can have complex-valued eigenvalues and eigenvectors, with imaginary components corresponding to potentially oscillatory behavior (if the same Jacobian were applied iteratively). Even though $\boldsymbol{h}^{(t)}$, or a small variation of $\boldsymbol{h}^{(t)}$ of interest in back-propagation, is real-valued, it can be expressed in such a complex-valued basis. What matters is what happens to the magnitude (complex absolute value) of these possibly complex-valued basis coefficients when we multiply the matrix by the vector. An eigenvalue with magnitude greater than one corresponds to magnification (exponential growth, if applied iteratively), while an eigenvalue with magnitude less than one corresponds to shrinking (exponential decay, if applied iteratively).

  • With a nonlinear map, the Jacobian is free to change at each step. The dynamics therefore become more complicated. However, it remains true that a small initial variation can turn into a large variation after several steps. One difference between the purely linear case and the nonlinear case is that the use of a squashing nonlinearity such as tanh can cause the recurrent dynamics to become bounded. Note that it is possible for back-propagation to retain unbounded dynamics even when forward propagation has bounded dynamics, for example, when a sequence of tanh units are all in the middle of their linear regime and are connected by weight matrices with spectral radius greater than 1. However, it is rare for all of the tanh units to simultaneously lie at their linear activation point.

  • The strategy of echo state networks is simply to fix the weights to have some spectral radius such as 3, where information is carried forward through time but does not explode due to the stabilizing effect of saturating nonlinearities like tanh.

  • More recently, it has been shown that the techniques used to set the weights in ESNs could be used to initialize the weights in a fully trainable recurrent network (with the hidden-to-hidden recurrent weights trained using back-propagation through time), helping to learn long-term dependencies (Sutskever, 2012; Sutskever et al., 2013). In this setting, an initial spectral radius of 1.2 performs well, combined with the sparse initialization scheme described in section 8.4.
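
  • A hedged sketch of that kind of initialization, combining sparse recurrent connectivity with rescaling to a spectral radius of 1.2; the sparsity level, size, and Gaussian draw are illustrative assumptions, and a real system would go on to train these weights with back-propagation through time.

```python
import numpy as np

rng = np.random.default_rng(7)
d_h = 300

# Sparse initialization: most recurrent connections start at exactly zero,
# leaving roughly 15 nonzero incoming weights per unit (an assumed level).
W = rng.normal(size=(d_h, d_h))
W *= rng.random((d_h, d_h)) < (15.0 / d_h)

# Rescale so the spectral radius is 1.2 before training begins.
W *= 1.2 / np.max(np.abs(np.linalg.eigvals(W)))
print(np.max(np.abs(np.linalg.eigvals(W))))    # ~1.2
```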