Linear Factor Models
  • Many of the research frontiers in deep learning involve building a probabilistic model of the input, $p_{\text{model}}(\boldsymbol{x})$. Such a model can, in principle, use probabilistic inference to predict any of the variables in its environment given any of the other variables. Many of these models also have latent variables $\boldsymbol{h}$, with $p_{\text{model}}(\boldsymbol{x}) = \mathbb{E}_{\boldsymbol{h}}\, p_{\text{model}}(\boldsymbol{x} \mid \boldsymbol{h})$.

  • These latent variables provide another means of representing the data. Distributed representations based on latent variables can obtain all of the advantages of representation learning that we have seen with deep feedforward and recurrent networks.

  • In this chapter, we describe some of the simplest probabilistic models with latent variables: linear factor models. These models are sometimes used as building blocks of mixture models (Hinton et al., 1995a; Ghahramani and Hinton, 1996; Roweis et al., 2002) or larger, deep probabilistic models (Tang et al., 2012). They also show many of the basic approaches necessary to build generative models that the more advanced deep models will extend further.

  • A linear factor model is defined by the use of a stochastic, linear decoder function that generates $\boldsymbol{x}$ by adding noise to a linear transformation of $\boldsymbol{h}$.

  • These models are interesting because they allow us to discover explanatory factors that have a simple joint distribution. The simplicity of using a linear decoder made these models some of the first latent variable models to be extensively studied.

  • A linear factor model describes the data generation process as follows.
    First, we sample the explanatory factors $\boldsymbol{h}$ from a distribution
    $\mathbf{h} \sim p(\boldsymbol{h})$
    where $p(\boldsymbol{h})$ is a factorial distribution (that is, one whose components are independent), with $p(\boldsymbol{h}) = \prod_{i} p(h_{i})$, so that it is easy to sample from.
    Next, we sample the real-valued observable variables given the factors:
    $\boldsymbol{x} = \boldsymbol{W}\boldsymbol{h} + \boldsymbol{b} + \text{noise}$
    where the noise is typically Gaussian and diagonal (independent across dimensions). This generative process is illustrated in figure 13.1, and a minimal sampling sketch is given below.
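
  • The generative process above is easy to translate into code. The following is a minimal sketch, assuming a standard normal factorial prior and fixed, randomly chosen values for the hypothetical parameters W, b and the noise scale; it is meant only to illustrate ancestral sampling from a linear factor model.

```python
import numpy as np

rng = np.random.default_rng(0)

d, D = 3, 5                      # latent dimension and observed dimension
W = rng.normal(size=(D, d))      # hypothetical factor loading matrix
b = rng.normal(size=D)           # hypothetical bias
sigma = 0.1                      # noise standard deviation

# Sample the factors h from a factorial (here: standard normal) prior.
h = rng.normal(size=d)

# Sample the observation: a linear transform of h plus diagonal Gaussian noise.
x = W @ h + b + sigma * rng.normal(size=D)
print(x)
```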

Probabilistic PCA and Factor Analysis

  • Probabilistic PCA (principal components analysis), factor analysis and other linear factor models are special cases of the above equations and only differ in the choices made for the noise distribution and the model’s prior over latent variables $\boldsymbol{h}$ before observing $\boldsymbol{x}$.

  • In factor analysis (a classical technique for explaining many observed variables in terms of a smaller number of latent factors) (Bartholomew, 1987; Basilevsky, 1994), the latent variable prior is just the unit variance Gaussian
    $\mathbf{h} \sim \mathcal{N}(\boldsymbol{h}; \mathbf{0}, \boldsymbol{I})$
    while the observed variables $x_{i}$ are assumed to be conditionally independent given $\boldsymbol{h}$. Specifically, the noise is assumed to be drawn from a diagonal covariance Gaussian distribution, with covariance matrix $\boldsymbol{\psi} = \operatorname{diag}(\boldsymbol{\sigma}^{2})$, where $\boldsymbol{\sigma}^{2} = [\sigma_{1}^{2}, \sigma_{2}^{2}, \ldots, \sigma_{n}^{2}]^{\top}$ is a vector of per-variable variances.

  • The role of the latent variables is thus to capture the dependencies between the different observed variables $x_{i}$. Indeed, it can easily be shown that $\boldsymbol{x}$ is just a multivariate normal random variable, with
    $\mathbf{x} \sim \mathcal{N}\left(\boldsymbol{x}; \boldsymbol{b}, \boldsymbol{W}\boldsymbol{W}^{\top} + \boldsymbol{\psi}\right)$
    as the numerical check below illustrates.
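
  • This marginal covariance identity is easy to verify numerically. A minimal Monte Carlo sketch, assuming small hypothetical dimensions and randomly chosen parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, n = 2, 4, 200_000

W = rng.normal(size=(D, d))
b = rng.normal(size=D)
sigma2 = rng.uniform(0.1, 0.5, size=D)   # per-variable noise variances

# Ancestral sampling from the factor analysis model.
H = rng.normal(size=(n, d))
X = H @ W.T + b + rng.normal(size=(n, D)) * np.sqrt(sigma2)

# The empirical covariance of x should approach W W^T + diag(sigma^2).
emp = np.cov(X, rowvar=False)
theory = W @ W.T + np.diag(sigma2)
print(np.max(np.abs(emp - theory)))      # small (sampling error only)
```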

  • In order to cast PCA in a probabilistic framework, we can make a slight modification to the factor analysis model, making the conditional variances $\sigma_{i}^{2}$ equal to each other. In that case the covariance of $\boldsymbol{x}$ is just $\boldsymbol{W}\boldsymbol{W}^{\top} + \sigma^{2}\boldsymbol{I}$, where $\sigma^{2}$ is now a scalar. This yields the distribution
    $\mathbf{x} \sim \mathcal{N}\left(\boldsymbol{x}; \boldsymbol{b}, \boldsymbol{W}\boldsymbol{W}^{\top} + \sigma^{2}\boldsymbol{I}\right)$
    or equivalently
    $\mathbf{x} = \boldsymbol{W}\mathbf{h} + \boldsymbol{b} + \sigma\mathbf{z}$
    where $\mathbf{z} \sim \mathcal{N}(\boldsymbol{z}; \mathbf{0}, \boldsymbol{I})$ is Gaussian noise. Tipping and Bishop (1999) then show an iterative EM algorithm for estimating the parameters $\boldsymbol{W}$ and $\sigma^{2}$. A short sketch of fitting these models with off-the-shelf estimators is given below.
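
  • As a rough, non-authoritative illustration of fitting these models in practice, the following sketch assumes scikit-learn is available: FactorAnalysis estimates the loading matrix and the per-variable noise variances, while PCA's noise_variance_ attribute reports the probabilistic-PCA maximum-likelihood estimate of $\sigma^{2}$ (the mean of the discarded eigenvalues).

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic data drawn from a hypothetical 2-factor linear model in 6 dimensions.
W_true = rng.normal(size=(6, 2))
H = rng.normal(size=(5000, 2))
X = H @ W_true.T + 0.1 * rng.normal(size=(5000, 6))

fa = FactorAnalysis(n_components=2).fit(X)
print(fa.components_.shape)      # (2, 6): estimated loading matrix (transposed W)
print(fa.noise_variance_)        # per-variable noise variances (the diagonal of psi)

ppca = PCA(n_components=2).fit(X)
print(ppca.noise_variance_)      # scalar sigma^2 estimate from the probabilistic PCA model
```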

  • This probabilistic PCA model takes advantage of the observation that most variations in the data can be captured by the latent variables $\boldsymbol{h}$, up to some small residual reconstruction error $\sigma^{2}$. As shown by Tipping and Bishop (1999), probabilistic PCA becomes PCA as $\sigma \rightarrow 0$. In that case, the conditional expected value of $\boldsymbol{h}$ given $\boldsymbol{x}$ becomes an orthogonal projection of $\boldsymbol{x} - \boldsymbol{b}$ onto the space spanned by the $d$ columns of $\boldsymbol{W}$, like in PCA.

  • As $\sigma \rightarrow 0$, the density model defined by probabilistic PCA becomes very sharp around these $d$ dimensions spanned by the columns of $\boldsymbol{W}$. This can make the model assign very low likelihood to the data if the data does not actually cluster near a hyperplane.

Independent Component Analysis (ICA)

  • Independent component analysis (ICA) is among the oldest representation learning algorithms (Herault and Ans, 1984; Jutten and Herault, 1991; Comon, 1994; Hyvärinen, 1999; Hyvärinen et al., 2001a; Hinton et al., 2001; Teh et al., 2003). It is an approach to modeling linear factors that seeks to separate an observed signal into many underlying signals that are scaled and added together to form the observed data. These signals are intended to be fully independent, rather than merely decorrelated from each other.

  • Many different specific methodologies are referred to as ICA. The variant most similar to the other generative models we have described here (Pham et al., 1992) trains a fully parametric generative model. The prior distribution over the underlying factors, $p(\boldsymbol{h})$, must be fixed ahead of time by the user. The model then deterministically generates $\boldsymbol{x} = \boldsymbol{W}\boldsymbol{h}$. We can perform a nonlinear change of variables (using $p_{x}(\boldsymbol{x}) = p_{y}(g(\boldsymbol{x}))\left|\det\left(\frac{\partial g(\boldsymbol{x})}{\partial \boldsymbol{x}}\right)\right|$) to determine $p(\boldsymbol{x})$. Learning the model then proceeds as usual, using maximum likelihood.

  • The motivation for this approach is that by choosing $p(\boldsymbol{h})$ to be independent, we can recover underlying factors that are as close as possible to independent. This is commonly used, not to capture high-level abstract causal factors, but to recover low-level signals that have been mixed together. In this setting, each training example is one moment in time, each $x_{i}$ is one sensor’s observation of the mixed signals, and each $h_{i}$ is one estimate of one of the original signals. For example, we might have $n$ people speaking simultaneously. If we have $n$ different microphones placed in different locations, ICA can detect the changes in the volume between each speaker as heard by each microphone, and separate the signals so that each $h_{i}$ contains only one person speaking clearly.
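
  • As an illustration of this source separation use case, here is a minimal sketch using scikit-learn's FastICA estimator (a different ICA variant from the maximum-likelihood one described above); the source signals and mixing matrix are hypothetical.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two hypothetical independent source signals (the "speakers").
s1 = np.sin(2 * t)                      # sinusoidal source
s2 = np.sign(np.sin(3 * t))             # square-wave source
S = np.c_[s1, s2] + 0.05 * rng.normal(size=(2000, 2))

# Mix the sources: each column of X is one "microphone".
A = np.array([[1.0, 0.5],
              [0.4, 1.2]])
X = S @ A.T

# Recover estimates of the sources (up to permutation and scaling).
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)
print(S_est.shape)                      # (2000, 2): one estimated source per column
```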

  • This is commonly used in neuroscience for electroencephalography (EEG), a technology for recording electrical signals originating in the brain. Many electrode sensors placed on the subject’s head are used to measure many electrical signals coming from the body. The experimenter is typically only interested in signals from the brain, but signals from the subject’s heart and eyes are strong enough to confound measurements taken at the subject’s scalp. The signals arrive at the electrodes mixed together, so ICA is necessary to separate the electrical signature of the heart from the signals originating in the brain, and to separate signals in different brain regions from each other.

  • As mentioned before, many variants of ICA are possible. Some add some noise in the generation of $\boldsymbol{x}$ rather than using a deterministic decoder. Most do not use the maximum likelihood criterion, but instead aim to make the elements of $\boldsymbol{h} = \boldsymbol{W}^{-1}\boldsymbol{x}$ independent from each other. Many criteria that accomplish this goal are possible. Maximum likelihood learning via the change-of-variables formula above requires taking the determinant of $\boldsymbol{W}$, which can be an expensive and numerically unstable operation. Some variants of ICA avoid this problematic operation by constraining $\boldsymbol{W}$ to be orthogonal.

  • All variants of ICA require that $p(\boldsymbol{h})$ be non-Gaussian. This is because if $p(\boldsymbol{h})$ is an independent prior with Gaussian components, then $\boldsymbol{W}$ is not identifiable: we can obtain the same distribution over $\boldsymbol{x}$ for many values of $\boldsymbol{W}$. This is very different from other linear factor models like probabilistic PCA and factor analysis, which often require $p(\boldsymbol{h})$ to be Gaussian in order to make many operations on the model have closed form solutions. In the maximum likelihood approach where the user explicitly specifies the distribution, a typical choice is to use $p(h_{i}) = \frac{d}{d h_{i}} \sigma(h_{i})$, where $\sigma$ is the logistic sigmoid. Typical choices of these non-Gaussian distributions have larger peaks near 0 than does the Gaussian distribution, so we can also see most implementations of ICA as learning sparse features.

  • Many variants of ICA are not generative models in the sense that we use the phrase.
    In this book, a generative model either represents $p(\boldsymbol{x})$ or can draw samples from it. Many variants of ICA only know how to transform between $\boldsymbol{x}$ and $\boldsymbol{h}$, but do not have any way of representing $p(\boldsymbol{h})$, and thus do not impose a distribution over $\boldsymbol{x}$. For example, many ICA variants aim to increase the sample kurtosis of $\boldsymbol{h} = \boldsymbol{W}^{-1}\boldsymbol{x}$, because high kurtosis indicates that $p(\boldsymbol{h})$ is non-Gaussian, but this is accomplished without explicitly representing $p(\boldsymbol{h})$. This is because ICA is more often used as an analysis tool for separating signals than as a means of generating data or estimating its density.

  • Just as PCA can be generalized to the nonlinear autoencoders described in chapter 14, ICA can be generalized to a nonlinear generative model, in which we use a nonlinear function $f$ to generate the observed data. See Hyvärinen and Pajunen (1999) for the initial work on nonlinear ICA and its successful use with ensemble learning by Roberts and Everson (2001) and Lappalainen et al. (2000).
    Another nonlinear extension of ICA is the approach of nonlinear independent components estimation, or NICE (Dinh et al., 2014), which stacks a series of invertible transformations (encoder stages) that have the property that the determinant of the Jacobian of each transformation can be computed efficiently. This makes it possible to compute the likelihood exactly and, like ICA, NICE attempts to transform the data into a space where it has a factorized marginal distribution, but is more likely to succeed thanks to the nonlinear encoder. Because the encoder is associated with a decoder that is its perfect inverse, it is straightforward to generate samples from the model (by first sampling from $p(\boldsymbol{h})$ and then applying the decoder). A simplified coupling-layer sketch is given below.
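
  • To make the invertible-transformation idea concrete, here is a simplified numpy sketch of an additive coupling layer in the spirit of NICE (not the reference implementation): one half of the input passes through unchanged and is used to compute an additive shift for the other half, so the transform is trivially invertible and its Jacobian has unit determinant. The shift function here is a hypothetical stand-in for a learned network.

```python
import numpy as np

def coupling_forward(x, shift_fn):
    """Additive coupling: split x, shift the second half by a function of the first."""
    x1, x2 = np.split(x, 2, axis=-1)
    y1 = x1
    y2 = x2 + shift_fn(x1)          # Jacobian is unit lower-triangular, so |det| = 1
    return np.concatenate([y1, y2], axis=-1)

def coupling_inverse(y, shift_fn):
    y1, y2 = np.split(y, 2, axis=-1)
    x1 = y1
    x2 = y2 - shift_fn(y1)          # exact inverse of the forward pass
    return np.concatenate([x1, x2], axis=-1)

# Hypothetical nonlinear shift function (a learned network in the real model).
shift = lambda z: np.tanh(z @ np.full((2, 2), 0.5))

x = np.random.default_rng(0).normal(size=(4, 4))
y = coupling_forward(x, shift)
print(np.allclose(coupling_inverse(y, shift), x))   # True: perfectly invertible
```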

  • Another generalization of ICA is to learn groups of features, with statistical dependence allowed within a group but discouraged between groups (Hyvärinen and Hoyer, 1999; Hyvärinen et al., 2001b). When the groups of related units are chosen to be non-overlapping, this is called independent subspace analysis. It is also possible to assign spatial coordinates to each hidden unit and form overlapping groups of spatially neighboring units. This encourages nearby units to learn similar features. When applied to natural images, this topographic ICA approach learns Gabor filters, such that neighboring features have similar orientation, location or frequency. Many different phase offsets of similar Gabor functions occur within each region, so that pooling over small regions yields translation invariance.

Slow Feature Analysis

  • Slow feature analysis (SFA) is a linear factor model that uses information from time signals to learn invariant features (Wiskott and Sejnowski, 2002). Slow feature analysis is motivated by a general principle called the slowness principle. The idea is that the important characteristics of scenes change very slowly compared to the individual measurements that make up a description of a scene. For example, in computer vision, individual pixel values can change very rapidly. If a zebra moves from left to right across the image, an individual pixel will rapidly change from black to white and back again as the zebra’s stripes pass over the pixel. By comparison, the feature indicating whether a zebra is in the image will not change at all, and the feature describing the zebra’s position will change slowly. We therefore may wish to regularize our model to learn features that change slowly over time.

  • The slowness principle predates slow feature analysis and has been applied to a wide variety of models (Hinton, 1989; Földiák, 1989; Mobahi et al., 2009; Bergstra and Bengio, 2009). In general, we can apply the slowness principle to any differentiable model trained with gradient descent. The slowness principle may be introduced by adding a term to the cost function of the form
    $\lambda \sum_{t} L\left(f\left(\boldsymbol{x}^{(t+1)}\right), f\left(\boldsymbol{x}^{(t)}\right)\right)$
    where $\lambda$ is a hyperparameter determining the strength of the slowness regularization term, $t$ is the index into a time sequence of examples, $f$ is the feature extractor to be regularized, and $L$ is a loss function measuring the distance between $f(\boldsymbol{x}^{(t)})$ and $f(\boldsymbol{x}^{(t+1)})$. A common choice for $L$ is the mean squared difference; a small code sketch of this penalty follows below.
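
  • As a minimal illustration (taking L to be the mean squared difference), the regularizer can be computed from a sequence of feature vectors as follows; the feature trajectory used here is an arbitrary placeholder.

```python
import numpy as np

def slowness_penalty(features, lam=1.0):
    """Slowness regularizer: lam times the sum over t of the mean squared
    difference between consecutive feature vectors f(x^(t)) and f(x^(t+1))."""
    diffs = features[1:] - features[:-1]
    return lam * np.sum(np.mean(diffs ** 2, axis=1))

# Hypothetical feature trajectory: 100 time steps, 8 features each.
F = np.cumsum(np.random.default_rng(0).normal(size=(100, 8)), axis=0)
print(slowness_penalty(F, lam=0.1))
```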

  • Slow feature analysis is a particularly efficient application of the slowness principle. It is efficient because it is applied to a linear feature extractor, and can thus be trained in closed form. Like some variants of ICA, SFA is not quite a generative model per se, in the sense that it defines a linear map between input space and feature space but does not define a prior over feature space and thus does not impose a distribution $p(\boldsymbol{x})$ on input space.

  • The SFA algorithm (Wiskott and Sejnowski, 2002) consists of defining $f(\boldsymbol{x}; \boldsymbol{\theta})$ to be a linear transformation, and solving the optimization problem
    $\min_{\boldsymbol{\theta}} \mathbb{E}_{t}\left(f\left(\boldsymbol{x}^{(t+1)}\right)_{i} - f\left(\boldsymbol{x}^{(t)}\right)_{i}\right)^{2}$
    subject to the constraints
    $\mathbb{E}_{t} f\left(\boldsymbol{x}^{(t)}\right)_{i} = 0$
    and
    $\mathbb{E}_{t}\left[f\left(\boldsymbol{x}^{(t)}\right)_{i}^{2}\right] = 1$

  • The constraint that the learned feature have zero mean is necessary to make the problem have a unique solution; otherwise we could add a constant to all feature values and obtain a different solution with equal value of the slowness objective. The constraint that the features have unit variance is necessary to prevent the pathological solution where all features collapse to 0. Like PCA, the SFA features are ordered, with the first feature being the slowest. To learn multiple features, we must also add the constraint
    $\forall i < j, \quad \mathbb{E}_{t}\left[f\left(\boldsymbol{x}^{(t)}\right)_{i} f\left(\boldsymbol{x}^{(t)}\right)_{j}\right] = 0$

  • This specifies that the learned features must be linearly decorrelated from each other. Without this constraint, all of the learned features would simply capture the one slowest signal.
    One could imagine using other mechanisms, such as minimizing reconstruction error, to force the features to diversify, but this decorrelation mechanism admits a simple solution due to the linearity of SFA features. The SFA problem may be solved in closed form by a linear algebra package, as sketched below.
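
  • The following is a minimal numpy sketch of one way to solve linear SFA in closed form, under simplifying assumptions (no basis expansion, batch whitening via eigendecomposition): whitening enforces the zero-mean, unit-variance and decorrelation constraints, and the slowest features are the directions in the whitened space whose temporal differences have the smallest variance.

```python
import numpy as np

def linear_sfa(X, n_features):
    """Closed-form linear SFA. X has shape (time, dims); returns slow features and weights."""
    Xc = X - X.mean(axis=0)

    # Whiten the centered data so the unit-variance / decorrelation constraints hold.
    cov = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    whiten = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ whiten

    # Minimize the average squared temporal difference in the whitened space.
    dZ = np.diff(Z, axis=0)
    d_evals, d_evecs = np.linalg.eigh(np.cov(dZ, rowvar=False))

    # Eigenvectors with the smallest eigenvalues give the slowest features.
    W = d_evecs[:, :n_features]
    return Z @ W, whiten @ W

# Hypothetical data: a slow sinusoid hidden among faster, noisier channels.
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 1000)
X = np.c_[np.sin(t) + 0.1 * rng.normal(size=t.size),
          np.sin(20 * t) + 0.1 * rng.normal(size=t.size),
          rng.normal(size=t.size)]
features, W_sfa = linear_sfa(X, n_features=1)
print(features.shape)   # (1000, 1): the slowest learned feature
```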

  • SFA is typically used to learn nonlinear features by applying a nonlinear basis expansion to $\boldsymbol{x}$ before running SFA. For example, it is common to replace $\boldsymbol{x}$ by the quadratic basis expansion, a vector containing elements $x_{i} x_{j}$ for all $i$ and $j$. Linear SFA modules may then be composed to learn deep nonlinear slow feature extractors by repeatedly learning a linear SFA feature extractor, applying a nonlinear basis expansion to its output, and then learning another linear SFA feature extractor on top of that expansion.

  • When trained on small spatial patches of videos of natural scenes, SFA with quadratic basis expansions learns features that share many characteristics with those of complex cells in V1 cortex (Berkes and Wiskott, 2005). When trained on videos of random motion within 3-D computer rendered environments, deep SFA learns features that share many characteristics with the features represented by neurons in rat brains that are used for navigation (Franzius et al., 2007). SFA thus seems to be a reasonably biologically plausible model.

  • A major advantage of SFA is that it is possible to theoretically predict which features SFA will learn, even in the deep, nonlinear setting. To make such theoretical predictions, one must know about the dynamics of the environment in terms of configuration space (e.g., in the case of random motion in the 3-D rendered environment, the theoretical analysis proceeds from knowledge of the probability distribution over position and velocity of the camera). Given knowledge of how the underlying factors actually change, it is possible to analytically solve for the optimal functions expressing these factors. In practice, experiments with deep SFA applied to simulated data seem to recover the theoretically predicted functions.

  • This is in comparison to other learning algorithms where the cost function depends highly on specific pixel values, making it much more difficult to determine what features the model will learn.

  • Deep SFA has also been used to learn features for object recognition and pose estimation (Franzius et al., 2008). So far, the slowness principle has not become the basis for any state of the art applications. It is unclear what factor has limited its performance. We speculate that perhaps the slowness prior is too strong, and that, rather than imposing a prior that features should be approximately constant, it would be better to impose a prior that features should be easy to predict from one time step to the next. The position of an object is a useful feature regardless of whether the object’s velocity is high or low, but the slowness principle encourages the model to ignore the position of objects that have high velocity.

Sparse Coding

  • Sparse coding (Olshausen and Field, 1996) is a linear factor model that has been heavily studied as an unsupervised feature learning and feature extraction mechanism.
    Strictly speaking, the term “sparse coding” refers to the process of inferring the value of $\boldsymbol{h}$ in this model, while “sparse modeling” refers to the process of designing and learning the model, but the term “sparse coding” is often used to refer to both.

  • Like most other linear factor models, it uses a linear decoder plus noise to obtain reconstructions of $\boldsymbol{x}$, as specified in the equation $\boldsymbol{x} = \boldsymbol{W}\boldsymbol{h} + \boldsymbol{b} + \text{noise}$ above.
    More specifically, sparse coding models typically assume that the linear factors have Gaussian noise with isotropic precision $\beta$:
    $p(\boldsymbol{x} \mid \boldsymbol{h}) = \mathcal{N}\left(\boldsymbol{x}; \boldsymbol{W}\boldsymbol{h} + \boldsymbol{b}, \frac{1}{\beta}\boldsymbol{I}\right)$

  • The distribution $p(\boldsymbol{h})$ is chosen to be one with sharp peaks near 0 (Olshausen and Field, 1996). Common choices include factorized Laplace, Cauchy or factorized Student-$t$ distributions. For example, the Laplace prior parametrized in terms of the sparsity penalty coefficient $\lambda$ is given by
    $p(h_{i}) = \operatorname{Laplace}\left(h_{i}; 0, \frac{2}{\lambda}\right) = \frac{\lambda}{4} e^{-\frac{1}{2}\lambda |h_{i}|}$
    and the Student-$t$ prior by
    $p(h_{i}) \propto \frac{1}{\left(1 + \frac{h_{i}^{2}}{\nu}\right)^{\frac{\nu+1}{2}}}$

  • Training sparse coding with maximum likelihood is intractable. Instead, the training alternates between encoding the data and training the decoder to better reconstruct the data given the encoding. This approach will be justified further as a principled approximation to maximum likelihood later, in section 19.3.

  • For models such as PCA, we have seen the use of a parametric encoder function that predicts $\boldsymbol{h}$ and consists only of multiplication by a weight matrix. The encoder that we use with sparse coding is not a parametric encoder. Instead, the encoder is an optimization algorithm that solves an optimization problem in which we seek the single most likely code value:
    $\boldsymbol{h}^{*} = f(\boldsymbol{x}) = \underset{\boldsymbol{h}}{\arg\max}\ p(\boldsymbol{h} \mid \boldsymbol{x})$

  • When combined with the Gaussian likelihood $p(\boldsymbol{x} \mid \boldsymbol{h})$ and the Laplace prior $p(h_{i})$ given above, this yields the following optimization problem:
    $\underset{\boldsymbol{h}}{\arg\max}\ p(\boldsymbol{h} \mid \boldsymbol{x}) = \underset{\boldsymbol{h}}{\arg\max}\ \log p(\boldsymbol{h} \mid \boldsymbol{x}) = \underset{\boldsymbol{h}}{\arg\min}\ \lambda \|\boldsymbol{h}\|_{1} + \beta \|\boldsymbol{x} - \boldsymbol{W}\boldsymbol{h}\|_{2}^{2}$
    where we have dropped terms not depending on $\boldsymbol{h}$ and divided by positive scaling factors to simplify the equation. A minimal inference sketch using an off-the-shelf Lasso solver is given below.
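
  • Since the inference problem above is an $L^{1}$-regularized least squares (Lasso) problem in $\boldsymbol{h}$, any Lasso solver can act as the encoder. A minimal sketch, assuming scikit-learn and a hypothetical fixed dictionary W; note that scikit-learn's Lasso objective scales the squared error term, so its alpha corresponds to $\lambda/\beta$ only up to a constant factor.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
D, d = 20, 50                            # observed dimension and (overcomplete) code size

W = rng.normal(size=(D, d))              # hypothetical dictionary (decoder weights)
h_true = np.zeros(d)
h_true[[3, 17, 41]] = [1.5, -2.0, 0.7]   # a genuinely sparse code
x = W @ h_true + 0.01 * rng.normal(size=D)

# Solve argmin_h lambda*||h||_1 + beta*||x - W h||_2^2 with a Lasso solver.
# (sklearn scales the squared error by 1/(2D), so alpha ~ lambda / (2*D*beta).)
coder = Lasso(alpha=0.01, fit_intercept=False, max_iter=10_000)
coder.fit(W, x)
h_star = coder.coef_
print(np.nonzero(h_star)[0])             # mostly the indices of the true sparse code
```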

  • Due to the imposition of an $L^{1}$ norm on $\boldsymbol{h}$, this procedure will yield a sparse $\boldsymbol{h}^{*}$ (see section 7.1.2).

  • To train the model rather than just perform inference, we alternate between minimization with respect to $\boldsymbol{h}$ and minimization with respect to $\boldsymbol{W}$. In this presentation, we treat $\beta$ as a hyperparameter. Typically it is set to 1 because its role in this optimization problem is shared with $\lambda$ and there is no need for both hyperparameters. In principle, we could also treat $\beta$ as a parameter of the model and learn it. Our presentation here has discarded some terms that do not depend on $\boldsymbol{h}$ but do depend on $\beta$. To learn $\beta$, these terms must be included, or $\beta$ will collapse to 0.
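
  • This alternation between inferring codes and updating the dictionary is what scikit-learn's DictionaryLearning estimator implements, with its own choices of sparsity penalty and solver; a minimal usage sketch on hypothetical data:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)

# Hypothetical data with sparse latent structure: each sample uses 3 of 32 atoms.
D_true = rng.normal(size=(32, 16))
H_true = np.zeros((500, 32))
for row in H_true:
    row[rng.choice(32, size=3, replace=False)] = rng.normal(size=3)
X = H_true @ D_true + 0.01 * rng.normal(size=(500, 16))

# Alternates between sparse coding of the data (the h step) and dictionary updates (the W step).
dico = DictionaryLearning(n_components=32, alpha=0.5,
                          transform_algorithm='lasso_lars', random_state=0)
codes = dico.fit(X).transform(X)

print(dico.components_.shape)   # (32, 16): learned dictionary atoms
print(np.mean(codes == 0))      # most code entries are exactly zero
```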

  • Not all approaches to sparse coding explicitly build a $p(\boldsymbol{h})$ and a $p(\boldsymbol{x} \mid \boldsymbol{h})$. Often we are just interested in learning a dictionary of features with activation values that will often be zero when extracted using this inference procedure.

  • If we sample $\boldsymbol{h}$ from a Laplace prior, it is in fact a zero probability event for an element of $\boldsymbol{h}$ to actually be zero. The generative model itself is not especially sparse, only the feature extractor is. Goodfellow et al. (2013d) describe approximate inference in a different model family, the spike and slab sparse coding model, for which samples from the prior usually contain true zeros.

  • The sparse coding approach combined with the use of the non-parametric encoder can in principle minimize the combination of reconstruction error and log-prior better than any specific parametric encoder. Another advantage is that there is no generalization error to the encoder. A parametric encoder must learn how to map $\boldsymbol{x}$ to $\boldsymbol{h}$ in a way that generalizes. For unusual $\boldsymbol{x}$ that do not resemble the training data, a learned, parametric encoder may fail to find an $\boldsymbol{h}$ that results in accurate reconstruction or a sparse code. For the vast majority of formulations of sparse coding models, where the inference problem is convex, the optimization procedure will always find the optimal code (unless degenerate cases such as replicated weight vectors occur). Obviously, the sparsity and reconstruction costs can still rise on unfamiliar points, but this is due to generalization error in the decoder weights, rather than generalization error in the encoder. The lack of generalization error in sparse coding’s optimization-based encoding process may result in better generalization when sparse coding is used as a feature extractor for a classifier than when a parametric function is used to predict the code. Coates and Ng (2011) demonstrated that sparse coding features generalize better for object recognition tasks than the features of a related model based on a parametric encoder, the linear-sigmoid autoencoder. Inspired by their work, Goodfellow et al. (2013d) showed that a variant of sparse coding generalizes better than other feature extractors in the regime where extremely few labels are available (twenty or fewer labels per class).

  • The primary disadvantage of the non-parametric encoder is that it requires greater time to compute $\boldsymbol{h}$ given $\boldsymbol{x}$ because the non-parametric approach requires running an iterative algorithm. The parametric autoencoder approach, developed in chapter 14, uses only a fixed number of layers, often only one. Another disadvantage is that it is not straightforward to back-propagate through the non-parametric encoder, which makes it difficult to pretrain a sparse coding model with an unsupervised criterion and then fine-tune it using a supervised criterion. Modified versions of sparse coding that permit approximate derivatives do exist but are not widely used (Bagnell and Bradley, 2009).

  • Sparse coding, like other linear factor models, often produces poor samples, as shown in figure 13.2. This happens even when the model is able to reconstruct the data well and provide useful features for a classifier. The reason is that each individual feature may be learned well, but the factorial prior on the hidden code results in the model including random subsets of all of the features in each generated sample. This motivates the development of deeper models that can impose a non-factorial distribution on the deepest code layer, as well as the development of more sophisticated shallow models.

Manifold Interpretation of PCA

  • Linear factor models including PCA and factor analysis can be interpreted as learning a manifold (Hinton et al., 1997). We can view probabilistic PCA as defining a thin pancake-shaped region of high probability: a Gaussian distribution that is very narrow along some axes, just as a pancake is very flat along its vertical axis, but is elongated along other axes, just as a pancake is wide along its horizontal axes. This is illustrated in the figure below. PCA can be interpreted as aligning this pancake with a linear manifold in a higher-dimensional space.
    [Figure: a pancake-shaped Gaussian distribution aligned with a linear manifold]

  • This interpretation applies not just to traditional PCA but also to any linear autoencoder that learns matrices $\boldsymbol{W}$ and $\boldsymbol{V}$ with the goal of making the reconstruction of $\boldsymbol{x}$ lie as close to $\boldsymbol{x}$ as possible.

  • Let the encoder be
    $\boldsymbol{h} = f(\boldsymbol{x}) = \boldsymbol{W}^{\top}(\boldsymbol{x} - \boldsymbol{\mu})$

  • The encoder computes a low-dimensional representation $\boldsymbol{h}$ of $\boldsymbol{x}$. With the autoencoder view, we have a decoder computing the reconstruction
    $\hat{\boldsymbol{x}} = g(\boldsymbol{h}) = \boldsymbol{b} + \boldsymbol{V}\boldsymbol{h}$

  • The choices of linear encoder and decoder that minimize reconstruction error
    $\mathbb{E}\left[\|\boldsymbol{x} - \hat{\boldsymbol{x}}\|^{2}\right]$
    correspond to $\boldsymbol{V} = \boldsymbol{W}$, $\boldsymbol{\mu} = \boldsymbol{b} = \mathbb{E}[\boldsymbol{x}]$, with the columns of $\boldsymbol{W}$ forming an orthonormal basis that spans the same subspace as the principal eigenvectors of the covariance matrix
    $\boldsymbol{C} = \mathbb{E}\left[(\boldsymbol{x} - \boldsymbol{\mu})(\boldsymbol{x} - \boldsymbol{\mu})^{\top}\right]$

  • In the case of PCA, the columns of $\boldsymbol{W}$ are these eigenvectors, ordered by the magnitude of the corresponding eigenvalues (which are all real and non-negative). One can also show that eigenvalue $\lambda_{i}$ of $\boldsymbol{C}$ corresponds to the variance of $\boldsymbol{x}$ in the direction of eigenvector $\boldsymbol{v}^{(i)}$. If $\boldsymbol{x} \in \mathbb{R}^{D}$ and $\boldsymbol{h} \in \mathbb{R}^{d}$ with $d < D$, then the optimal reconstruction error (choosing $\boldsymbol{\mu}$, $\boldsymbol{b}$, $\boldsymbol{V}$ and $\boldsymbol{W}$ as above) is
    $\min \mathbb{E}\left[\|\boldsymbol{x} - \hat{\boldsymbol{x}}\|^{2}\right] = \sum_{i=d+1}^{D} \lambda_{i}$
    as verified numerically in the sketch below.
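
  • The identity above is easy to check numerically. A minimal sketch using numpy: the encoder/decoder is built from the top-$d$ eigenvectors of the empirical covariance, and the mean squared reconstruction error is compared with the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 6, 2, 100_000

# Hypothetical data with anisotropic covariance.
A = rng.normal(size=(D, D))
X = rng.normal(size=(n, D)) @ A.T

mu = X.mean(axis=0)
C = np.cov(X - mu, rowvar=False)
evals, evecs = np.linalg.eigh(C)            # eigenvalues in ascending order
W = evecs[:, ::-1][:, :d]                   # top-d principal eigenvectors

# Linear autoencoder with V = W, b = mu.
H = (X - mu) @ W
X_hat = mu + H @ W.T

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, evals[:-d].sum())                # both ~ sum of the D - d smallest eigenvalues
```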

  • Hence, if the covariance has rank $d$, the eigenvalues $\lambda_{d+1}$ to $\lambda_{D}$ are 0 and reconstruction error is 0.

  • Furthermore, one can also show that the above solution can be obtained by maximizing the variances of the elements of $\boldsymbol{h}$, under orthogonal $\boldsymbol{W}$, instead of minimizing reconstruction error.

  • Linear factor models are some of the simplest generative models and some of the simplest models that learn a representation of data.

  • Much as linear classifiers and linear regression models may be extended to deep feedforward networks, these linear factor models may be extended to autoencoder networks and deep probabilistic models that perform the same tasks but with a much more powerful and flexible model family.