CONTENTS

Learning Algorithms

  • A machine learning algorithm is an algorithm that is able to learn from data. But what do we mean by learning? Mitchell (1997) provides the definition: "A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$."

The Task, $T$

  • In this relatively formal definition of the word “task,” the process of learning itself is not the task.
    Machine learning tasks are usually described in terms of how the machine learning system should process an example.
    An example is a collection of features that have been quantitatively measured from some object or event that we want the machine learning system to process. We typically represent an example as a vector $\boldsymbol{x} \in \mathbb{R}^{n}$ where each entry $x_{i}$ of the vector is another feature. For example, the features of an image are usually the values of the pixels in the image.
    Many kinds of tasks can be solved with machine learning. Some of the most common machine learning tasks include the following:

  • Classification: In this type of task, the computer program is asked to specify which of $k$ categories some input belongs to. To solve this task, the learning algorithm is usually asked to produce a function $f: \mathbb{R}^{n} \rightarrow\{1, \ldots, k\}$. When $y=f(\boldsymbol{x})$, the model assigns an input described by vector $\boldsymbol{x}$ to a category identified by numeric code $y$. There are other variants of the classification task, for example, where $f$ outputs a probability distribution over classes.
  • Classification with missing inputs: Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided. In order to solve the classification task, the learning algorithm only has to define a single function mapping from a vector input to a categorical output. When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions. Each function corresponds to classifying $\boldsymbol{x}$ with a different subset of its inputs missing. This kind of situation arises frequently in medical diagnosis, because many kinds of medical tests are expensive or invasive. One way to efficiently define such a large set of functions is to learn a probability distribution over all of the relevant variables, then solve the classification task by marginalizing out the missing variables.
  • Regression: In this type of task, the computer program is asked to predict a numerical value given some input. To solve this task, the learning algorithm is asked to output a function $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$. This type of task is similar to classification, except that the format of the output is different.
  • Transcription: In this type of task, the machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe it into discrete, textual form. For example, in optical character recognition, the computer program is shown a photograph containing an image of text and is asked to return this text in the form of a sequence of characters (e.g., in ASCII or Unicode format).
  • Machine translation: In a machine translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language.
  • Structured output: Structured output tasks involve any task where the output is a vector (or other data structure containing multiple values) with important relationships between the different elements. This is a broad category that subsumes the transcription and translation tasks described above, as well as many other tasks. One example is parsing: mapping a natural language sentence into a tree that describes its grammatical structure, and tagging the nodes of the tree as verbs, nouns, adverbs, and so on.
  • Anomaly detection: In this type of task, the computer program sifts through a set of events or objects, and flags some of them as being unusual or atypical. An example of an anomaly detection task is credit card fraud detection.
  • Synthesis and sampling: In this type of task, the machine learning algorithm is asked to generate new examples that are similar to those in the training data.
  • Denoising: In this type of task, the machine learning algorithm is given as input a corrupted example $\tilde{\boldsymbol{x}} \in \mathbb{R}^{n}$ obtained by an unknown corruption process from a clean example $\boldsymbol{x} \in \mathbb{R}^{n}$. The learner must predict the clean example $\boldsymbol{x}$ from its corrupted version $\tilde{\boldsymbol{x}}$, or more generally predict the conditional probability distribution $p(\boldsymbol{x} \mid \tilde{\boldsymbol{x}})$.
  • Density estimation or probability mass function estimation: In the density estimation problem, the machine learning algorithm is asked to learn a function $p_{\text{model}}: \mathbb{R}^{n} \rightarrow \mathbb{R}$, where $p_{\text{model}}(\boldsymbol{x})$ can be interpreted as a probability density function (if $\mathrm{x}$ is continuous) or a probability mass function (if $\mathrm{x}$ is discrete) on the space that the examples were drawn from. To do such a task well (we will specify exactly what that means when we discuss performance measures $P$), the algorithm needs to learn the structure of the data it has seen. It must know where examples cluster tightly and where they are unlikely to occur. Most of the tasks described above require the learning algorithm to at least implicitly capture the structure of the probability distribution. Density estimation allows us to explicitly capture that distribution. In principle, we can then perform computations on that distribution in order to solve the other tasks as well.
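
The marginalization idea from the "classification with missing inputs" item can be sketched on a toy problem. This is only a minimal sketch: it assumes two binary features and a binary class, and the joint table `joint[x1, x2, y]` and the helper `classify` are hypothetical, with values chosen purely for illustration.

```python
import numpy as np

# Hypothetical joint distribution p(x1, x2, y) over two binary features
# and a binary class, indexed as joint[x1, x2, y]. Values are illustrative.
joint = np.array([
    [[0.20, 0.05], [0.10, 0.05]],
    [[0.05, 0.15], [0.05, 0.35]],
])
assert np.isclose(joint.sum(), 1.0)


def classify(x1=None, x2=None):
    """Return p(y | observed features), marginalizing out missing ones."""
    p = joint
    # Axis 0 is x1: either select the observed value or sum it out.
    p = p[x1] if x1 is not None else p.sum(axis=0)
    # After handling x1, axis 0 of the remaining array is x2.
    p = p[x2] if x2 is not None else p.sum(axis=0)
    return p / p.sum()  # normalize over y


p_full = classify(x1=1, x2=0)   # both features observed
p_missing = classify(x1=1)      # x2 missing, so it is marginalized out
```

A single learned distribution stands in for the whole set of classifiers here: each subset of missing inputs corresponds to a different call to `classify`, rather than to a separately trained function.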

The Performance Measure, $P$


  • In order to evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance. Usually this performance measure $P$ is specific to the task $T$ being carried out by the system.
  • The choice of performance measure may seem straightforward and objective, but it is often difficult to choose a performance measure that corresponds well to the desired behavior of the system.

The Experience, $E$



Machine learning algorithms can be broadly categorized as unsupervised or supervised by what kind of experience they are allowed to have during the learning process.



Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset. In the context of deep learning, we usually want to learn the entire probability distribution that generated a dataset, whether explicitly as in density estimation or implicitly for tasks like synthesis or denoising. Some other unsupervised learning algorithms perform other roles, like clustering, which consists of dividing the dataset into clusters of similar examples.



Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target.



Roughly speaking, unsupervised learning involves observing several examples of a random vector $\mathbf{x}$, and attempting to implicitly or explicitly learn the probability distribution $p(\mathbf{x})$, or some interesting properties of that distribution, while supervised learning involves observing several examples of a random vector $\mathbf{x}$ and an associated value or vector $\mathbf{y}$, and learning to predict $\mathbf{y}$ from $\mathbf{x}$, usually by estimating $p(\mathbf{y} \mid \mathbf{x})$.



Unsupervised learning and supervised learning are not formally defined terms. The lines between them are often blurred. Many machine learning technologies can be used to perform both tasks. For example, the chain rule of probability states that for a vector $\mathbf{x} \in \mathbb{R}^{n}$, the joint distribution can be decomposed as
$$p(\mathbf{x})=\prod_{i=1}^{n} p\left(\mathrm{x}_{i} \mid \mathrm{x}_{1}, \ldots, \mathrm{x}_{i-1}\right)$$
This decomposition means that we can solve the ostensibly unsupervised problem of modeling $p(\mathbf{x})$ by splitting it into $n$ supervised learning problems. Alternatively, we can solve the supervised learning problem of learning $p(y \mid \mathbf{x})$ by using traditional unsupervised learning technologies to learn the joint distribution $p(\mathbf{x}, y)$ and inferring
$$p(y \mid \mathbf{x})=\frac{p(\mathbf{x}, y)}{\sum_{y^{\prime}} p\left(\mathbf{x}, y^{\prime}\right)}$$
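
Both identities can be checked numerically on a small discrete distribution. The sketch below assumes a randomly generated joint table over three binary variables (a made-up stand-in for a learned model) and treats the last variable as the label $y$ for the second identity; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint p(x1, x2, x3) over three binary variables.
joint = rng.random((2, 2, 2))
joint /= joint.sum()

# Chain rule factors: p(x1), p(x2 | x1), p(x3 | x1, x2).
p_x1 = joint.sum(axis=(1, 2))                   # shape (2,)
p_x1x2 = joint.sum(axis=2)                      # shape (2, 2)
p_x2_given_x1 = p_x1x2 / p_x1[:, None]          # shape (2, 2)
p_x3_given_x1x2 = joint / p_x1x2[:, :, None]    # shape (2, 2, 2)

# Multiplying the factors back together recovers the joint.
rebuilt = p_x1[:, None, None] * p_x2_given_x1[:, :, None] * p_x3_given_x1x2
chain_rule_holds = np.allclose(rebuilt, joint)

# Treat x3 as the label y: p(y | x) = p(x, y) / sum_y' p(x, y').
p_y_given_x = joint / joint.sum(axis=2, keepdims=True)
```

Each factor such as `p_x3_given_x1x2` is the kind of conditional distribution a supervised learner would estimate, which is what makes the decomposition a bridge between the two settings.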



Traditionally, people refer to regression, classification and structured output problems as supervised learning. Density estimation in support of other tasks is usually considered unsupervised learning.



Some machine learning algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms interact with an environment, so there is a feedback loop between the learning system and its experiences.



One common way of describing a dataset is with a design matrix. A design matrix is a matrix containing a different example in each row. Each column of the matrix corresponds to a different feature. For instance, the Iris dataset contains 150 examples with four features for each example. This means we can represent the dataset with a design matrix $\boldsymbol{X} \in \mathbb{R}^{150 \times 4}$, where $X_{i, 1}$ is the sepal length of plant $i$, $X_{i, 2}$ is the sepal width of plant $i$, etc. We will describe most of the learning algorithms in this book in terms of how they operate on design matrix datasets. Of course, to describe a dataset as a design matrix, it must be possible to describe each example as a vector, and each of these vectors must be the same size. This is not always possible. For example, if you have a collection of photographs with different widths and heights, then different photographs will contain different numbers of pixels, so not all of the photographs can be described with vectors of the same length.



In the case of supervised learning, the example contains a label or target as well as a collection of features. For example, if we want to use a learning algorithm to perform object recognition from photographs, we need to specify which object appears in each of the photos. We might do this with a numeric code, with 0 signifying a person, 1 signifying a car, 2 signifying a cat, etc. Often when working with a dataset containing a design matrix of feature observations $\boldsymbol{X}$, we also provide a vector of labels $\boldsymbol{y}$, with $y_{i}$ providing the label for example $i$.
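
A design matrix and its label vector might look like this in NumPy. This is a small sketch with three illustrative flower measurements rather than the 150 examples of Iris; the names `X`, `y`, and the label coding are assumptions made for the example.

```python
import numpy as np

# Each row is one example; each column is one feature.
# Columns: sepal length, sepal width, petal length, petal width.
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
])

# One numeric label code per example (hypothetical coding of three classes).
y = np.array([0, 1, 2])

num_examples, num_features = X.shape
sepal_width_of_plant_1 = X[1, 1]   # zero-based: second plant, second feature

# Examples of different sizes cannot be stacked into one design matrix.
photo_a = np.zeros(4 * 4)    # a flattened 4x4 image
photo_b = np.zeros(6 * 5)    # a flattened 6x5 image
try:
    np.stack([photo_a, photo_b])
    stackable = True
except ValueError:
    stackable = False
```

Note that NumPy indexing is zero-based, so the text's $X_{i,1}$ (sepal length of plant $i$) corresponds to `X[i - 1, 0]` in this sketch.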

Of course, sometimes the label may be more than just a single number. For example, if we want to train a speech recognition system to transcribe entire sentences, then the label for each example sentence is a sequence of words.



Just as there is no formal definition of supervised and unsupervised learning, there is no rigid taxonomy of datasets or experiences. The structures described here cover most cases, but it is always possible to design new ones for new applications.



Example: Linear Regression



As the name implies, linear regression solves a regression problem. In other words, the goal is to build a system that can take a vector $\boldsymbol{x} \in \mathbb{R}^{n}$ as input and predict the value of a scalar $y \in \mathbb{R}$ as its output. In the case of linear regression, the output is a linear function of the input. Let $\hat{y}$ be the value that our model predicts $y$ should take on. We define the output to be
$$\hat{y}=\boldsymbol{w}^{\top} \boldsymbol{x}$$
where $\boldsymbol{w} \in \mathbb{R}^{n}$ is a vector of parameters.



Parameters are values that control the behavior of the system. In this case, $w_{i}$ is the coefficient that we multiply by feature $x_{i}$ before summing up the contributions from all the features. We can think of $\boldsymbol{w}$ as a set of weights that determine how each feature affects the prediction. If a feature $x_{i}$ receives a positive weight $w_{i}$, then increasing the value of that feature increases the value of our prediction $\hat{y}$. If a feature receives a negative weight, then increasing the value of that feature decreases the value of our prediction. If a feature’s weight is large in magnitude, then it has a large effect on the prediction. If a feature’s weight is zero, it has no effect on the prediction.
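
The role of each weight can be seen in a few lines of NumPy. This is a minimal sketch with hypothetical weights and inputs; `predict` is an illustrative helper name, not something defined in the text.

```python
import numpy as np


def predict(w, x):
    """Linear regression prediction y_hat = w^T x."""
    return w @ x


w = np.array([2.0, -1.0, 0.0])   # hypothetical weights
x = np.array([1.0, 3.0, 5.0])    # hypothetical input

y_hat = predict(w, x)            # 2*1 + (-1)*3 + 0*5

# Increase each feature by 1 in turn and record the change in prediction.
effects = np.array([predict(w, x + np.eye(3)[i]) - y_hat for i in range(3)])
```

Raising feature $i$ by one unit changes the prediction by exactly $w_{i}$, so `effects` equals `w`: positive for the first feature, negative for the second, and zero for the third.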



We thus have a definition of our task $T$: to predict $y$ from $\boldsymbol{x}$ by outputting $\hat{y}=\boldsymbol{w}^{\top} \boldsymbol{x}$. Next we need a definition of our performance measure, $P$.



Suppose that we have a design matrix of $m$ example inputs that we will not use for training, only for evaluating how well the model performs. We also have a vector of regression targets providing the correct value of $y$ for each of these examples. Because this dataset will only be used for evaluation, we call it the test set. We refer to the design matrix of inputs as $\boldsymbol{X}^{(\text{test})}$ and the vector of regression targets as $\boldsymbol{y}^{(\text{test})}$.



One way of measuring the performance of the model is to compute the mean squared error of the model on the test set. If $\hat{\boldsymbol{y}}^{(\text{test})}$ gives the predictions of the model on the test set, then the mean squared error is given by
$$\mathrm{MSE}_{\text{test}}=\frac{1}{m} \sum_{i}\left(\hat{\boldsymbol{y}}^{(\text{test})}-\boldsymbol{y}^{(\text{test})}\right)_{i}^{2}$$
Intuitively, one can see that this error measure decreases to 0 when $\hat{\boldsymbol{y}}^{(\text{test})}=\boldsymbol{y}^{(\text{test})}$. We can also see that
$$\mathrm{MSE}_{\text{test}}=\frac{1}{m}\left\|\hat{\boldsymbol{y}}^{(\text{test})}-\boldsymbol{y}^{(\text{test})}\right\|_{2}^{2}$$
so the error increases whenever the Euclidean distance between the predictions and the targets increases.
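
The two forms of the mean squared error can be compared directly. The sketch below uses small made-up target and prediction vectors; the variable names are illustrative.

```python
import numpy as np

y_test = np.array([1.0, 2.0, 3.0, 4.0])       # hypothetical targets
y_hat_test = np.array([1.5, 2.0, 2.0, 4.5])   # hypothetical predictions
m = len(y_test)

# Elementwise form: average of the squared errors.
mse_sum = np.sum((y_hat_test - y_test) ** 2) / m

# Norm form: squared Euclidean distance divided by m.
mse_norm = np.linalg.norm(y_hat_test - y_test) ** 2 / m
```

Both expressions give the same value (0.375 for these numbers), since the squared L2 norm of a vector is the sum of its squared entries.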



To make a machine learning algorithm, we need to design an algorithm that will improve the weights $\boldsymbol{w}$ in a way that reduces $\mathrm{MSE}_{\text{test}}$ when the algorithm is allowed to gain experience by observing a training set $\left(\boldsymbol{X}^{(\text{train})}, \boldsymbol{y}^{(\text{train})}\right)$. One intuitive way of doing this is just to minimize the mean squared error on the training set, $\mathrm{MSE}_{\text{train}}$. To minimize $\mathrm{MSE}_{\text{train}}$, we can simply solve for where its gradient is $\mathbf{0}$:
$$
\begin{aligned}
& \nabla_{\boldsymbol{w}} \mathrm{MSE}_{\text{train}}=0 \\
\Rightarrow\ & \nabla_{\boldsymbol{w}} \frac{1}{m}\left\|\hat{\boldsymbol{y}}^{(\text{train})}-\boldsymbol{y}^{(\text{train})}\right\|_{2}^{2}=0 \\
\Rightarrow\ & \frac{1}{m} \nabla_{\boldsymbol{w}}\left\|\boldsymbol{X}^{(\text{train})} \boldsymbol{w}-\boldsymbol{y}^{(\text{train})}\right\|_{2}^{2}=0 \\
\Rightarrow\ & \nabla_{\boldsymbol{w}}\left(\boldsymbol{X}^{(\text{train})} \boldsymbol{w}-\boldsymbol{y}^{(\text{train})}\right)^{\top}\left(\boldsymbol{X}^{(\text{train})} \boldsymbol{w}-\boldsymbol{y}^{(\text{train})}\right)=0 \\
\Rightarrow\ & \nabla_{\boldsymbol{w}}\left(\boldsymbol{w}^{\top} \boldsymbol{X}^{(\text{train}) \top} \boldsymbol{X}^{(\text{train})} \boldsymbol{w}-2 \boldsymbol{w}^{\top} \boldsymbol{X}^{(\text{train}) \top} \boldsymbol{y}^{(\text{train})}+\boldsymbol{y}^{(\text{train}) \top} \boldsymbol{y}^{(\text{train})}\right)=0 \\
\Rightarrow\ & 2 \boldsymbol{X}^{(\text{train}) \top} \boldsymbol{X}^{(\text{train})} \boldsymbol{w}-2 \boldsymbol{X}^{(\text{train}) \top} \boldsymbol{y}^{(\text{train})}=0 \\
\Rightarrow\ & \boldsymbol{w}=\left(\boldsymbol{X}^{(\text{train}) \top} \boldsymbol{X}^{(\text{train})}\right)^{-1} \boldsymbol{X}^{(\text{train}) \top} \boldsymbol{y}^{(\text{train})}
\end{aligned}
$$
(The last line is the solution of the so-called normal equations, $\boldsymbol{X}^{(\text{train}) \top} \boldsymbol{X}^{(\text{train})} \boldsymbol{w}=\boldsymbol{X}^{(\text{train}) \top} \boldsymbol{y}^{(\text{train})}$.)
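
The closed-form solution can be tried on synthetic data. This is a minimal sketch, assuming that $\boldsymbol{X}^{\top} \boldsymbol{X}$ is invertible (true here because the random features are linearly independent); the data, the noise level, and `w_true` are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: m examples, n features, known true weights.
m, n = 50, 3
w_true = np.array([1.5, -2.0, 0.5])
X_train = rng.normal(size=(m, n))
y_train = X_train @ w_true + 0.01 * rng.normal(size=m)

# Solve (X^T X) w = X^T y rather than forming the inverse explicitly.
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

# The gradient of the training MSE should vanish at the solution.
grad = (2.0 / m) * (X_train.T @ (X_train @ w - y_train))

# Cross-check against NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
```

Calling `np.linalg.solve` on the linear system is a design choice of this sketch: it gives the same $\boldsymbol{w}$ as the inverse formula but is generally better behaved numerically than computing the inverse.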
It is worth noting that the term linear regression is often used to refer to a slightly more sophisticated model with one additional parameter, an intercept term $b$. In this model
$$\hat{y}=\boldsymbol{w}^{\top} \boldsymbol{x}+b$$
so the mapping from parameters to predictions is still a linear function but the mapping from features to predictions is now an affine function. This extension to affine functions means that the plot of the model’s predictions still looks like a line, but it need not pass through the origin. Instead of adding the bias parameter $b$, one can continue to use the model with only weights but augment $\boldsymbol{x}$ with an extra entry that is always set to 1. The weight corresponding to the extra 1 entry plays the role of the bias parameter. We will frequently use the term “linear” when referring to affine functions throughout this book.
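
The augmentation trick can be sketched as follows. The example assumes noise-free synthetic data generated from a known affine function; `w_true`, `b_true`, and the other names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from an affine function y = w^T x + b (noise-free).
m, n = 40, 2
w_true, b_true = np.array([3.0, -1.0]), 4.0
X = rng.normal(size=(m, n))
y = X @ w_true + b_true

# Append a constant-1 column so the last weight acts as the bias.
X_aug = np.hstack([X, np.ones((m, 1))])
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_fit, b_fit = theta[:-1], theta[-1]
```

Fitting the weights-only model on `X_aug` recovers both the original weights and the intercept, with the final entry of `theta` playing the role of $b$.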

The intercept term $b$ is often called the bias parameter of the affine transformation. This terminology derives from the point of view that the output of the transformation is biased toward being $b$ in the absence of any input. This term is different from the idea of a statistical bias, in which a statistical estimation algorithm’s expected estimate of a quantity is not equal to the true quantity.