Officially begin

Deep = Many hidden layers

Neural Network

Find a function in a function set.

Goodness of function

Pick the best function

Backpropagation - Backward Pass

A neural network run in reverse.

Regression

  • Stock Market Forecast
  • Self-driving Car
  • Recommendation

Step 1 : Model

A set of functions

Step 2 : Goodness of Function

$\hat{y}^1$ denotes the ground-truth value corresponding to $x^1$.

Loss function $L$: its input is a function, its output is how bad that function is. Since each candidate function is determined by $(w, b)$, we can write $L(f) = L(w,b)$.

$$ L(w,b)=\sum_{n=1}^{10}(\hat{y}^n-(b+w\cdot x_{cp}^n))^2 $$
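A minimal Python sketch of this loss, with ten made-up $(x_{cp}, \hat{y})$ pairs standing in for real training data:

```python
import numpy as np

# Ten made-up (x_cp, y_hat) pairs standing in for real training data.
x_cp = np.array([10., 20., 30., 40., 50., 60., 70., 80., 90., 100.])
y_hat = np.array([12., 25., 33., 45., 52., 61., 74., 83., 95., 101.])

def loss(w, b):
    """L(w, b): sum of squared errors of the linear model b + w * x_cp."""
    return np.sum((y_hat - (b + w * x_cp)) ** 2)

print(loss(w=1.0, b=0.0))  # smaller is better
```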

Step 3 : Best Function

In linear regression, the loss function L is convex.

Overfitting

Regularization

$$ L(w,b)=\sum_{n=1}^{10}(\hat{y}^n-(b+w\cdot x_{cp}^n))^2+\lambda\cdot \sum_i(w_i)^2 $$

The bias term is not regularized; $\lambda$ adjusts how smooth the function is.

  • Gradient descent (sketched below)
  • Overfitting and Regularization
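A sketch of gradient descent on the regularized loss, reusing the made-up data from above; note that $\lambda$ only penalizes $w$, never the bias $b$:

```python
import numpy as np

x_cp = np.array([10., 20., 30., 40., 50., 60., 70., 80., 90., 100.])
y_hat = np.array([12., 25., 33., 45., 52., 61., 74., 83., 95., 101.])

def gradient_descent(eta=1e-5, lam=0.1, steps=10000):
    """Minimize sum (y_hat - (b + w*x))^2 + lam * w^2 by gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        err = y_hat - (b + w * x_cp)
        grad_w = -2.0 * np.sum(err * x_cp) + 2.0 * lam * w  # bias not regularized
        grad_b = -2.0 * np.sum(err)
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

print(gradient_descent())  # w ends up near 1 for this roughly linear data
```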

Classification

independently and identically distributed (i.i.d.). We want

$$ L(h^{train},D_{all})-L(h^{all}, D_{all}) \leq \delta $$

For this it suffices that $\forall h \in \mathcal{H},\ |L(h,D_{train}) - L(h,D_{all})| \leq \delta/2$, because then

$$ L(h^{train},D_{all}) \leq L(h^{train},D_{train}) + \delta/2 \leq L(h^{all},D_{train}) + \delta/2 \leq L(h^{all},D_{all}) + \delta $$

Revisiting the Digimon example:

Probability of sampling a bad training set:

$$ P(D_{train}\ is\ bad)\leq |\mathcal{H}| \cdot 2\exp(-2N\epsilon^2) $$

Keeping this probability below $\delta$ requires

$$ N \ge \frac{\log(2|\mathcal{H}|/\delta)}{2\epsilon^2} $$

Tradeoff of Model Complexity
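A quick calculation of the required $N$, with assumed values for $|\mathcal{H}|$, $\delta$, and $\epsilon$:

```python
import numpy as np

H_size = 10000   # |H|: number of candidate functions (assumed)
delta = 0.1      # allowed probability of sampling a bad D_train (assumed)
epsilon = 0.01   # allowed gap between training and true loss (assumed)

N = np.log(2 * H_size / delta) / (2 * epsilon ** 2)
print(int(np.ceil(N)))  # about 61,000 examples
```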

Training data for Classification

$(x, \hat{y})$ pairs

Ideal Alternatives

Function (Model):

$$ f(x):\ \text{if } g(x)>0,\ \text{Output}=\text{class 1};\ \text{else Output}=\text{class 2} $$

Loss function:

The number of times $f$ gets an incorrect result on the training data. $$ L(f) = \sum_{n}\delta(f(x^n)\neq\hat{y}^n) $$

Find the best function:
  • Example : Perceptron, SVM (a minimal perceptron sketch follows)
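The 0/1 loss above is not differentiable, so it cannot be minimized with gradient descent; the perceptron instead updates only on misclassified points. A minimal sketch on made-up toy data:

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Minimal perceptron: labels must be +1 / -1; returns (w, b)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xn, yn in zip(X, y):
            if yn * (np.dot(w, xn) + b) <= 0:  # misclassified: update
                w += yn * xn
                b += yn
    return w, b

# Toy linearly separable data (made up for illustration).
X = np.array([[2.0, 3.0], [1.0, 1.5], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))
```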

Prior

$$ P(C_1|x)=\frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)} $$
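Plugging made-up priors and likelihoods into the posterior formula:

```python
# All numbers below are made-up placeholders.
p_c1, p_c2 = 0.6, 0.4        # priors P(C1), P(C2)
p_x_c1, p_x_c2 = 0.05, 0.01  # class-conditional likelihoods P(x|C1), P(x|C2)

posterior = (p_x_c1 * p_c1) / (p_x_c1 * p_c1 + p_x_c2 * p_c2)
print(posterior)  # ~0.88, so x is far more likely to belong to C1
```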

Gaussian

Maximum Likelihood

A 2D or 3D array means an array with 2 or 3 axes respectively, but an n-dimensional vector means a vector of length n.

Learn something that can really set you apart from others.

Logistic Regression

Function Set

$$ f_{w,b}(x)=\sigma\left(\sum_{i}w_ix_i+b\right) $$

Output : Between 0 and 1 $$ f_{w,b}(x)=P_{w,b}(C_1|x) $$
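A direct translation of this function set into Python (weights, bias, and input are arbitrary placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f_wb(x, w, b):
    """f_{w,b}(x) = sigma(sum_i w_i * x_i + b), read as P(C1 | x)."""
    return sigmoid(np.dot(w, x) + b)

print(f_wb(x=np.array([1.0, 2.0]), w=np.array([0.5, -0.3]), b=0.1))
```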

$$ w^*,b^*=\arg\underset{w,b}{\max}\ L(w,b) $$

which is equivalent to

$$ w^*,b^* = \arg\underset{w,b}{\min}\ -\ln L(w,b) $$

Cross Entropy:

$$ \text{Distribution } p:\ p(x=1)=\hat{y}^n,\quad p(x=0)=1-\hat{y}^n $$

$$ \text{Distribution } q:\ q(x=1)=f(x^n),\quad q(x=0)=1-f(x^n) $$

$$ H(p,q)=-\sum_x p(x)\ln q(x) $$

Loss Function

$$ L(f)=\sum_n C(f(x^n),\hat{y}^n) $$

$$ C(f(x^n),\hat{y}^n)=-[\hat{y}^n\ln f(x^n)+(1-\hat{y}^n)\ln(1-f(x^n))] $$
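The same loss in code; the clipping constant is an added safeguard, not part of the notes:

```python
import numpy as np

def cross_entropy_loss(f_x, y_hat, eps=1e-12):
    """L(f) = -sum_n [y_hat ln f(x) + (1 - y_hat) ln(1 - f(x))]."""
    f_x = np.clip(f_x, eps, 1 - eps)  # keep the logs finite
    return -np.sum(y_hat * np.log(f_x) + (1 - y_hat) * np.log(1 - f_x))

print(cross_entropy_loss(np.array([0.9, 0.2, 0.7]), np.array([1, 0, 1])))
```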

Update

The update rule for logistic regression has exactly the same form as for linear regression: $$ w_i\gets w_i-\eta \sum_{n}-(\hat{y}^n-f_{w,b}(x^n))x_i^n $$
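One update step written out; only the definition of $f_{w,b}$ distinguishes the logistic case from the linear one:

```python
import numpy as np

def update_step(w, b, X, y_hat, eta):
    """w_i <- w_i - eta * sum_n -(y_hat^n - f(x^n)) * x_i^n, plus the bias."""
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # logistic f; linear would use X @ w + b
    err = y_hat - f                         # (y_hat^n - f(x^n))
    w = w + eta * X.T @ err                 # vectorized form of the sum above
    b = b + eta * np.sum(err)
    return w, b
```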

Discriminative (logistic) & Generative (described with Gaussians)

The generative model makes certain assumptions about the data distribution.

  • Benefits of the generative model
    • With the assumption of a probability distribution, less training data is needed
    • With the assumption of a probability distribution, it is more robust to noise
    • Priors and class-dependent probabilities can be estimated from different sources.

Multi-class Classification

Softmax

$$ \text{Softmax}(z_i)=\frac{e^{z_i}}{\sum_{c=1}^{C} e^{z_c}},\qquad 1 > y_i' > 0,\qquad \sum_i y_i'=1 $$
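A softmax sketch; subtracting the max is a standard numerical-stability trick not mentioned in the notes:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift-invariant, avoids overflow
    return e / np.sum(e)

y = softmax(np.array([3.0, 1.0, -2.0]))
print(y, y.sum())  # every entry in (0, 1); entries sum to 1
```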

Limitation of Logistic Regression

It can only draw a straight (linear) decision boundary.

  • Feature Transformation (example below)
    • Cascading logistic regression models
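A tiny illustration of feature transformation; the XOR-style data and the distance features are assumptions for the demo, not from the notes. The raw points are not linearly separable, but distances to two anchor points make them separable:

```python
import numpy as np

# XOR-style labels: not linearly separable in the raw (x1, x2) space.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 1, 1, 0])

# Transformed features: distance to [0, 0] and distance to [1, 1].
phi = np.stack([np.linalg.norm(X - np.array([0., 0.]), axis=1),
                np.linalg.norm(X - np.array([1., 1.]), axis=1)], axis=1)
print(phi)
# Both class-1 points map to (1, 1); the class-0 points map to (0, 1.41)
# and (1.41, 0), so a single line now separates the two classes.
```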

Optimization Issue

A network with more layers can perform worse than a shallower one, even on the training data, which points to an optimization issue rather than overfitting.

Overfitting

  • More training data

  • Data augmentation

  • Constrained model

    • Less parameters, sharing parameters
    • Less features
    • Early stopping

CNN -> a less flexible (more constrained) model

Split the Training Set (training + validation)

  • N-fold Cross Validation (sketched below)
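A sketch of N-fold cross validation; `train_and_eval` is an assumed user-supplied callback, not a library function:

```python
import numpy as np

def n_fold_cv(X, y, train_and_eval, n_folds=3):
    """Each fold serves as the validation set once; average the errors."""
    idx = np.random.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    errors = []
    for k in range(n_folds):
        val = folds[k]
        tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        errors.append(train_and_eval(X[tr], y[tr], X[val], y[val]))
    return np.mean(errors)  # used to pick between candidate models
```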

When Optimization Fails

H : Hessian

Taylor Series Approximation $$ L(\theta) \approx L(\theta^\prime)+\frac{1}{2}(\theta-\theta^\prime)^TH(\theta-\theta^\prime) $$

  • H is positive definite = all eigenvalues are positive -> local minima
  • H is negative definite = all eigenvalues are negative -> local maxima
  • Some eigenvalues are positive and some are negative -> saddle point (see the sketch below)
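A numerical check of the three cases (the 2x2 Hessian here is a made-up example):

```python
import numpy as np

def classify_critical_point(H):
    eig = np.linalg.eigvalsh(H)  # H is symmetric, so eigvalsh is appropriate
    if np.all(eig > 0):
        return "local minimum"
    if np.all(eig < 0):
        return "local maximum"
    return "saddle point"

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -1.0]])))  # saddle point
```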

In high dimensions, what looks like a local minimum may turn out to be a saddle point.

Tips for training : Batch and Momentum

Batch

1 epoch = see all the batches once -> shuffle after each epoch
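A sketch of this loop, reshuffling at the start of every epoch:

```python
import numpy as np

def epoch_batches(X, y, batch_size):
    """Yield the whole dataset once, in freshly shuffled batches."""
    idx = np.random.permutation(len(X))  # new shuffle each epoch
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]
```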

Momentum

Movement is based not just on the current gradient but also on the previous movement, as sketched below.
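One momentum step in code; `lam` (how much previous movement persists) and `eta` are assumed hyperparameters:

```python
def momentum_step(theta, grad, movement, eta=0.01, lam=0.9):
    """movement = lam * previous movement - eta * current gradient."""
    movement = lam * movement - eta * grad
    return theta + movement, movement  # carry movement into the next step
```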

Different parameters need different learning rates.

$$ \theta_i^{t+1} \gets \theta_i^t-\frac{\eta}{\sigma_i^t}g_i^t,\qquad \sigma_i^t=\sqrt{\frac{1}{t+1}\sum_{k=0}^t(g_i^k)^2} $$

Adagrad

RMSProp

$$ \theta_i^{t+1} \gets \theta_i^t-\frac{\eta}{\sigma_i^t}g_i^t,\qquad \sigma_i^t = \sqrt{\alpha(\sigma_i^{t-1})^2+(1-\alpha)(g_i^t)^2} $$

Adam : RMSProp + Momentum
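The RMSProp step in code, contrasted with Adagrad in the comments; Adam is this plus the momentum term above (hyperparameter values are illustrative):

```python
import numpy as np

def rmsprop_step(theta, g, sigma_sq, eta=1e-3, alpha=0.9, eps=1e-8):
    """Unlike Adagrad's all-history average, sigma here is an exponential
    moving average, so recent gradients dominate the step size."""
    sigma_sq = alpha * sigma_sq + (1 - alpha) * g ** 2
    theta = theta - eta / (np.sqrt(sigma_sq) + eps) * g
    return theta, sigma_sq
```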

Learning Rate Scheduling

$$ \theta_i^{t+1} \gets \theta_i^t-\frac{\eta^t}{\sigma_i^t}g_i^t $$
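Two common schedules for $\eta^t$ (the exact formulas are illustrative choices, not from the notes): decay shrinks the step over time, warm-up ramps it up first and then decays:

```python
import numpy as np

def decay(eta0, t):
    """Learning rate shrinks as training progresses."""
    return eta0 / np.sqrt(t + 1)

def warmup_then_decay(eta0, t, warm=100):
    """Ramp up for the first `warm` steps, then decay."""
    return eta0 * min((t + 1) / warm, np.sqrt(warm / (t + 1)))
```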