Different approaches for different problems, e.g. dropout is used to get good results on the testing data.

Choosing a proper loss

  • Square Error


$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

  • Cross Entropy


$-\sum_{i=1}^{n} \hat{y}_i \ln y_i$
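
In Keras, the loss is selected when compiling the model; a minimal sketch (the `model` object is a placeholder), noting that for classification with a softmax output layer, cross entropy usually behaves better than square error:

```python
# Cross entropy for a softmax output layer; swap in loss='mse' for square error.
model.compile(loss='categorical_crossentropy', optimizer='sgd',
              metrics=['accuracy'])
```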

Mini-batch

We do not really minimize total loss!

batch_size: the number of training samples processed in each batch;
nb_epoch: the number of times the entire training set is passed through.
The total number of training samples stays the same.

Mini-batch is faster: it updates the parameters many more times for the same amount of computation (although this is not always true with parallel computing).

Mini-batch has better performance!

Shuffle the training examples for each epoch. This is the default of Keras.
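
A minimal Keras sketch of mini-batch training (`model`, `x_train`, and `y_train` are placeholders; in current Keras the argument is `epochs` rather than `nb_epoch`):

```python
# batch_size: number of samples per parameter update (one mini-batch);
# epochs: number of passes over the whole training set;
# shuffle=True (the Keras default) reshuffles the examples every epoch.
model.fit(x_train, y_train,
          batch_size=100,
          epochs=20,
          shuffle=True)
```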

New activation function

Q: Vanishing Gradient Problem

  • Layers near the input: smaller gradients, learn very slowly, and are still almost random when training stops.
  • Layers near the output: larger gradients, learn very fast, and have already converged (based on the nearly random lower layers).

2006 RBM → 2015 ReLU

ReLU: Rectified Linear Unit
1. Fast to compute
2. Biological reason
3. Equivalent to an infinite number of sigmoids with different biases
4. Handles the vanishing gradient problem

With ReLU, the inactive (zero-output) neurons can be removed, leaving a thinner linear network in which the gradients do not become smaller as they propagate backwards.
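
A minimal NumPy sketch of ReLU and its gradient (illustrative only, not the Keras implementation):

```python
import numpy as np

def relu(z):
    """ReLU: pass positive inputs through unchanged, clamp the rest to 0."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of ReLU: 1 for active units, 0 for inactive ones,
    so gradients flowing through active units are not attenuated."""
    return (z > 0).astype(z.dtype)
```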


Adaptive Learning Rate

Set the learning rate η

  • If learning rate is too large, total loss may not decrease after each update.
  • If learning rate is too small, training would be too slow.

Solution:

  • Popular & Simple Idea: Reduce the learning rate by some factor every few epochs.
  • At the beginning, use larger learning rate
  • After several epochs, reduce the learning rate, e.g. 1/t decay: $\eta^t = \eta / \sqrt{t+1}$ (see the sketch after this list).
  • Learning rate cannot be one-size-fits-all.
  • Giving different parameters different learning rates
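
A hedged sketch of 1/t decay via Keras's LearningRateScheduler callback (the initial rate eta0 = 0.1 and the per-epoch schedule are assumptions):

```python
import math
from keras.callbacks import LearningRateScheduler

eta0 = 0.1  # assumed initial learning rate

def one_over_t_decay(epoch):
    # eta^t = eta / sqrt(t + 1), with t counted in epochs here
    return eta0 / math.sqrt(epoch + 1)

lr_schedule = LearningRateScheduler(one_over_t_decay)
# model.fit(x_train, y_train, epochs=20, callbacks=[lr_schedule])
```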

Adagrad: $w \leftarrow w - \eta_w \frac{\partial L}{\partial w}$
$\eta_w$: parameter-dependent learning rate

$\eta_w = \dfrac{\eta}{\sqrt{\sum_{i=0}^{t} (g^i)^2}}$

$\eta$: constant
$g^i$: the derivative $\partial L / \partial w$ obtained at the $i$-th update

Summation of the square of the previous derivatives.

Observation:
1. Learning rate is smaller and smaller for all parameters.
2. Smaller derivatives, larger learning rate, and vice versa.
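
A minimal NumPy sketch of this Adagrad-style update for a single parameter vector (`eta` and the small `eps` added for numerical stability are assumed values):

```python
import numpy as np

def adagrad_update(w, grad, accum, eta=0.1, eps=1e-8):
    """One Adagrad step: divide the base learning rate eta by the root of
    the accumulated squared derivatives, so parameters with a history of
    large gradients get smaller effective learning rates (and vice versa)."""
    accum = accum + grad ** 2                       # sum of squared past derivatives
    w = w - (eta / (np.sqrt(accum) + eps)) * grad   # parameter-dependent step
    return w, accum
```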

  • Adagrad [John Duchi, JMLR’11]
  • RMSprop
    https://www.youtube.com/watch?v=O3sxAc4hxZU
  • Adadelta [Matthew D. Zeiler, arXiv’12]
  • “No more pesky learning rates” [Tom Schaul, arXiv’12]
  • AdaSecant [Caglar Gulcehre, arXiv’14]
  • Adam [Diederik P. Kingma, ICLR’15]
  • Nadam
    http://cs229.stanford.edu/proj2015/054_report.pdf

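In Keras, these optimizers can simply be plugged into compile(); a brief sketch (`model` is a placeholder, and the default hyperparameters are used):

```python
from keras.optimizers import Adagrad, RMSprop, Adam

# Any of these optimizer objects (or their string names, e.g. 'adam')
# can be passed to compile().
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(),
              metrics=['accuracy'])
```
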
Momentum

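Momentum adds a fraction of the previous movement to the current gradient step, which helps training move past plateaus and poor local minima. A minimal sketch of the standard update (the coefficient `lam` and learning rate `eta` are assumed values):

```python
import numpy as np

def momentum_update(w, grad, velocity, eta=0.01, lam=0.9):
    """Momentum: the actual movement is the current gradient step plus
    lam times the previous movement, so past directions are remembered."""
    velocity = lam * velocity - eta * grad
    w = w + velocity
    return w, velocity
```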

Overfitting

  • The learning target is defined by the training data.
  • Training data and testing data can be different.
  • The parameters achieving the learning target do not necessarily give good results on the testing data.
  • Panacea for overfitting:
    • Have more training data
    • Create more training data (e.g. by augmenting existing data)

Early Stopping

Keras: EarlyStopping

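A brief Keras sketch of early stopping on a held-out validation set (the monitored quantity and patience of 3 epochs are illustrative):

```python
from keras.callbacks import EarlyStopping

# Stop training once the validation loss has not improved for 3 epochs.
early_stop = EarlyStopping(monitor='val_loss', patience=3)
# model.fit(x_train, y_train, validation_split=0.1,
#           epochs=50, callbacks=[early_stop])
```
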
Regularization

Weight decay is one kind of regularization.

Keras: regularizers

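A brief sketch of L2 weight decay on a single layer using Keras regularizers (the layer size and the 0.01 penalty are illustrative; argument names can differ between Keras versions):

```python
from keras.layers import Dense
from keras import regularizers

# L2 weight decay: adds 0.01 * sum(w^2) for this layer's weights to the loss,
# pushing the weights towards smaller values.
layer = Dense(64, activation='relu',
              kernel_regularizer=regularizers.l2(0.01))
```
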
Dropout

Training

  • Each time before updating the parameters:
    1. Each neuron has a p% chance to drop out, so the structure of the network is changed.
    2. Use the resulting thinner network for training.
  • For each mini-batch, we resample the dropped-out neurons.

Testing

**No dropout**

  • If the dropout rate at training is p%, multiply all the weights by (1-p)% for testing.
  • For example, assume the dropout rate is 50%: if a weight w = 1 after training, set w = 0.5 for testing.
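
A brief Keras sketch of dropout between fully-connected layers (the layer sizes, 784-dimensional input, and 0.5 rate are illustrative; the Keras Dropout layer is only active during training and handles the test-time rescaling internally, so no manual weight adjustment is needed):

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(500, activation='relu', input_shape=(784,)),
    Dropout(0.5),                  # each neuron has a 50% chance to drop out
    Dense(500, activation='relu'),
    Dropout(0.5),
    Dense(10, activation='softmax'),
])
```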

Dropout - Intuitive Reason

  • When working in a team, if everyone expects their partners to do the work, nothing gets done in the end.
  • However, if you know your partners may drop out, you will do the work better yourself.
  • When testing, no one actually drops out, so good results are eventually obtained.

Dropout is a kind of ensemble: each mini-batch effectively trains a different thinner network that shares parameters with all the others, and multiplying the weights by (1-p)% at testing approximates averaging the outputs of all these networks.


Network Structure

CNN is a very good example!
