Different approaches for different problems.
E.g. dropout is for getting good results on the testing data.
Choosing proper loss
- Square Error: $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
- Cross Entropy: $-\sum_{i=1}^{n}\hat{y}_i \ln y_i$
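As a minimal numpy sketch (my own example, following the notation above, with $\hat{y}$ the target and $y$ the network output), the two losses on a single 3-class prediction:

```python
import numpy as np

# y_hat is the one-hot target, y is the network's softmax output.
y_hat = np.array([1.0, 0.0, 0.0])   # target
y     = np.array([0.7, 0.2, 0.1])   # prediction

square_error  = np.sum((y - y_hat) ** 2)             # sum_i (y_i - y_hat_i)^2
cross_entropy = -np.sum(y_hat * np.log(y + 1e-12))   # -sum_i y_hat_i * ln(y_i)

print(square_error, cross_entropy)
```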
Mini-batch
We do not really minimize total loss!
- batch_size: the number of training samples processed in each batch.
- nb_epoch: the number of times the whole training set is repeated.
- The total number of training samples stays the same.
Mini-batch is faster, though this is not always true with parallel computing.
Mini-batch has better performance!
Shuffle the training examples for each epoch. This is the default in Keras.
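A minimal Keras sketch (layer sizes and data are placeholders, not from the notes); note that `nb_epoch` is the older Keras 1.x name for what newer versions call `epochs`:

```python
import numpy as np
from tensorflow import keras

# Dummy data just to make the sketch runnable (shapes are placeholders).
x_train = np.random.rand(1000, 784).astype("float32")
y_train = keras.utils.to_categorical(np.random.randint(10, size=1000), 10)

model = keras.Sequential([
    keras.layers.Dense(500, activation="relu", input_shape=(784,)),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="categorical_crossentropy")

# batch_size: samples per parameter update; epochs: passes over the whole set.
# Keras reshuffles the training examples every epoch by default (shuffle=True).
model.fit(x_train, y_train, batch_size=100, epochs=2, shuffle=True)
```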
New activation function
Q: Vanishing Gradient Problem
- Layers near the input: smaller gradients → learn very slowly → stay almost random
- Layers near the output: larger gradients → learn very fast → already converge (on top of nearly random lower layers)
2006: RBM pre-training → 2015: ReLU
ReLU: Rectified Linear Unit
1. Fast to compute
2. Biological reason
3. Infinite sigmoid with different biases
4. Handles the vanishing gradient problem
With some units inactive, the network becomes a thinner linear network, and the remaining units do not make the gradients smaller as they propagate back.
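A minimal sketch (my own illustration) of ReLU and its gradient: active units have slope exactly 1 and pass gradients through unchanged, while inactive units output 0 and effectively drop out of the computation.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Slope is 1 where the unit is active, 0 where it is inactive.
    return (z > 0).astype(z.dtype)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]
```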
Adaptive Learning Rate
Set the learning rate η carefully:
- If learning rate is too large, total loss may not decrease after each update.
- If learning rate is too small, training would be too slow.
Solution:
- Popular & Simple Idea: Reduce the learning rate by some factor every few epochs.
- At the beginning, use a larger learning rate.
- After several epochs, reduce the learning rate, e.g. 1/t decay: $\eta^t = \eta / \sqrt{t+1}$
- Learning rate cannot be one-size-fits-all.
- Giving different parameters different learning rates
Adagrad: $w \leftarrow w - \eta_w \frac{\partial L}{\partial w}$
$\eta_w$: parameter-dependent learning rate:
$\eta_w = \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}$
$\eta$: constant
$g^i$: the derivative $\partial L/\partial w$ obtained at the $i$-th update
The denominator is the root of the summation of the squares of all previous derivatives.
Observation:
1. Learning rate is smaller and smaller for all parameters.
2. Smaller derivatives, larger learning rate, and vice versa.
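A minimal numpy sketch of the update above (the small `eps` is an implementation detail I added to avoid division by zero; the gradients are made up for illustration):

```python
import numpy as np

def adagrad_update(w, grad, accum, eta=0.1, eps=1e-8):
    accum += grad ** 2                        # sum_i (g^i)^2, kept per parameter
    w -= eta / (np.sqrt(accum) + eps) * grad  # w <- w - eta_w * dL/dw
    return w, accum

w = np.array([1.0, 1.0])
accum = np.zeros_like(w)
for g in [np.array([0.5, 0.01]), np.array([0.4, 0.02])]:  # fake gradients
    w, accum = adagrad_update(w, g, accum)
    print(w)  # the parameter with smaller past derivatives keeps a larger effective learning rate
```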
- Adagrad [John Duchi, JMLR'11]
- RMSprop
  https://www.youtube.com/watch?v=O3sxAc4hxZU
- Adadelta [Matthew D. Zeiler, arXiv'12]
- "No more pesky learning rates" [Tom Schaul, arXiv'12]
- AdaSecant [Caglar Gulcehre, arXiv'14]
- Adam [Diederik P. Kingma, ICLR'15]
- Nadam
  http://cs229.stanford.edu/proj2015/054_report.pdf
Momentum
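The notes only name momentum here; below is a minimal sketch of the classic momentum update (an assumption on my part, not spelled out in the notes): the step is a mix of the previous step and the current gradient, which helps the parameters keep moving across plateaus.

```python
import numpy as np

def momentum_update(w, grad, velocity, eta=0.01, lam=0.9):
    velocity = lam * velocity - eta * grad  # remember the previous movement
    w = w + velocity
    return w, velocity

w, v = np.array([1.0]), np.zeros(1)
for g in [np.array([0.5]), np.array([0.5]), np.array([0.0])]:
    w, v = momentum_update(w, g, v)
    print(w, v)  # movement continues even when the current gradient is zero
```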
Overfitting
- The learning target is defined by the training data.
- Training data and testing data can be different.
- The parameters achieving the learning target do not necessarily give good results on the testing data.
- Panacea for Overfitting
- Have more training data
- Create more training data (e.g. data augmentation)
Early Stopping
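A minimal Keras sketch of early stopping, reusing the model and dummy data from the mini-batch sketch above: training stops once the validation loss stops improving for a few epochs.

```python
from tensorflow import keras

# model, x_train, y_train are taken from the earlier mini-batch sketch.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                            restore_best_weights=True)
model.fit(x_train, y_train, epochs=100, validation_split=0.1,
          callbacks=[early_stop])
```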
Regularization
Weight decay is one kind of regularization.
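As a sketch, L2 weight regularization in Keras plays this role: a penalty on the squared weights is added to the loss, which shrinks ("decays") the weights at every update. The coefficient 0.01 is only an illustrative value.

```python
from tensorflow import keras
from tensorflow.keras import regularizers

# A layer whose weights are penalized by 0.01 * sum(w^2) in the total loss.
layer = keras.layers.Dense(
    500, activation="relu",
    kernel_regularizer=regularizers.l2(0.01),
)
```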
Dropout
Training
- Each time before updating the parameters
- Each neuron has probability p% of being dropped out.
The structure of the network is changed.
- Using the new network for training
For each mini-batch, we resample the dropout neurons.
Testing
**No dropout**
- If the dropout rate at training is p%, multiply all the weights by (1-p)% at testing.
- Assume that the dropout rate is 50%.
If a weight is learned as w = 1 by training, set w = 0.5 for testing.
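A minimal numpy sketch (sizes and values are illustrative) of the train/test rule described above: resample a dropout mask during training, and at testing keep every unit but scale the trained weights by (1-p).

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                  # dropout rate
x = rng.normal(size=100)                 # activations of the previous layer
w = np.ones(100)                         # trained weights (w = 1 as in the example)

# Training: resample a dropout mask for every mini-batch / update.
mask = rng.random(100) >= p              # each unit kept with probability 1 - p
train_out = (x * mask) @ w

# Testing: no dropout, but scale the weights by (1 - p)  ->  w = 0.5 here.
test_out = x @ (w * (1 - p))
print(train_out, test_out)               # test output matches the training output in expectation
```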
Dropout - Intuitive Reason
- When working in a team, if everyone expects their partners to do the work, nothing gets done in the end.
- However, if you know your partners may drop out, you will do a better job yourself.
- At testing time, no one actually drops out, so the results end up being good.
Dropout is a kind of ensemble
Network Structure
CNN is a very good example!
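As a rough illustration (my own example, not from the notes), a small Keras CNN shows how the network structure itself can encode knowledge about the problem, here images:

```python
from tensorflow import keras

# Convolution and pooling build local-pattern and translation priors into the
# structure; layer counts and filter sizes here are arbitrary placeholders.
model = keras.Sequential([
    keras.layers.Conv2D(25, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(50, (3, 3), activation="relu"),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
model.summary()
```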
References