Different approaches for different problems.
E.g. dropout is for getting good results on the testing data.
Choosing proper loss
- Square Error: $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
- Cross Entropy: $-\sum_{i=1}^{n}\hat{y}_i \ln y_i$
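As a minimal numpy sketch (my own example, following the notation above, with $\hat{y}$ the target and $y$ the network output), the two losses on a single 3-class prediction:

```python
import numpy as np

# y_hat is the one-hot target, y is the network's softmax output.
y_hat = np.array([1.0, 0.0, 0.0])   # target
y     = np.array([0.7, 0.2, 0.1])   # prediction

square_error  = np.sum((y - y_hat) ** 2)             # sum_i (y_i - y_hat_i)^2
cross_entropy = -np.sum(y_hat * np.log(y + 1e-12))   # -sum_i y_hat_i * ln(y_i)

print(square_error, cross_entropy)
```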
Mini-batch
We do not really minimize total loss!
- batch_size: the number of training samples processed in each batch.
- nb_epoch: the number of times the whole training set is repeated.
- The total number of training samples stays the same.
Mini-batch is faster, though this is not always true with parallel computing.
Mini-batch has better performance!
Shuffle the training examples for each epoch. This is the default in Keras.
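A minimal Keras sketch (layer sizes and data are placeholders, not from the notes); note that `nb_epoch` is the older Keras 1.x name for what newer versions call `epochs`:

```python
import numpy as np
from tensorflow import keras

# Dummy data just to make the sketch runnable (shapes are placeholders).
x_train = np.random.rand(1000, 784).astype("float32")
y_train = keras.utils.to_categorical(np.random.randint(10, size=1000), 10)

model = keras.Sequential([
    keras.layers.Dense(500, activation="relu", input_shape=(784,)),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="categorical_crossentropy")

# batch_size: samples per parameter update; epochs: passes over the whole set.
# Keras reshuffles the training examples every epoch by default (shuffle=True).
model.fit(x_train, y_train, batch_size=100, epochs=2, shuffle=True)
```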
New activation function
Q: Vanishing Gradient Problem
- Layers near the input: smaller gradients → learn very slowly → stay almost random
- Layers near the output: larger gradients → learn very fast → already converge (on top of nearly random lower layers)
2006: RBM pre-training → 2015: ReLU
ReLU: Rectified Linear Unit
1. Fast to compute
2. Biological reason
3. Infinite sigmoid with different biases
4. Handles the vanishing gradient problem
With some units inactive, the network becomes a thinner linear network, and the remaining units do not make the gradients smaller as they propagate back.
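A minimal sketch (my own illustration) of ReLU and its gradient: active units have slope exactly 1 and pass gradients through unchanged, while inactive units output 0 and effectively drop out of the computation.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Slope is 1 where the unit is active, 0 where it is inactive.
    return (z > 0).astype(z.dtype)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]
```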
Adaptive Learning Rate
Set the learning rate η carefully:
- If learning rate is too large, total loss may not decrease after each update.
- If learning rate is too small, training would be too slow.
Solution:
- Popular & Simple Idea: Reduce the learning rate by some factor every few epochs.
- At the beginning, use a larger learning rate.
- After several epochs, reduce the learning rate, e.g. 1/t decay: $\eta^t = \eta / \sqrt{t+1}$
- Learning rate cannot be one-size-fits-all.
- Giving different parameters different learning rates
Adagrad: $w \leftarrow w - \eta_w \frac{\partial L}{\partial w}$
$\eta_w$: parameter-dependent learning rate:
$\eta_w = \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}$
$\eta$: constant
$g^i$: the derivative $\partial L/\partial w$ obtained at the $i$-th update
The denominator is the root of the summation of the squares of all previous derivatives.
Observation:
1. Learning rate is smaller and smaller for all parameters.
2. Smaller derivatives, larger learning rate, and vice versa.
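A minimal numpy sketch of the update above (the small `eps` is an implementation detail I added to avoid division by zero; the gradients are made up for illustration):

```python
import numpy as np

def adagrad_update(w, grad, accum, eta=0.1, eps=1e-8):
    accum += grad ** 2                        # sum_i (g^i)^2, kept per parameter
    w -= eta / (np.sqrt(accum) + eps) * grad  # w <- w - eta_w * dL/dw
    return w, accum

w = np.array([1.0, 1.0])
accum = np.zeros_like(w)
for g in [np.array([0.5, 0.01]), np.array([0.4, 0.02])]:  # fake gradients
    w, accum = adagrad_update(w, g, accum)
    print(w)  # the parameter with smaller past derivatives keeps a larger effective learning rate
```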
- Adagrad [John Duchi, JMLR'11]
- RMSprop
  https://www.youtube.com/watch?v=O3sxAc4hxZU
- Adadelta [Matthew D. Zeiler, arXiv'12]
- "No more pesky learning rates" [Tom Schaul, arXiv'12]
- AdaSecant [Caglar Gulcehre, arXiv'14]
- Adam [Diederik P. Kingma, ICLR'15]
- Nadam
  http://cs229.stanford.edu/proj2015/054_report.pdf
Momentum
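The notes only name momentum here; below is a minimal sketch of the classic momentum update (an assumption on my part, not spelled out in the notes): the step is a mix of the previous step and the current gradient, which helps the parameters keep moving across plateaus.

```python
import numpy as np

def momentum_update(w, grad, velocity, eta=0.01, lam=0.9):
    velocity = lam * velocity - eta * grad  # remember the previous movement
    w = w + velocity
    return w, velocity

w, v = np.array([1.0]), np.zeros(1)
for g in [np.array([0.5]), np.array([0.5]), np.array([0.0])]:
    w, v = momentum_update(w, g, v)
    print(w, v)  # movement continues even when the current gradient is zero
```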
Overfitting
- The learning target is defined by the training data.
- Training data and testing data can be different.
- The parameters achieving the learning target do not necessarily give good results on the testing data.
- Panacea for Overfitting
- Have more training data
- Create more training data (e.g. data augmentation)
Early Stopping
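A minimal Keras sketch of early stopping, reusing the model and dummy data from the mini-batch sketch above: training stops once the validation loss stops improving for a few epochs.

```python
from tensorflow import keras

# model, x_train, y_train are taken from the earlier mini-batch sketch.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                            restore_best_weights=True)
model.fit(x_train, y_train, epochs=100, validation_split=0.1,
          callbacks=[early_stop])
```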
Regularization
Weight decay is one kind of regularization.
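As a sketch, L2 weight regularization in Keras plays this role: a penalty on the squared weights is added to the loss, which shrinks ("decays") the weights at every update. The coefficient 0.01 is only an illustrative value.

```python
from tensorflow import keras
from tensorflow.keras import regularizers

# A layer whose weights are penalized by 0.01 * sum(w^2) in the total loss.
layer = keras.layers.Dense(
    500, activation="relu",
    kernel_regularizer=regularizers.l2(0.01),
)
```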
Dropout
Training
- Each time before updating the parameters
- Each neuron has probability p% of being dropped out.
The structure of the network is changed.
- Using the new network for training
For each mini-batch, we resample the dropout neurons.
Testing
**No dropout**
- If the dropout rate at training is p%, multiply all the weights by (1-p)% at testing.
- Assume that the dropout rate is 50%.
If a weight is learned as w = 1 by training, set w = 0.5 for testing.
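A minimal numpy sketch (sizes and values are illustrative) of the train/test rule described above: resample a dropout mask during training, and at testing keep every unit but scale the trained weights by (1-p).

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                  # dropout rate
x = rng.normal(size=100)                 # activations of the previous layer
w = np.ones(100)                         # trained weights (w = 1 as in the example)

# Training: resample a dropout mask for every mini-batch / update.
mask = rng.random(100) >= p              # each unit kept with probability 1 - p
train_out = (x * mask) @ w

# Testing: no dropout, but scale the weights by (1 - p)  ->  w = 0.5 here.
test_out = x @ (w * (1 - p))
print(train_out, test_out)               # test output matches the training output in expectation
```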
Dropout - Intuitive Reason
- When working in a team, if everyone expects their partners to do the work, nothing gets done in the end.
- However, if you know your partners may drop out, you will do a better job yourself.
- At testing time, no one actually drops out, so the results end up being good.
Dropout is a kind of ensemble
Network Structure
CNN is a very good example!
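As a rough illustration (my own example, not from the notes), a small Keras CNN shows how the network structure itself can encode knowledge about the problem, here images:

```python
from tensorflow import keras

# Convolution and pooling build local-pattern and translation priors into the
# structure; layer counts and filter sizes here are arbitrary placeholders.
model = keras.Sequential([
    keras.layers.Conv2D(25, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(50, (3, 3), activation="relu"),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
model.summary()
```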
References