In the gradient descent procedure introduced earlier, every parameter update requires computing over all samples: the gradient direction is obtained from the entire dataset. This method is called batch gradient descent (BGD, Batch Gradient Descent), and it becomes very slow when the dataset is large.
To address this drawback there is a better approach: stochastic gradient descent (SGD), which updates the parameters using a single sample per iteration. Although an individual step does not necessarily move the loss toward the global optimum, the overall trend does, and the final result usually lands near the global optimum. Compared with batch gradients this is much faster, which is a trade-off we can accept. Let's learn stochastic gradient descent!
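The two update rules can be written side by side (matching the `dJ` and `dJ_sgd` functions implemented below, with learning rate $\eta$, design matrix $X_b$, and $m$ samples):

```latex
% Batch gradient descent: every step uses all m samples
\theta \leftarrow \theta \;-\; \eta \cdot \frac{2}{m}\, X_b^{T} \left( X_b \theta - y \right)

% Stochastic gradient descent: each step uses one randomly chosen sample i,
% with a decaying learning rate \eta_t
\theta \leftarrow \theta \;-\; \eta_t \cdot 2\, x_b^{(i)} \left( x_b^{(i)} \cdot \theta - y^{(i)} \right)
```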
1. Comparing BGD and SGD
1.1 Batch gradient descent code
import numpy as np
import matplotlib.pyplot as plt

m = 100000
x = np.random.normal(size=m)
X = x.reshape(-1, 1)
y = 4. * x + 3. + np.random.normal(0, 3, size=m)

def J(theta, X_b, y):
    try:
        return np.sum((y - X_b.dot(theta)) ** 2) / len(y)
    except OverflowError:
        return float('inf')

def dJ(theta, X_b, y):
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)

def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-6):
    theta = initial_theta
    cur_iter = 0
    while cur_iter < n_iters:
        gradient = dJ(theta, X_b, y)
        last_theta = theta
        theta = theta - eta * gradient
        if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
            break
        cur_iter += 1
    return theta

%%time
X_b = np.hstack([np.ones((len(X), 1)), X])
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01
theta = gradient_descent(X_b, y, initial_theta, eta)
Wall time: 1.73 s

(Note: the cell magic must be `%%time`; a bare `%time` on its own line times an empty statement and reports 0 ns.)

theta
array([2.98334292, 3.97498524])
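As a sanity check (a sketch of my own, not part of the original post; the seed `666` is an arbitrary choice for reproducibility), the same θ can be recovered in closed form via the normal equation, and it should agree with the gradient-descent result above up to sampling noise:

```python
import numpy as np

np.random.seed(666)  # arbitrary seed, only for reproducibility
m = 100000
x = np.random.normal(size=m)
X_b = np.hstack([np.ones((m, 1)), x.reshape(-1, 1)])
y = 4. * x + 3. + np.random.normal(0, 3, size=m)

# Normal equation: solve (X^T X) theta = X^T y
theta_exact = np.linalg.solve(X_b.T.dot(X_b), X_b.T.dot(y))
print(theta_exact)  # should be close to [3., 4.]
```

With 100 000 samples and noise of standard deviation 3, the estimates differ from the true `[3, 4]` by roughly `3 / sqrt(100000) ≈ 0.01`.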
1.2 Stochastic gradient descent
# Instead of the whole matrix X_b, pass a single row X_b_i and its matching target y_i
def dJ_sgd(theta, X_b_i, y_i):
    return X_b_i.T.dot(X_b_i.dot(theta) - y_i) * 2.

def sgd(X_b, y, initial_theta, n_iters):
    t0 = 5
    t1 = 50

    # Decaying learning rate: large steps early on, smaller steps as we converge
    def learning_rate(cur_iter):
        return t0 / (cur_iter + t1)

    theta = initial_theta
    for cur_iter in range(n_iters):
        # Pick one sample at random (by index)
        rand_i = np.random.randint(len(X_b))
        gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])
        theta = theta - learning_rate(cur_iter) * gradient
    return theta
%%time
X_b = np.hstack([np.ones((len(X), 1)), X])
initial_theta = np.zeros(X_b.shape[1])
theta = sgd(X_b, y, initial_theta, n_iters=len(X_b)//3)
print(theta)
[2.97449903 3.99325008]
Wall time: 300 ms
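The `t0 / (t + t1)` schedule is worth pausing on: with a fixed learning rate, the noisy single-sample gradients would keep the parameters jittering around the minimum, so the step size is shrunk as iterations accumulate. A minimal illustration with the same constants as in `sgd()` above:

```python
t0, t1 = 5, 50  # same constants as in sgd() above

def learning_rate(t):
    return t0 / (t + t1)

# Step size shrinks smoothly as the iteration count grows
print(learning_rate(0))    # 0.1   on the first step
print(learning_rate(50))   # 0.05  halved after 50 steps
print(learning_rate(950))  # 0.005
```

Adding `t1` to the denominator keeps the earliest steps from being enormous, and `t0` sets the overall scale.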
1.3 Analysis
From this simple example we can see that batch gradient descent took 1.73 s, while stochastic gradient descent took only 300 ms, even though it looked at only a third as many samples as the dataset contains. The result is not exact, but it lands close to the minimum, which is often good enough.
2. Implementation
2.1 Improved code
def fit_sgd(self, X_train, y_train, n_iters=50, t0=5, t1=50):
    """Train the Linear Regression model on X_train, y_train using stochastic gradient descent."""
    assert X_train.shape[0] == y_train.shape[0], \
        "the size of X_train must be equal to the size of y_train"
    assert n_iters >= 1

    def dJ_sgd(theta, X_b_i, y_i):
        return X_b_i * (X_b_i.dot(theta) - y_i) * 2.

    def sgd(X_b, y, initial_theta, n_iters=5, t0=5, t1=50):
        def learning_rate(t):
            return t0 / (t + t1)

        theta = initial_theta
        m = len(X_b)
        for i_iter in range(n_iters):
            # Shuffle the data once per epoch; iterating in order over the
            # shuffled copy is equivalent to sampling without replacement
            indexes = np.random.permutation(m)
            X_b_new = X_b[indexes, :]
            y_new = y[indexes]
            for i in range(m):
                gradient = dJ_sgd(theta, X_b_new[i], y_new[i])
                theta = theta - learning_rate(i_iter * m + i) * gradient
        return theta

    X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
    initial_theta = np.random.randn(X_b.shape[1])
    self._theta = sgd(X_b, y_train, initial_theta, n_iters, t0, t1)
    self.intercept_ = self._theta[0]
    self.coef_ = self._theta[1:]
    return self
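The per-epoch shuffle above relies on `np.random.permutation` producing one index order that is applied to both `X_b` and `y`, so every row stays paired with its own target. A tiny self-contained check (toy data of my own, not from the post):

```python
import numpy as np

X_b = np.arange(12).reshape(6, 2)  # toy design matrix: row i is [2i, 2i+1]
y = np.arange(6) * 10              # toy targets: y[i] = 10*i pairs with row i

indexes = np.random.permutation(len(X_b))
X_b_new = X_b[indexes, :]
y_new = y[indexes]

# Each shuffled row still pairs with its original target:
# y_new[i] = 10 * indexes[i] and X_b_new[i, 0] = 2 * indexes[i]
for i in range(len(y_new)):
    assert y_new[i] == X_b_new[i, 0] * 5
```

Shuffling both arrays with one index array (rather than calling `shuffle` on each separately) is exactly what keeps the sample/target pairing intact.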
2.2 Using our own SGD
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

boston = datasets.load_boston()
X = boston.data
y = boston.target
X = X[y < 50.0]
y = y[y < 50.0]

from myAlgorithm.LinearRegression import LinearRegression
from myAlgorithm.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)

standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train_std = standardScaler.transform(X_train)
X_test_std = standardScaler.transform(X_test)

lin_reg1 = LinearRegression()
lin_reg1.fit_sgd(X_train_std, y_train, n_iters=2)
lin_reg1.score(X_test_std, y_test)
"""
Output:
0.78651716204682975
"""
Increasing n_iters gives a better result:
lin_reg1.fit_sgd(X_train_std, y_train, n_iters=100)
lin_reg1.score(X_test_std, y_test)
"""
Output:
0.81294846132723497
"""
2.3 SGD in sklearn
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor()  # older sklearn defaulted to n_iter=5; newer versions use max_iter
%time sgd_reg.fit(X_train_std, y_train)
sgd_reg.score(X_test_std, y_test)
Very fast! Increasing the number of iterations improves the score.
3. Summary
Batch gradient descent (BGD, Batch Gradient Descent):
- Pros: converges to the global optimum (for convex cost functions); easy to parallelize.
- Cons: when there are many samples, each step is expensive and training is slow.
To address these drawbacks there is a better approach: stochastic gradient descent (SGD), which updates the parameters using a single sample per iteration.
- Pros: fast to compute.
- Cons: weaker convergence behavior; the iterates oscillate around the optimum.