在之前介绍的梯度下降法的步骤中,在每次更新参数时是需要计算所有样本的,通过对整个数据集的所有样本的计算来求解梯度的方向。这种计算方法被称为:批量梯度下降法BGD(Batch Gradient Descent)。但是这种方法在数据量很大时需要计算很久。

针对该缺点,有一种更好的方法:随机梯度下降法SGD(stochastic gradient descent),随机梯度下降是每次迭代使用一个样本来对参数进行更新。虽然不是每次迭代得到的损失函数都向着全局最优方向,但是大的整体的方向是向全局最优解的,最终的结果往往是在全局最优解附近。但是相比于批量梯度,这样的方法更快,我们也是可以接受的。下面就来学学随机梯度下降法吧!

1.BGD&SGD效果比较

1.1 批量梯度下降法代码

import numpy as np
import matplotlib.pyplot as plt

m = 100000
x = np.random.normal(size=m)
X = x.reshape(-1,1)
y = 4. * x + 3. + np.random.normal(0, 3, size=m)

def J(theta, X_b, y):
    try:
        return np.sum((y-X_b.dot(theta)) ** 2) / len(y)
    except:
        return float('inf')

def dJ(theta, X_b, y):
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)

def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-6):
    theta = initial_theta
    cur_iter = 0
    while(cur_iter < n_iters):
        gradient = dJ(theta, X_b, y)
        last_theta = theta
        theta = theta - eta * gradient
        if (abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon):
            break
        cur_iter += 1
    return theta

%time
X_b = np.hstack([np.ones((len(X),1)),X])
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01
theta = gradient_descent(X_b, y, initial_theta, eta)
theta
Wall time: 0 ns





array([2.98334292, 3.97498524])

1.2 随机梯度下降

# 传递的不是整个矩阵X_b,而是其中一行X_b_i;传递y其中的一个数值y_i
def dJ_sgd(theta, X_b_i, y_i):
    return X_b_i.T.dot(X_b_i.dot(theta) - y_i) * 2.

def sgd(X_b, y, initial_theta, n_iters):
    t0 = 5
    t1 = 50
    # 
    def learning_rate(cur_iter):
        return t0 / (cur_iter + t1)
    theta = initial_theta

    for cur_iter in range(n_iters):
        # 随机找到一个样本(得到其索引)
        rand_i = np.random.randint(len(X_b))
        gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])
        theta = theta - learning_rate(cur_iter) * gradient
    return theta

%time
X_b = np.hstack([np.ones((len(X),1)),X])
initial_theta = np.zeros(X_b.shape[1])
theta = sgd(X_b, y, initial_theta, n_iters=len(X_b)//3)
print(theta)
Wall time: 0 ns
[2.97449903 3.99325008]

1.3 分析

通过简单的例子,我们可以看出,在批量梯度下降的过程中消耗时间1.73 s,而在随机梯度下降中,只使用300 ms。并且随机梯度下降中只考虑的三分一的样本量,且得到的结果一定达到了局部最小值的范围内。

2.代码实现

2.1 改进后的代码

def fit_sgd(self, X_train, y_train, n_iters=50, t0=5, t1=50):
        """根据训练数据集X_train, y_train, 使用梯度下降法训练Linear Regression模型"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        assert n_iters >= 1

        def dJ_sgd(theta, X_b_i, y_i):
            return X_b_i * (X_b_i.dot(theta) - y_i) * 2.

        def sgd(X_b, y, initial_theta, n_iters=5, t0=5, t1=50):

            def learning_rate(t):
                return t0 / (t + t1)

            theta = initial_theta
            m = len(X_b)
            for i_iter in range(n_iters):
                # 将原本的数据随机打乱,然后再按顺序取值就相当于随机取值
                indexes = np.random.permutation(m)
                X_b_new = X_b[indexes,:]
                y_new = y[indexes]
                for i in range(m):
                    gradient = dJ_sgd(theta, X_b_new[i], y_new[i])
                    theta = theta - learning_rate(i_iter * m + i) * gradient

            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.random.randn(X_b.shape[1])
        self._theta = sgd(X_b, y_train, initial_theta, n_iters, t0, t1)
        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

2.2 使用自己的SGD

import numpy as np
from sklearn import datasets

boston = datasets.load_boston()
X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]

from myAlgorithm.LinearRegression import LinearRegression
from myAlgorithm.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)

standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train_std = standardScaler.transform(X_train)
X_test_std = standardScaler.transform(X_test)
lin_reg1 = LinearRegression()
lin_reg1.fit_sgd(X_train, y_train, n_iters=2)
lin_reg1.score(X_test_std, y_test)
"""
输出:
0.78651716204682975
"""

通过增加python随机梯度下降伪代码_迭代的值,可以获得更好的结果:

lin_reg1.fit_sgd(X_train, y_train, n_iters=100)
lin_reg1.score(X_test_std, y_test)
"""
输出:
0.81294846132723497
"""

2.3 sklearn中的SGD

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor()    # 默认n_iter=5
%time sgd_reg.fit(X_train_std, y_train)
sgd_reg.score(X_test_std, y_test)

速度非常快!增加迭代次数,可以提升效果

3.总结

批量梯度下降法BGD(Batch Gradient Descent)。

  • 优点:全局最优解;易于并行实现;
  • 缺点:当样本数据很多时,计算量开销大,计算速度慢。
    针对于上述缺点,其实有一种更好的方法:随机梯度下降法SGD(stochastic gradient descent),随机梯度下降是每次迭代使用一个样本来对参数进行更新。
  • 优点:计算速度快;
  • 缺点:收敛性能不好