1. What is linear regression?
- Definition: linear regression is an analysis method that models the relationship between one or more independent variables (features) and a dependent variable (target) with a regression equation (function).
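In other words, the model predicts the target as a weighted sum of the features plus a bias. A minimal sketch with made-up weights (the values here are purely illustrative):

```python
import numpy as np

# Two features, hypothetical weights w and bias b (illustrative values)
w = np.array([0.5, -1.2])   # regression coefficients
b = 3.0                     # bias / intercept
x = np.array([2.0, 1.0])    # one sample's feature values

y_pred = w @ x + b          # h(x) = w1*x1 + w2*x2 + b
print(y_pred)               # 0.5*2 - 1.2*1 + 3 = 2.8
```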
2. Loss function and optimization principle
- Loss function: least squares (minimize the sum of squared errors)
- Optimization algorithms
  - Normal equation
  - Gradient descent (Gradient Descent)
- Gradient descent vs. the normal equation
| Gradient descent | Normal equation |
| --- | --- |
| Requires choosing a learning rate | No learning rate needed |
| Solves iteratively | Solved in one computation |
| Usable when the number of features is large | Must solve the equation directly; time complexity is O(n³) |
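Both optimizers in the comparison above can be sketched directly in NumPy on a toy dataset (a minimal illustration, not scikit-learn's actual implementation):

```python
import numpy as np

# Toy data: y = 3*x + 2 with a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.01, size=100)

# Append a column of ones so the intercept is learned as a weight
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equation: w = (X^T X)^{-1} X^T y  -- one shot, but O(n^3) in features
w_ne = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y

# Gradient descent: repeatedly step against the gradient of the MSE
w_gd = np.zeros(2)
lr = 0.1                     # the learning rate we must choose
for _ in range(2000):        # and the iterations we must run
    grad = 2 / len(y) * Xb.T @ (Xb @ w_gd - y)
    w_gd -= lr * grad

print(w_ne)  # close to [3, 2]
print(w_gd)  # converges to (nearly) the same solution
```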
- How to choose
  - Small datasets: LinearRegression (does not address overfitting) or ridge regression
  - Large datasets: SGDRegressor
3. Linear regression API
- sklearn.linear_model.LinearRegression(fit_intercept=True)
  - Optimized via the normal equation
  - fit_intercept: whether to fit the intercept (bias)
  - LinearRegression.coef_: regression coefficients
  - LinearRegression.intercept_: bias (intercept)
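A quick sanity check of these attributes on a tiny hand-made dataset where the true relationship is y = 2x + 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Four points that lie exactly on y = 2*x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression(fit_intercept=True)
model.fit(X, y)
print(model.coef_)       # [2.]
print(model.intercept_)  # 1.0
```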
- sklearn.linear_model.SGDRegressor(loss='squared_loss', fit_intercept=True, learning_rate='invscaling', eta0=0.01)
  - The SGDRegressor class implements stochastic gradient descent learning; it supports different loss functions and regularization penalties for fitting linear regression models
  - loss: loss type
    - loss='squared_loss': ordinary least squares
  - fit_intercept: whether to fit the intercept
  - learning_rate: string, optional
    - Learning-rate schedule; default eta0=0.01
    - 'constant': eta = eta0
    - 'optimal': eta = 1.0 / (alpha * (t + t0))
    - 'invscaling': eta = eta0 / pow(t, power_t) [default for SGDRegressor]
      - power_t=0.25 by default (defined in the parent class)
    - For a constant learning rate, set learning_rate='constant' and use eta0 to specify the rate
  - SGDRegressor.coef_: regression coefficients
  - SGDRegressor.intercept_: bias (intercept)
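A short sketch of fitting with a constant learning rate, as described above. (Note: in newer scikit-learn releases the loss name 'squared_loss' was renamed to 'squared_error', so this example just keeps the default loss.)

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Noiseless toy data: y = 4*x - 1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 4 * X[:, 0] - 1

# learning_rate='constant' means eta = eta0 on every update
sgd = SGDRegressor(learning_rate='constant', eta0=0.01,
                   max_iter=1000, tol=1e-6, random_state=0)
sgd.fit(X, y)
print(sgd.coef_, sgd.intercept_)  # approximately [4.] and [-1.]
```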
4. Regression performance evaluation
- Mean squared error (MSE):
  - sklearn.metrics.mean_squared_error(y_true, y_pred)
    - Mean squared error regression loss
    - y_true: ground-truth values
    - y_pred: predicted values
    - return: a float
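MSE is simply the mean of the squared differences between true and predicted values, which is easy to verify against the sklearn function:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE = mean((y_true - y_pred)^2)
manual = np.mean((y_true - y_pred) ** 2)
print(manual)                              # 0.375
print(mean_squared_error(y_true, y_pred))  # 0.375
```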
5. Case study: Boston housing price prediction
- Load the dataset
- Split the dataset
- Feature engineering: standardization (removing scale differences)
- Estimator workflow
- Model evaluation
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error

# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this script requires an older scikit-learn version.

def linear1():
    '''
    Predict Boston housing prices using the normal-equation optimizer
    :return:
    '''
    # 1. Load the data
    boston = load_boston()
    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # x_train / x_test: training / test features
    # y_train / y_test: training / test targets
    # 3. Standardize
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Estimator
    estimator = LinearRegression()
    estimator.fit(x_train, y_train)
    # 5. Inspect the model
    print('Normal-equation coefficients:\n', estimator.coef_)
    print('Normal-equation intercept:\n', estimator.intercept_)
    # 6. Evaluate the model
    y_predict = estimator.predict(x_test)
    print('Predicted prices:\n', y_predict)
    error = mean_squared_error(y_test, y_predict)
    print('Normal-equation MSE:\n', error)
    return None

def linear2():
    '''
    Predict Boston housing prices using gradient descent
    :return:
    '''
    # 1. Load the data
    boston = load_boston()
    print('Number of features:\n', boston.data.shape)
    # 2. Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)
    # 3. Standardize
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Estimator
    estimator = SGDRegressor()
    # Parameters can be tuned, e.g.:
    # estimator = SGDRegressor(learning_rate='constant', eta0=0.01, max_iter=10000, penalty='l1')
    # The default penalty='l2' amounts to ridge regression, but the Ridge class
    # additionally implements the SAG solver
    estimator.fit(x_train, y_train)
    # 5. Inspect the model
    print('Gradient-descent coefficients:\n', estimator.coef_)
    print('Gradient-descent intercept:\n', estimator.intercept_)
    # 6. Evaluate the model
    y_predict = estimator.predict(x_test)
    print('Predicted prices:\n', y_predict)
    error = mean_squared_error(y_test, y_predict)
    print('Gradient-descent MSE:\n', error)
    return None

if __name__ == '__main__':
    linear1()
    linear2()
Output:
Normal-equation coefficients:
[-0.64817766 1.14673408 -0.05949444 0.74216553 -1.95515269 2.70902585
-0.07737374 -3.29889391 2.50267196 -1.85679269 -1.75044624 0.87341624
-3.91336869]
Normal-equation intercept:
22.62137203166228
Predicted prices:
[28.22944896 31.5122308 21.11612841 32.6663189 20.0023467 19.07315705
21.09772798 19.61400153 19.61907059 32.87611987 20.97911561 27.52898011
15.54701758 19.78630176 36.88641203 18.81202132 9.35912225 18.49452615
30.66499315 24.30184448 19.08220837 34.11391208 29.81386585 17.51775647
34.91026707 26.54967053 34.71035391 27.4268996 19.09095832 14.92742976
30.86877936 15.88271775 37.17548808 7.72101675 16.24074861 17.19211608
7.42140081 20.0098852 40.58481466 28.93190595 25.25404307 17.74970308
38.76446932 6.87996052 21.80450956 25.29110265 20.427491 20.4698034
17.25330064 26.12442519 8.48268143 27.50871869 30.58284841 16.56039764
9.38919181 35.54434377 32.29801978 21.81298945 17.60263689 22.0804256
23.49262401 24.10617033 20.1346492 38.5268066 24.58319594 19.78072415
13.93429891 6.75507808 42.03759064 21.9215625 16.91352899 22.58327744
40.76440704 21.3998946 36.89912238 27.19273661 20.97945544 20.37925063
25.3536439 22.18729123 31.13342301 20.39451125 23.99224334 31.54729547
26.74581308 20.90199941 29.08225233 21.98331503 26.29101202 20.17329401
25.49225305 24.09171045 19.90739221 16.35154974 15.25184758 18.40766132
24.83797801 16.61703662 20.89470344 26.70854061 20.7591883 17.88403312
24.28656105 23.37651493 21.64202047 36.81476219 15.86570054 21.42338732
32.81366203 33.74086414 20.61688336 26.88191023 22.65739323 17.35731771
21.67699248 21.65034728 27.66728556 25.04691687 23.73976625 14.6649641
15.17700342 3.81620663 29.18194848 20.68544417 22.32934783 28.01568563
28.58237108]
Normal-equation MSE:
20.627513763095397
Number of features:
(506, 13)
Gradient-descent coefficients:
[-0.5387057 0.98812962 -0.3836621 0.77712463 -1.79806072 2.76223373
-0.14034882 -3.2027763 1.79438685 -1.06765596 -1.72835465 0.85346013
-3.88627076]
Gradient-descent intercept:
[22.62428314]
Predicted prices:
[28.22463074 31.60053474 21.43190434 32.64228786 20.14177224 19.02837864
21.36401176 19.44415956 19.65878271 32.76309243 21.35890592 27.30154068
15.58835081 19.87398852 36.86567274 18.79653886 9.5647461 18.58924004
30.69291691 24.30975923 19.00772442 34.04867407 29.49856565 17.47627103
34.77817751 26.57538259 34.45294138 27.39673323 19.10235502 15.58849417
30.81884325 14.70932016 37.47925195 8.67089057 16.36541171 16.94229467
7.75518269 19.8271725 40.44600662 29.07409561 25.23991422 17.77295095
38.97079067 6.82180106 21.62257636 25.11239397 20.71025984 20.56078841
17.10131003 26.21643954 9.51261221 27.25480468 30.5788768 16.6885077
9.56993647 35.45509285 31.73501304 22.77276156 17.59978656 21.8706463
23.64477447 24.00814091 20.29560686 38.19606795 25.55395526 19.66517584
14.07775506 6.814339 42.1973785 21.89096222 16.8690623 22.51385589
40.77314925 21.7175092 36.84437262 27.17328693 21.66541892 20.74223503
25.32633238 23.46535274 31.41781941 20.28001469 23.96949932 31.54091728
27.14559167 20.8536532 29.11075856 21.91638293 26.63796157 18.99296673
25.39440915 24.01178618 19.85863199 17.51059371 15.42774908 18.31334289
24.66246227 16.71024419 20.69586315 26.70813475 20.6909886 17.90721535
24.16265733 23.28769447 20.49811805 36.60909837 15.98096152 22.34796787
32.70744198 33.64115246 20.58050466 26.16926031 23.03338967 17.71705189
21.50696067 21.8224659 27.5527674 25.25148377 23.684361 14.51800848
15.62355351 3.83890173 29.24555856 20.60086296 22.35743187 28.04365028
28.40251581]
Gradient-descent MSE:
21.054676473568634