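Contents
I. Definition of linear regression
II. Loss function (size of the error)
1. Normal equation
2. Gradient descent
3. Gradient descent vs. the normal equation
III. Linear regression: the Boston housing price case study
IV. Overfitting and underfitting
1. Causes of underfitting and how to fix it
2. Causes of overfitting and how to fix it
V. Linear regression with regularization: ridge regression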
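I. Definition of linear regression
Linear regression is a form of regression analysis that models the relationship between one or more independent variables and a dependent variable. Its defining feature is that the model is a linear combination of one or more model parameters, called regression coefficients.
The linear regressor is the simplest and easiest-to-use regression model. Its linear form limits where it can be applied to some extent; even so, when the relationship between the features is unknown, a linear regressor is still the first choice for most systems.
II. Loss function (size of the error)
The loss measures how far the predictions are from the true values; the goal is to make it as small as possible.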
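Written out, and assuming the usual least-squares formulation for linear regression, the loss is the sum of squared differences between the predictions and the true values. The symbols $m$, $x_i$, $y_i$ and $h_w$ are notation introduced here for illustration:

```latex
J(w) = \sum_{i=1}^{m} \left( h_w(x_i) - y_i \right)^2,
\qquad h_w(x_i) = w^\top x_i
```

Here $m$ is the number of training samples, $x_i$ is the feature vector of sample $i$ (with a constant 1 appended if a bias term is used), $y_i$ is its true value, and $w$ holds the regression coefficients.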
1. Normal equation
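The normal equation solves the least-squares problem in one step, w = (XᵀX)⁻¹Xᵀy. The sketch below is a minimal NumPy illustration on made-up, noise-free data; the function name `normal_equation` and the toy coefficients are assumptions for the example, not part of any library.

```python
import numpy as np

def normal_equation(X, y):
    """Solve least squares in closed form: (X^T X) w = X^T y."""
    # Append a column of ones so that the last weight acts as the bias term.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    # Solving the linear system avoids forming the explicit matrix inverse.
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Toy data generated from y = 2*x1 - 3*x2 + 5 without noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 5

w = normal_equation(X, y)
print(np.round(w, 3))  # two feature weights followed by the bias
```

Because the data are noise-free, the recovered weights match the generating coefficients up to floating-point error. The cost of solving the system grows quickly with the number of features, which is why this approach suits smaller problems.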
2. Gradient descent
w denotes the regression coefficients.
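Gradient descent reaches the same minimum iteratively: starting from some w, it repeatedly moves w a small step against the gradient of the loss. The following is a minimal batch gradient descent sketch on the mean of the squared errors; the function name, the learning rate `lr` and the iteration count `n_iters` are illustrative choices, not values taken from the text.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Minimise the mean squared error by batch gradient descent."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # bias column
    w = np.zeros(Xb.shape[1])
    m = len(y)
    for _ in range(n_iters):
        residual = Xb @ w - y                  # prediction error per sample
        grad = (2.0 / m) * (Xb.T @ residual)   # gradient of the mean squared error
        w -= lr * grad                         # step against the gradient
    return w

# Toy data generated from y = 2*x1 - 3*x2 + 5 without noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 5

w_gd = gradient_descent(X, y)
print(np.round(w_gd, 3))  # two feature weights followed by the bias
```

If the learning rate is too large the iteration diverges, and if it is too small it converges slowly, so in practice it has to be tuned.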
3. Gradient descent vs. the normal equation
Small datasets: LinearRegression (which cannot address overfitting) and other estimators
Large datasets: SGDRegressor
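As a rough illustration of the two estimators side by side, the sketch below fits both on synthetic data and compares the learned coefficients. The data, the coefficients and `random_state=0` are assumptions for the example; on a well-conditioned problem like this one the two solutions should be close but not identical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=1000)

# SGD is sensitive to feature scale, so standardize first
X_std = StandardScaler().fit_transform(X)

lr = LinearRegression().fit(X_std, y)             # direct least-squares solution
sgd = SGDRegressor(random_state=0).fit(X_std, y)  # iterative, one sample at a time

print("LinearRegression coef_:", lr.coef_)
print("SGDRegressor coef_:    ", sgd.coef_)
```

The direct solver needs the whole dataset in memory and scales poorly with many features, while SGDRegressor trades a little accuracy for the ability to handle much larger datasets.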
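III. Linear regression: the Boston housing price case study
•sklearn.linear_model.LinearRegression
•Normal equation
•sklearn.linear_model.SGDRegressor
•Gradient descent
•sklearn.metrics.mean_squared_error
•mean_squared_error(y_true, y_pred) Note: both the true values and the predicted values must be on the original scale, i.e. before standardization
•Mean squared error regression loss
•y_true: true values
•y_pred: predicted values
•return: a floating-point result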
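A small worked example of `mean_squared_error`, using made-up numbers so the result can be checked by hand:

```python
from sklearn.metrics import mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]   # true values (made-up)
y_pred = [2.5, 5.0, 4.0, 8.0]   # predicted values (made-up)

# Squared errors: 0.25, 0.0, 2.25, 1.0 -> mean = 3.5 / 4 = 0.875
mse = mean_squared_error(y_true, y_pred)
print(mse)
```

The result is in squared units of the target, which is why the predictions in the case study are mapped back to the original price scale before the error is computed.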
from sklearn.datasets import load_boston  # load the data (load_boston was removed in scikit-learn 1.2, so an older version is needed)
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge  # normal equation, gradient descent, ridge regression
from sklearn.model_selection import train_test_split  # split the data
from sklearn.preprocessing import StandardScaler  # standardize the data
from sklearn.metrics import mean_squared_error  # mean squared error


def mylinear():
    lb = load_boston()
    x_train, x_test, y_train, y_test = train_test_split(lb.data, lb.target, test_size=0.25)
    print(y_train, y_test)

    # Instantiate two scalers: one for the features, one for the target
    std_x = StandardScaler()
    x_train = std_x.fit_transform(x_train)
    x_test = std_x.transform(x_test)
    std_y = StandardScaler()
    y_train = std_y.fit_transform(y_train.reshape(-1, 1))
    y_test = std_y.transform(y_test.reshape(-1, 1))

    # Normal equation. It is prone to overfitting because it fits the training
    # set as closely as possible; regularization (ridge regression) addresses this.
    lr = LinearRegression()
    lr.fit(x_train, y_train)
    print(lr.coef_)  # coef_ holds the regression coefficients
    # inverse_transform maps standardized values back to the original scale
    y_lr_predict = std_y.inverse_transform(lr.predict(x_test))
    print("Normal equation predicted prices for each house in the test set:", y_lr_predict)
    print("Normal equation mean squared error:", mean_squared_error(std_y.inverse_transform(y_test), y_lr_predict))

    # Gradient descent. y_train is a column vector here, which triggers the
    # DataConversionWarning seen in the output; y_train.ravel() would avoid it.
    sgd = SGDRegressor()
    sgd.fit(x_train, y_train)
    print(sgd.coef_)
    y_sgd_predict = std_y.inverse_transform(sgd.predict(x_test))
    print("Gradient descent predicted prices for each house in the test set:", y_sgd_predict)
    # First argument: true values; second argument: predicted values
    print("Gradient descent mean squared error:", mean_squared_error(std_y.inverse_transform(y_test), y_sgd_predict))

    # Ridge regression
    rd = Ridge(alpha=1)
    rd.fit(x_train, y_train)
    print(rd.coef_)
    y_rd_predict = std_y.inverse_transform(rd.predict(x_test))
    print("Ridge regression predicted prices for each house in the test set:", y_rd_predict)
    print("Ridge regression mean squared error:", mean_squared_error(std_y.inverse_transform(y_test), y_rd_predict))
    return None


if __name__ == "__main__":
    mylinear()
/Users/lichengxiang/opt/anaconda3/bin/python /Users/lichengxiang/Desktop/python/机器学习/线性回归-波士顿房价数据案例.py
[16.7 22. 21.2 24. 20. 18.4 36.5 26.2 30.5 48.8 13. 21.7 22.1 50.
24.5 23.8 33.8 29.8 17.2 36. 13.8 16.5 22.6 19.2 23.1 27.9 13.3 26.6
14.5 29. 19.8 22.4 18.5 22.3 5. 20.3 15.4 16.2 18.8 31.6 34.9 23.8
25. 27.1 19.9 13.1 25. 22.7 17.1 15.6 25. 13.1 11.7 20.7 13.4 22.5
35.4 25.1 20.1 18.3 50. 23.1 21.6 22.9 7.2 26.4 21.4 19.3 13.8 33.2
14.8 24.1 50. 30.3 24. 22.6 19.6 23.6 22.3 19.3 24.1 21.4 18.4 21.8
48.3 13.6 19.6 50. 20.1 20. 50. 30.8 24.8 31.2 13.4 23.9 20.5 23.1
25.3 16.4 12.7 17.4 50. 41.3 22. 10.5 16.5 26.5 12.1 27.9 24.4 27.5
19.9 22.9 20.6 32.5 22.2 22. 13.8 50. 31.7 22. 26.6 20.6 22.2 22.9
19. 20.1 21.2 21.2 20. 32.7 20.4 8.5 13.5 19.9 19.2 23. 17.1 50.
19.5 23.7 19.1 15.4 15.2 23. 31.5 33.4 16.1 13.4 5.6 30.1 45.4 16.1
18. 16.2 14.5 20.7 10.4 13.9 12.7 29.6 20.3 30.7 23.4 21.9 11.8 19.3
14. 17.4 24.8 19.8 33.4 21.4 20. 22. 17.4 23.9 17.5 21.1 28.4 19.4
23.1 22.2 21.4 8.7 24.6 35.1 31.6 33.3 12. 18.9 21.2 23.5 12.7 16.6
20.3 25. 26.7 36.2 20.6 20.5 24.3 8.3 26.4 24.4 16.1 44. 23.7 17.
23.8 35.2 50. 23.3 42.3 18.9 43.5 14.4 17.9 24.7 21.7 34.6 33.2 21.
21.2 24.3 20.2 16.7 18.7 32.2 19.5 17.2 23.6 29.4 25.2 17.8 43.8 14.3
23.2 18.2 17.1 22.7 19.3 11.5 15.6 44.8 12.3 28.2 21.7 14.3 18.6 23.3
19.5 15.6 21.7 38.7 16.6 23. 50. 24.2 29.8 17.5 20.9 18.8 23.9 19.4
21.7 17.6 50. 36.4 23.2 24.7 22.8 13.8 36.1 20.8 22.5 19.1 23. 18.1
27.1 10.8 12.5 19.6 18.9 19.1 20.5 13.9 46. 15.3 5. 8.5 22.4 37.9
15. 18.3 23.2 20.4 50. 14.5 21. 20.6 21.6 13.1 20.8 15. 22.6 15.1
32.9 17.3 27.5 22.8 29.6 46.7 29.1 27.5 15.6 14.4 24.5 33.1 9.6 20.2
8.3 24.7 7. 18.6 23.9 17.7 28.7 12.8 19.4 19. 22.5 33.1 15.2 12.6
19.9 20. 11.7 30.1 11.9 20.8 29.1 23.1 23.1 37.3 22.8 50. 11.3 10.2
15.2 13.1 21.9 48.5 8.4 24.8 17.5 20.6 21.7 28. 13.3 24.4 19.7 16.8
13.4 7.4 11.9 20.3 10.2 31. 19.8 22. 8.8 8.8 8.4 29. 19.4 42.8
28.7] [ 7.5 50. 20.4 21.5 19.6 23.2 21. 9.7 16. 24.3 17.8 18.5 30.1 17.8
19.7 9.5 14.6 14.6 13.8 37.6 13.2 7. 14.9 34.7 41.7 10.4 19.4 24.1
22.2 33. 22.2 22.6 31.1 7.2 20.1 31.5 18.7 14.9 34.9 25. 25. 17.8
21.9 24.8 22. 43.1 23.8 15.6 14.2 37.2 21.8 10.9 26.6 10.2 15. 14.9
29.9 24.4 25. 27. 18.2 19.5 22.6 50. 20.4 8.1 32. 28.4 23.4 20.9
34.9 6.3 50. 23.3 28.1 18.5 35.4 19.1 25. 10.9 23.7 15.7 14.1 19.3
28.7 21.1 23.1 32.4 13.6 21.7 17.8 18.2 13.3 22.8 7.2 16.3 24.6 20.1
18.7 16.8 14.1 39.8 37. 18.4 11.8 28.6 24.5 32. 10.5 21.5 18.9 27.5
21.4 23.3 23.7 14.1 22.9 11. 17.2 20.6 19.4 28.5 23.9 19.6 13.5 36.2
18.5]
[[-0.09931161 0.11737198 0.06600954 0.05114192 -0.2353607 0.36107615
-0.03144825 -0.32936315 0.27988248 -0.24871072 -0.24014183 0.07154033
-0.33709119]]
Normal equation predicted prices for each house in the test set: [[14.90083863]
[21.45550438]
[20.23653511]
[24.3668897 ]
[19.46274317]
[22.47756857]
[20.74189741]
[ 9.23524909]
[17.91741986]
[28.83380428]
[16.80469416]
[18.75821888]
[34.56702034]
[10.14091632]
[13.64069403]
[13.98624382]
[ 8.73135537]
[19.07011958]
[ 0.35233438]
[38.03862163]
[ 9.56670764]
[ 8.54927019]
[16.18296215]
[30.87896973]
[38.85212256]
[ 7.22322342]
[25.19188637]
[20.18314555]
[24.28171927]
[23.62469733]
[23.48090667]
[25.766219 ]
[32.60859639]
[17.82873813]
[23.59214032]
[32.89955177]
[20.47162198]
[17.56849696]
[34.82094629]
[29.52754365]
[25.75724136]
[18.39497371]
[14.0717522 ]
[25.86066463]
[20.86052186]
[37.45539866]
[26.91734363]
[15.99033329]
[17.61816632]
[32.86184667]
[20.03891532]
[14.63594452]
[28.7427995 ]
[ 6.06279731]
[25.7024351 ]
[17.19987588]
[31.24590846]
[22.93399133]
[28.62044008]
[31.82958502]
[17.60965797]
[16.68738272]
[18.51205523]
[33.0657219 ]
[18.80112336]
[ 4.19924828]
[33.46461689]
[28.42551355]
[24.61923213]
[20.45912152]
[30.18610693]
[11.53715076]
[30.91746843]
[21.47142351]
[25.6826612 ]
[18.69988337]
[34.75113307]
[24.41473628]
[24.69527294]
[18.76883797]
[27.6427222 ]
[16.12522634]
[17.20024941]
[20.77106546]
[24.71297785]
[20.24208813]
[20.01943651]
[35.36633304]
[12.03449632]
[25.01542485]
[23.08886346]
[13.37950236]
[14.14423222]
[24.79191098]
[11.01749482]
[11.38485966]
[24.74395084]
[20.37415752]
[20.80827112]
[20.0334627 ]
[19.79940643]
[35.22354243]
[31.10450992]
[19.23473387]
[12.94368282]
[29.41871272]
[21.28549158]
[33.86038279]
[12.51828535]
[20.90196745]
[19.0124359 ]
[20.7542053 ]
[21.51651227]
[25.43503143]
[28.68887584]
[16.48398304]
[29.45069112]
[14.97352888]
[13.81917348]
[19.3395977 ]
[19.38475254]
[33.50424621]
[27.3866856 ]
[20.55355965]
[13.14629917]
[27.29860269]
[24.70098215]]
Normal equation mean squared error: 26.95743723532586
[-0.08089631 0.08305502 0.00229358 0.05560857 -0.15161084 0.38467429
-0.03671536 -0.27094109 0.12200439 -0.0960882 -0.22105136 0.070405
-0.32895531]
Gradient descent predicted prices for each house in the test set: [15.06976065 20.2088158 20.56443165 23.48270958 20.06533935 21.87181058
21.45740467 8.96173986 17.45932212 28.24853933 17.25296557 18.61733904
34.32338504 11.30541304 14.11611892 14.01242848 9.80556737 18.55829796
-0.31805963 37.75848531 10.20339725 10.48839829 16.63676785 31.02960175
38.86365423 7.18023406 24.57902454 22.12237201 24.04069773 24.47691231
23.23457248 25.3671255 32.36139915 17.83873905 23.07885329 32.72065246
20.95898498 17.86712541 35.03085631 27.96286506 25.41115816 19.06904428
13.64000355 25.59403668 21.26496644 37.47131136 26.57985251 17.1670414
17.60714806 32.35256306 20.43402566 15.03856806 28.70915626 5.79775491
25.81872063 17.52533782 31.16579385 22.69025099 27.96734679 31.47249699
17.94064388 16.98778237 18.983096 32.5931768 19.41389012 6.12464788
33.36100845 28.56882401 24.3118938 20.85406349 30.96642134 11.44976272
30.3795102 21.35442065 25.28772709 18.70735928 33.87163409 24.02124497
24.22526068 18.7165066 27.42611345 13.96933205 17.36393766 22.17081228
25.01689739 20.40488337 20.62161483 34.48539034 12.38544983 24.15394825
23.63437392 14.43584204 14.52905347 24.40170552 11.23839626 10.67413323
25.03250751 20.48892595 20.61806124 20.53682527 19.31354378 34.913274
31.76962899 19.94322788 13.44093825 28.85430871 21.59447647 32.87675188
12.94340026 22.19790307 19.33369303 20.26153806 20.99878358 26.80596899
28.27210333 16.77377911 27.93426915 14.89646913 12.87801003 18.82498733
19.86058352 32.18018918 27.85363854 20.13386669 13.41453229 26.5175422
25.10881009]
Gradient descent mean squared error: 27.177392442065077
/Users/lichengxiang/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py:63: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
return f(*args, **kwargs)
[[-0.09827997 0.1155179 0.06199445 0.05157814 -0.23116272 0.36169806
-0.03194549 -0.32582911 0.26970977 -0.23900371 -0.23877386 0.07141578
-0.33578102]]
Ridge regression predicted prices for each house in the test set: [[14.90773074]
[21.37212855]
[20.26531916]
[24.32346584]
[19.49530384]
[22.43500234]
[20.78356203]
[ 9.22918108]
[17.90212304]
[28.78531759]
[16.83654349]
[18.75944467]
[34.52845237]
[10.2128046 ]
[13.68753305]
[13.98639829]
[ 8.80147102]
[19.03569017]
[ 0.34647128]
[37.99319218]
[ 9.62943132]
[ 8.69129363]
[16.19751766]
[30.87588131]
[38.82111026]
[ 7.23172921]
[25.17786218]
[20.31168821]
[24.27561639]
[23.67519738]
[23.46634682]
[25.74535481]
[32.59964787]
[17.82054214]
[23.57272116]
[32.86969534]
[20.50859086]
[17.57304714]
[34.83085334]
[29.43579631]
[25.72751158]
[18.42649693]
[14.04430798]
[25.85251694]
[20.90097805]
[37.42657403]
[26.88681826]
[16.04640009]
[17.60671073]
[32.81989627]
[20.0687788 ]
[14.64916726]
[28.73793404]
[ 6.06398514]
[25.68435048]
[17.20756708]
[31.23069034]
[22.93444577]
[28.56360209]
[31.79840422]
[17.63810997]
[16.72205564]
[18.52233196]
[33.01348897]
[18.84145678]
[ 4.35016558]
[33.4610232 ]
[28.43471582]
[24.61006252]
[20.49898432]
[30.24335363]
[11.5407095 ]
[30.8657765 ]
[21.47037443]
[25.66066566]
[18.70553915]
[34.68903821]
[24.3860046 ]
[24.67017836]
[18.75708044]
[27.63381636]
[16.01015772]
[17.19966781]
[20.87008934]
[24.72985286]
[20.2543652 ]
[20.05939638]
[35.31614261]
[12.07475911]
[24.97035014]
[23.09754596]
[13.47050165]
[14.17844332]
[24.77721591]
[11.03350742]
[11.35175944]
[24.75993917]
[20.405458 ]
[20.79987687]
[20.06756633]
[19.76569775]
[35.18679097]
[31.14520474]
[19.27827231]
[12.96714836]
[29.38084052]
[21.31152874]
[33.79590755]
[12.53910656]
[20.95634851]
[19.03917432]
[20.71778686]
[21.47605149]
[25.48251652]
[28.66756915]
[16.49066919]
[29.36014145]
[14.96756718]
[13.7755368 ]
[19.30927495]
[19.42263635]
[33.42785144]
[27.41701648]
[20.53792962]
[13.15683563]
[27.2534482 ]
[24.7249328 ]]
Ridge regression mean squared error: 26.934821572805753
Process finished with exit code 0
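The DataConversionWarning in the output comes from passing a column-vector target of shape (n_samples, 1) to `SGDRegressor.fit`. A minimal sketch of one way to avoid it, using synthetic data rather than the housing data (the data and `random_state=0` are assumptions): keep the target 2-D for `StandardScaler`, but flatten it with `ravel()` for fitting.

```python
import warnings
import numpy as np
from sklearn.exceptions import DataConversionWarning
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=100)

std_y = StandardScaler()
y_col = std_y.fit_transform(y.reshape(-1, 1))  # shape (100, 1): the scaler needs 2-D input

sgd = SGDRegressor(random_state=0)
with warnings.catch_warnings():
    # Turn the warning into an error to show that none is raised below
    warnings.simplefilter("error", DataConversionWarning)
    sgd.fit(X, y_col.ravel())                  # shape (100,): the expected 1-D target

# predict() returns a 1-D array; reshape it for inverse_transform, then flatten again
y_pred = std_y.inverse_transform(sgd.predict(X).reshape(-1, 1)).ravel()
print(y_pred.shape)
```

Reshaping the prediction before `inverse_transform` also keeps the code working on scikit-learn versions that require 2-D input for the scaler.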
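IV. Overfitting and underfitting
Overfitting: a hypothesis fits the training data better than other hypotheses, but fails to fit data outside the training set well. Such a hypothesis is said to overfit. (The model is too complex.)
Underfitting: a hypothesis fits the training data poorly, and also fails to fit data outside the training set well. Such a hypothesis is said to underfit. (The model is too simple.)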
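The two definitions can be made concrete by comparing training and held-out error as model complexity grows. The sketch below fits polynomials of increasing degree to noisy samples of an assumed ground truth, sin(3x); the data, the degrees and the train/test split are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=30))
y = np.sin(3 * x) + rng.normal(scale=0.2, size=30)  # assumed ground truth plus noise

x_train, y_train = x[::2], y[::2]    # 15 training points
x_test, y_test = x[1::2], y[1::2]    # 15 held-out points

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

results = {}
for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, results[degree])  # (training error, held-out error)
```

The training error can only go down as the degree increases. An underfitting model tends to show high error on both sets, while an overfitting model tends to show a very low training error alongside a noticeably higher held-out error; the exact numbers depend on the random seed.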
1. Causes of underfitting and how to fix it
•Cause:
•Too few features of the data have been learned
•Fix:
•Increase the number of features
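One common way to add features is to derive them from existing ones. The sketch below assumes a quadratic ground truth (an illustrative choice): a straight line underfits it, and adding the squared feature with `PolynomialFeatures` gives the linear model enough capacity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = x[:, 0] ** 2 + rng.normal(scale=0.1, size=200)  # assumed quadratic ground truth

# A straight line cannot represent x^2, so the fit is poor even on the training data
r2_linear = LinearRegression().fit(x, y).score(x, y)

# Adding the derived feature x^2 lets the same linear model fit the curve
x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
r2_poly = LinearRegression().fit(x_poly, y).score(x_poly, y)

print(r2_linear, r2_poly)  # R^2 on the training data, before and after
```

The model is still linear in its coefficients; only the feature set has grown.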
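2. Causes of overfitting and how to fix it
•Cause:
•Too many raw features, some of them noisy; the model becomes overly complex because it tries to accommodate every training data point
•Fix:
•Perform feature selection and remove highly correlated features (hard to do in practice)
•Cross-validation (so that every sample is used for training at some point). Cross-validation measures how well the model fits the training data; if that fit is already poor, the model is most likely underfitting
•Regularization (for awareness)
Effect: it makes every element of W small, close to 0
Advantage: smaller parameters mean a simpler model, and a simpler model is less prone to overfitting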
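The shrinking effect of regularization can be seen directly. The sketch below uses the closed-form solution of L2-regularized least squares, w = (XᵀX + αI)⁻¹Xᵀy, without a bias term; the function name `ridge_closed_form`, the synthetic data and the alpha values are assumptions for illustration.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Minimise ||Xw - y||^2 + alpha * ||w||^2 in closed form (no bias term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=40)

norms = {}
for alpha in (0.0, 1.0, 10.0, 100.0):
    w = ridge_closed_form(X, y, alpha)
    norms[alpha] = float(np.linalg.norm(w))
    print(alpha, np.round(w, 3), round(norms[alpha], 3))
```

As alpha grows, the length of the weight vector shrinks toward 0. With alpha = 0 the penalty disappears and the result is the ordinary least-squares solution.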
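V. Linear regression with regularization: ridge regression
•sklearn.linear_model.Ridge(alpha=1.0)
•Linear least squares with L2 regularization
•alpha: regularization strength
•coef_: regression coefficients
•Ridge regression: the regression coefficients it produces tend to be more realistic and more reliable. It also narrows the range over which the estimated parameters fluctuate, making them more stable. It is of considerable practical value in studies where ill-conditioned data are common.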
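A short usage sketch of `Ridge` showing how `alpha` and `coef_` fit together, with cross-validation used to compare a few regularization strengths. The synthetic data and the candidate alpha values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=100)

for alpha in (0.1, 1.0, 10.0):
    rd = Ridge(alpha=alpha)
    # 5-fold cross-validation; scores are negated MSE, so closer to 0 is better
    scores = cross_val_score(rd, X, y, cv=5, scoring="neg_mean_squared_error")
    rd.fit(X, y)
    print(alpha, np.round(rd.coef_, 3), round(-scores.mean(), 4))
```

Choosing alpha by cross-validated error rather than by training error avoids always picking the weakest regularization.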