Python Machine Learning / Data Mining Project in Practice: Boston House Price Prediction with Regression Analysis
- The data come from a study of Boston house prices (Boston House Prices) published in an American economics journal.
- In this project, you will use housing data from the Boston suburbs in Massachusetts to train and test a model, and evaluate its performance and predictive power. A well-trained model can be used to make specific predictions about homes, in particular their value. Such a predictive model has proven valuable in the daily work of real-estate agents and others.
Dataset Description
- Each row in the dataset describes housing conditions in a Boston-area town or suburb. The variables are described below.
CRIM: per-capita crime rate by town
ZN: proportion of residential land zoned for large lots (over 25,000 sq. ft.)
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built before 1940
DIS: weighted distance to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT: percentage of the population with lower socioeconomic status
MEDV: median value of owner-occupied homes in $1000's (the target)
- The original description (DESCR):
print(boston_data['DESCR'])
Boston House Prices dataset
===========================
Notes
-----
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
Import the libraries
from sklearn.datasets import load_boston
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
from matplotlib import pyplot as plt
Load the dataset
boston_data = load_boston()
x_data = boston_data.data          # feature matrix, shape (506, 13)
y_data = boston_data.target        # target MEDV, median home value in $1000's
names = boston_data.feature_names
FeaturesNums = 13
DataNums = len(x_data)
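- Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2 (over ethical concerns about the B feature), so the line above requires an older sklearn version. For a quick look at the raw data, the arrays can be wrapped in a DataFrame; a minimal sketch:
# Wrap the arrays in a DataFrame for convenient inspection
df = pd.DataFrame(x_data, columns=names)
df['MEDV'] = y_data
print(df.describe())   # per-feature count, mean, std, min, max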
Visualization and Analysis
- Visualize each feature of the dataset
- Analyze the correlations before processing the data
- Visualize again after processing
- Feed what the plots show back into the data processing
- Once the data look reasonable, start trying models
The plots below were produced from the features that remain after the screening described later; a correlation check like the sketch below can guide that screening.
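- One way to ground the feature screening is to check each feature's correlation with the target; a minimal sketch using pandas (it assumes the df DataFrame built in the loading step above):
# Pearson correlation of each feature with the target MEDV
print(df.corr()['MEDV'].sort_values(ascending=False))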
Features vs. the target
- Observe the relationship between each feature and the target
- Judge how much each feature contributes to the target
# Scatter plot of each feature against the target.
# Uses x_train/y_train, so run this after the split defined below;
# the grid size adapts to however many features remain.
rows = int(np.ceil(FeaturesNums / 3))
plt.figure(figsize=(20, 12))
for i in range(FeaturesNums):
    plt.subplot(rows, 3, i + 1)
    plt.scatter(x_train[:, i], y_train, s=20, color='blueviolet')
    plt.title(names[i])
plt.show()
Feature distributions
- The distributions indicate how informative each feature is
- They also reveal anomalous values
rows = int(np.ceil(FeaturesNums / 3))
plt.figure(figsize=(20, 10))
for i in range(FeaturesNums):
    plt.subplot(rows, 3, i + 1)
    # plt.hist has no 'width' keyword; control bar granularity with bins instead
    plt.hist(x_data[:, i], bins=30, color='lightseagreen')
    plt.xlabel(names[i])
    plt.title(names[i])
plt.show()
Data Processing
- Import sklearn's preprocessing module
- Several processing steps follow
from sklearn import preprocessing
Remove outliers
DelList0 = []
for i in range(DataNums):
    # MEDV is censored at 50.0, so values near the cap (>= 49) and
    # implausibly low values (<= 1) are treated as outliers
    if y_data[i] >= 49 or y_data[i] <= 1:
        DelList0.append(i)
DataNums -= len(DelList0)
x_data = np.delete(x_data, DelList0, axis=0)
y_data = np.delete(y_data, DelList0, axis=0)
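- The same filtering can be written in one step with a NumPy boolean mask; an equivalent sketch:
# Keep only rows whose target lies strictly between 1 and 49
mask = (y_data > 1) & (y_data < 49)
x_data, y_data = x_data[mask], y_data[mask]
DataNums = len(x_data)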
Drop uninformative features
DelList1 = []
for i in range(FeaturesNums):
    if names[i] in ('ZN', 'INDUS', 'RAD', 'TAX',
                    'CHAS', 'NOX', 'B', 'PTRATIO'):
        DelList1.append(i)
x_data = np.delete(x_data, DelList1, axis=1)
names = np.delete(names, DelList1)
FeaturesNums -= len(DelList1)   # 5 features remain: CRIM, RM, AGE, DIS, LSTAT
Train/test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3)
Normalization
- Split first, then scale: fit the scaler on the training set only and reuse its parameters on the test set, otherwise test-set information leaks into training
from sklearn.preprocessing import MinMaxScaler
nms = MinMaxScaler()
x_train = nms.fit_transform(x_train)
x_test = nms.transform(x_test)    # transform only: reuse the training-set min/max
y_nms = MinMaxScaler()            # separate scaler for the target
y_train = y_nms.fit_transform(y_train.reshape(-1, 1)).ravel()
y_test = y_nms.transform(y_test.reshape(-1, 1)).ravel()
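- To make leakage impossible by construction, the scaling step can also live inside a scikit-learn Pipeline, which re-fits the scaler on each training fold during cross-validation. A minimal sketch (the step names are arbitrary):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# The scaler is fit inside each CV fold, never on held-out data
pipe = Pipeline([('scale', MinMaxScaler()),
                 ('reg', LinearRegression())])
print(cross_val_score(pipe, x_data, y_data, cv=5).mean())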
Model Training
- Try several models
- Keep the one that scores best
Linear regression (LinearRegression)
- Train a linear regression model
- Check its MSE and R2 scores
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print("MSE =", mean_squared_error(y_test, y_pred), end='\n\n')
print("R2 =", r2_score(y_test, y_pred), end='\n\n')
MSE = 0.013304697805737791
R2 = 0.44625845284900767
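- Note that this MSE is computed on the min-max-scaled target, which is why it looks so small. The single holdout score above is also not directly comparable to the cross-validated SVR scores below; a sketch scoring linear regression under the same 5-fold protocol:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Same 5-fold CV as the SVR models below (default scoring: R2)
lr_score = cross_val_score(LinearRegression(), x_train, y_train, cv=5)
print("Linear_Regression_Score =", lr_score.mean())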
- Visualize the results
# Plot predictions against actual values; the diagonal marks perfect predictions
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, c='blue', edgecolors='aqua', s=13)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
        lw=2, color='navy')
ax.set_xlabel('Reality')
ax.set_ylabel('Prediction')
plt.show()
SVR with the linear kernel
- Train an SVR model with the linear kernel
- Check its score
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict, cross_val_score

linear_svr = SVR(kernel='linear')
linear_svr_pred = cross_val_predict(linear_svr, x_train, y_train, cv=5)  # out-of-fold predictions
linear_svr_score = cross_val_score(linear_svr, x_train, y_train, cv=5)   # default scoring: R2
linear_svr_meanscore = linear_svr_score.mean()
print("Linear_SVR_Score =", linear_svr_meanscore)
Linear_SVR_Score = 0.6497361775614359
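- The SVR above runs with default hyperparameters; tuning the regularization strength C and the tube width epsilon may improve the score. A sketch using GridSearchCV (the grid values are untuned guesses):
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Search a small grid with 5-fold CV (R2 scoring by default)
param_grid = {'C': [0.1, 1, 10], 'epsilon': [0.01, 0.1, 0.2]}
search = GridSearchCV(SVR(kernel='linear'), param_grid, cv=5)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)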
SVR with the poly kernel
- Train an SVR model with the poly kernel
- Check its score
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict, cross_val_score

poly_svr = SVR(kernel='poly')
poly_svr_pred = cross_val_predict(poly_svr, x_train, y_train, cv=5)
poly_svr_score = cross_val_score(poly_svr, x_train, y_train, cv=5)
poly_svr_meanscore = poly_svr_score.mean()
print("Poly_SVR_Score =", poly_svr_meanscore)
Poly_SVR_Score = 0.5383303049258509
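- For completeness, the RBF kernel (SVR's default) is worth scoring under the same protocol; a minimal untuned sketch:
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rbf_svr = SVR(kernel='rbf')  # the default kernel; often competitive on scaled data
rbf_svr_score = cross_val_score(rbf_svr, x_train, y_train, cv=5)
print("RBF_SVR_Score =", rbf_svr_score.mean())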
Summary
- Data processing and visualization feed back into each other, which helps screen out uninformative features
- Several models were tried, and the one with the best score was kept for the final training
- In the end, the SVR with the linear kernel performed best
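- As a final step, the winning model can be refit on the full training set and evaluated once on the held-out test set; a minimal sketch reusing the variables defined above:
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

# Refit the selected model on all training data, then evaluate on the test set
final_model = SVR(kernel='linear')
final_model.fit(x_train, y_train)
final_pred = final_model.predict(x_test)
print("Test MSE =", mean_squared_error(y_test, final_pred))
print("Test R2 =", r2_score(y_test, final_pred))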