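Python Machine Learning Template
1. Define the problem
- Import libraries
- Load the dataset
2. Understand the data
- Descriptive statistics
- Data visualization
3. Prepare the data
- Data cleaning
- Feature selection
- Data transformation
4. Evaluate algorithms
- Split out a validation dataset
- Define the evaluation metric
- Spot-check algorithms
- Compare algorithms
5. Improve the model
- Algorithm tuning
- Ensemble methods
6. Deploy the results
- Predict on the validation dataset
- Build the model on the entire dataset
- Serialize the model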
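The six steps above can be sketched end to end in a few lines. This is only a minimal illustration under stated assumptions: the data is synthetic (generated with make_regression), the model is plain linear regression, and the variable names are placeholders rather than part of the template itself.

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# 1. Define the problem: synthetic data stands in for a real dataset (assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=7)

# 2-3. Understanding and preparing the data are skipped: the synthetic data is already clean.

# 4. Evaluate algorithms: hold out a validation set, then cross-validate on the rest.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
cv_scores = cross_val_score(LinearRegression(), X_train, y_train,
                            cv=kfold, scoring='neg_mean_squared_error')
print('CV neg-MSE: %f (%f)' % (cv_scores.mean(), cv_scores.std()))

# 5. Improve the model: tuning and ensembles would go here.

# 6. Deploy: check on the validation set, refit on all data, then serialize.
model = LinearRegression().fit(X_train, y_train)
val_mse = mean_squared_error(y_val, model.predict(X_val))
print('Validation MSE: %f' % val_mse)

final_model = LinearRegression().fit(X, y)
restored = pickle.loads(pickle.dumps(final_model))  # round trip through bytes
```

In a real project the pickled bytes would be written to a file; the in-memory round trip here only shows that the restored model predicts the same values as the original.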
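Project Example: Boston Housing Prices
Define the Problem
This example analyzes the Boston housing price dataset, which contains 506 records with 14 attributes each (13 input features plus the target, MEDV).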
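- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built before 1940
- DIS: weighted distance to five Boston employment centers
- RAD: index of accessibility to radial highways
- TAX: full-value property tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town (the code below names this column prtitio)
- B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
- LSTAT: percentage of the population of lower socioeconomic status
- MEDV: median value of owner-occupied homes, in thousands of dollars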
Import the Data
Import the libraries:
from pandas import read_csv
import numpy as np
# np.set_printoptions(threshold=np.inf)
from matplotlib import pyplot
from pandas import set_option
from pandas.plotting import scatter_matrix
from numpy import arange
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
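Next, load the dataset and name each attribute:
# Load the data
filename = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
names = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis',
         'rad', 'tax', 'prtitio', 'b', 'lstat', 'medv']
data = read_csv(filename, names=names, sep=r'\s+')
The file is delimited by whitespace rather than commas, so the separator must be specified when reading it (sep=r'\s+' treats any run of whitespace as one delimiter; the equivalent delim_whitespace=True is deprecated in recent pandas releases).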
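To see the whitespace separator at work without downloading anything, here is a small sketch that parses two rows copied from the head of the dataset out of an in-memory string. The names sample and sample_data are illustrative only.

```python
from io import StringIO

from pandas import read_csv

# Two whitespace-separated rows copied from the head of the dataset (no header row).
sample = StringIO(
    "0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0\n"
    "0.02731  0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6\n"
)
names = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis',
         'rad', 'tax', 'prtitio', 'b', 'lstat', 'medv']

# sep=r'\s+' treats any run of spaces or tabs as a single delimiter,
# so the uneven spacing in the second row is handled correctly.
sample_data = read_csv(sample, names=names, sep=r'\s+')
print(sample_data.shape)  # (2, 14)
```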
Understand the Data
Analyze the data in order to choose suitable models.
# Dimensions of the data
print(data.shape)
# (506, 14)
# Data type of each attribute
print(data.dtypes)
# Output:
crim float64
zn float64
indus float64
chas int64
nox float64
rm float64
age float64
dis float64
rad int64
tax float64
prtitio float64
b float64
lstat float64
medv float64
dtype: object
Next, look at a few records from the dataset:
# Show the first ten records; widen the display so no columns are truncated
set_option('display.width', 1000)
set_option('display.max_rows', 1000)
set_option('display.max_columns', 1000)
set_option('display.max_colwidth', 1000)
print(data.head(10))
# Output:
crim zn indus chas nox rm age dis rad tax prtitio b lstat medv
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
5 0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222.0 18.7 394.12 5.21 28.7
6 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311.0 15.2 395.60 12.43 22.9
7 0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311.0 15.2 396.90 19.15 27.1
8 0.21124 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311.0 15.2 386.63 29.93 16.5
9 0.17004 12.5 7.87 0 0.524 6.004 85.9 6.5921 5 311.0 15.2 386.71 17.10 18.9
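Next, look at the descriptive statistics, including the maximum, minimum, median, and quartiles, to better understand the structure of the data:
print(data.describe())
# Output: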
crim zn indus chas nox rm age dis rad tax prtitio b lstat medv
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.677082 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000 50.000000
Next, look at the pairwise relationships between attributes, using the Pearson correlation coefficient:
print(data.corr(method='pearson'))
# Output:
crim zn indus chas nox rm age dis rad tax prtitio b lstat medv
crim 1.000000 -0.200469 0.406583 -0.055892 0.420972 -0.219247 0.352734 -0.379670 0.625505 0.582764 0.289946 -0.385064 0.455621 -0.388305
zn -0.200469 1.000000 -0.533828 -0.042697 -0.516604 0.311991 -0.569537 0.664408 -0.311948 -0.314563 -0.391679 0.175520 -0.412995 0.360445
indus 0.406583 -0.533828 1.000000 0.062938 0.763651 -0.391676 0.644779 -0.708027 0.595129 0.720760 0.383248 -0.356977 0.603800 -0.483725
chas -0.055892 -0.042697 0.062938 1.000000 0.091203 0.091251 0.086518 -0.099176 -0.007368 -0.035587 -0.121515 0.048788 -0.053929 0.175260
nox 0.420972 -0.516604 0.763651 0.091203 1.000000 -0.302188 0.731470 -0.769230 0.611441 0.668023 0.188933 -0.380051 0.590879 -0.427321
rm -0.219247 0.311991 -0.391676 0.091251 -0.302188 1.000000 -0.240265 0.205246 -0.209847 -0.292048 -0.355501 0.128069 -0.613808 0.695360
age 0.352734 -0.569537 0.644779 0.086518 0.731470 -0.240265 1.000000 -0.747881 0.456022 0.506456 0.261515 -0.273534 0.602339 -0.376955
dis -0.379670 0.664408 -0.708027 -0.099176 -0.769230 0.205246 -0.747881 1.000000 -0.494588 -0.534432 -0.232471 0.291512 -0.496996 0.249929
rad 0.625505 -0.311948 0.595129 -0.007368 0.611441 -0.209847 0.456022 -0.494588 1.000000 0.910228 0.464741 -0.444413 0.488676 -0.381626
tax 0.582764 -0.314563 0.720760 -0.035587 0.668023 -0.292048 0.506456 -0.534432 0.910228 1.000000 0.460853 -0.441808 0.543993 -0.468536
prtitio 0.289946 -0.391679 0.383248 -0.121515 0.188933 -0.355501 0.261515 -0.232471 0.464741 0.460853 1.000000 -0.177383 0.374044 -0.507787
b -0.385064 0.175520 -0.356977 0.048788 -0.380051 0.128069 -0.273534 0.291512 -0.444413 -0.441808 -0.177383 1.000000 -0.366087 0.333461
lstat 0.455621 -0.412995 0.603800 -0.053929 0.590879 -0.613808 0.602339 -0.496996 0.488676 0.543993 0.374044 -0.366087 1.000000 -0.737663
medv -0.388305 0.360445 -0.483725 0.175260 -0.427321 0.695360 -0.376955 0.249929 -0.381626 -0.468536 -0.507787 0.333461 -0.737663 1.000000
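As a rough illustration of what the Pearson coefficient measures, the sketch below computes it by hand for rm and medv, using only the ten rows shown by data.head(10). The helper name pearson is hypothetical, and because it sees ten rows rather than all 506, its result will differ from the 0.695360 in the matrix above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, each left unnormalized: the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# rm and medv from the ten rows printed by data.head(10)
rm = [6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631, 6.004]
medv = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9]

r = pearson(rm, medv)
print(round(r, 3))
```

A value near +1 means the two attributes rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.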
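# Data visualization
# Histogram of each attribute
data.hist(sharex=False, sharey=False, xlabelsize=1, ylabelsize=1)
pyplot.show()
# Density plot of each attribute
data.plot(kind='density', subplots=True, layout=(4,4), sharex=False)
pyplot.show()
# Scatter matrix of every pair of attributes
scatter_matrix(data)
pyplot.show()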
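# Split out a validation dataset: the first 13 columns are features, the last is the target
array = data.values
X = array[:, 0:13]
Y = array[:, 13]
# Hold out 20% of the rows for validation
validation_size = 0.2
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=validation_size, random_state=seed)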
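To make the effect of the 80/20 split concrete, here is a small sketch that applies the same call to placeholder arrays with the dataset's shape. X_demo and Y_demo are made-up stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the Boston data (506 rows, 13 features).
X_demo = np.zeros((506, 13))
Y_demo = np.arange(506, dtype=float)  # distinct values so each row can be tracked

X_tr, X_va, Y_tr, Y_va = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=7)

# The validation size is rounded up: ceil(0.2 * 506) = 102, leaving 404 rows for training.
print(X_tr.shape, X_va.shape)  # (404, 13) (102, 13)
```

Every row ends up in exactly one of the two sets, and fixing random_state makes the shuffle reproducible between runs.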
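Use 10-fold cross-validation to partition the training data, and compare the algorithms by mean squared error (MSE): the closer the MSE is to 0, the more accurate the algorithm. The scikit-learn scorer 'neg_mean_squared_error' reports the negated MSE, so the scores are negative and values closer to 0 are better.
# Evaluation options and metric
num_folds = 10
seed = 7
scoring = 'neg_mean_squared_error'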
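The following NumPy-only sketch shows roughly what 10-fold cross-validation with a negated-MSE score computes. It makes several assumptions: the data is synthetic, the model is an ordinary least-squares fit, the folds are contiguous and unshuffled, and neg_mse_kfold is a hypothetical helper rather than a scikit-learn function.

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic regression data (assumption): y depends linearly on 3 features plus small noise.
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)


def neg_mse_kfold(X, y, n_splits=10):
    """Negated MSE of a least-squares fit, evaluated on each of n_splits contiguous folds."""
    folds = np.array_split(np.arange(len(X)), n_splits)
    scores = []
    for test_idx in folds:
        # Train on every row that is not in the current test fold.
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[test_idx] = False
        # Append a column of ones so the fit includes an intercept.
        A_train = np.column_stack([X[train_mask], np.ones(train_mask.sum())])
        coef, *_ = np.linalg.lstsq(A_train, y[train_mask], rcond=None)
        A_test = np.column_stack([X[test_idx], np.ones(len(test_idx))])
        mse = np.mean((y[test_idx] - A_test @ coef) ** 2)
        scores.append(-mse)  # negate so that greater is better
    return np.array(scores)


scores = neg_mse_kfold(X_demo, y_demo)
print('%f (%f)' % (scores.mean(), scores.std()))
```

Each row is used for testing exactly once and for training nine times, which is why the mean of the ten scores is a steadier estimate than a single train/test split.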
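First evaluate the algorithms on the raw, unprocessed data to obtain a baseline against which later improvements can be compared. The algorithms to compare are:
- Linear algorithms: linear regression (LR), lasso regression (LASSO), and elastic net regression (EN)
- Nonlinear algorithms: classification and regression trees (CART), support vector machines (SVM), and k-nearest neighbors (KNN)
The code that initializes the models: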
# Evaluate algorithms: baseline
models = {}
models['LR'] = LinearRegression()
models['LASSO'] = Lasso()
models['EN'] = ElasticNet()
models['KNN'] = KNeighborsRegressor()
models['CART'] = DecisionTreeRegressor()
models['SVM'] = SVR()
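Accuracy is measured here by the mean and standard deviation of the negated MSE across the folds:
results = []
for key in models:
    # The folds are not shuffled, so random_state is omitted; recent scikit-learn
    # versions raise an error if random_state is set while shuffle=False.
    kfold = KFold(n_splits=num_folds)
    cv_result = cross_val_score(models[key], X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_result)
    print('%s: %f (%f)' % (key, cv_result.mean(), cv_result.std()))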
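Based on the results, linear regression (LR) has the best MSE (the score closest to 0), followed by classification and regression trees (CART):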
LR: -21.379856 (9.414264)
LASSO: -26.423561 (11.651110)
EN: -27.502259 (12.305022)
KNN: -41.896488 (13.901688)
CART: -25.610510 (11.840165)
SVM: -85.518342 (31.994798)
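To make these scores easier to read, the sketch below ranks the mean values reported above and converts each to a root mean squared error (RMSE), which is in the same units as medv (thousands of dollars). Note the approximation: taking the square root of the mean MSE is not the same as averaging per-fold RMSE values, so treat the numbers as rough guides.

```python
from math import sqrt

# Mean negated MSE per model, copied from the output above.
neg_mse = {
    'LR': -21.379856, 'LASSO': -26.423561, 'EN': -27.502259,
    'KNN': -41.896488, 'CART': -25.610510, 'SVM': -85.518342,
}

# Sort from best (closest to 0) to worst.
ranking = sorted(neg_mse, key=neg_mse.get, reverse=True)

# Flip the sign back to a positive MSE, then take the square root.
rmse = {name: sqrt(-score) for name, score in neg_mse.items()}

for name in ranking:
    print('%s: RMSE %.2f' % (name, rmse[name]))
```

On this reading, the baseline linear regression is off by roughly 4.6 thousand dollars on a typical prediction, while the untuned SVM is off by about twice that.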