python机器学习模板

1.定义问题

  • 导入类库
  • 导入数据集

2.理解数据

  • 描述性统计
  • 数据可视化

3.数据准备

  • 数据清洗
  • 特征选择
  • 数据转换

4.评估算法

  • 分离数据集
  • 定义模型评估标准
  • 算法审查
  • 算法比较

5.优化模型

  • 算法调参
  • 集成算法

6.结果部署

  • 预测评估数据集
  • 利用整个数据集生成模型
  • 序列化模型

项目实例——波士顿房价

定义问题

       本例中分析波士顿房价的数据集,数据集包含14个特共506条数据。

  • CRIM: 城镇人均犯罪率
  • ZN: 住宅用地所占比例
  • INDUS: 城镇中非住宅用地所占比例
  • CHAS: 虚拟变量,用于回归分析
  • RM: 每栋住宅的房间数
  • AGE: 1940年以前建成的自住单位比例
  • DIS: 距离5个波士顿就业中心的加权距离
  • RAD: 距高速公路便利指数
  • TAX: 每一万美元的不动产税率
  • PRTATIO: 城镇中的教师学生比例
  • B: 城镇中黑人比例
  • LSTAT: 地区中有多少房东属于低收入人群
  • MEDV: 自住房屋房价中位数

导入数据

导入类库

from pandas import read_csv
import numpy as np
# np.set_printoptions(threshold=np.inf)
from matplotlib import pyplot
from pandas import set_option
from pandas.plotting import scatter_matrix
from numpy import arange
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error

下面导入数据集并命名每个数据属性:
#导入数据

filename = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
names = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 
          'rad', 'tax', 'prtitio', 'b', 'lstat', 'medv']
data = read_csv(filename, names= names, delim_whitespace=True)

CSV文件是使用空格键做分隔符的,因此读入CSV文件时要指定分隔符为空格键(delim_whitescape=True)

理解数据

对数据进行分析以选取合适的模型。

#数据维度
print (data.shape)
#(506,14)

#特征属性的字段类型
print (data.dtypes)
#
crim       float64
zn         float64
indus      float64
chas         int64
nox        float64
rm         float64
age        float64
dis        float64
rad          int64
tax        float64
prtitio    float64
b          float64
lstat      float64
medv       float64
dtype: object

下面查看一些数据集中的记录:

#查看最开始的前十条记录
set_option('display.width', 12000)
set_option('display.width', 1000, 'display.max_rows', 1000)
set_option('display.max_columns',1000)
set_option('display.width', 1000)
set_option('display.max_colwidth',1000)

print(data.head(10))
#
      crim    zn  indus  chas    nox     rm    age     dis  rad    tax  prtitio       b  lstat  medv
0  0.00632  18.0   2.31     0  0.538  6.575   65.2  4.0900    1  296.0     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07     0  0.469  6.421   78.9  4.9671    2  242.0     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07     0  0.469  7.185   61.1  4.9671    2  242.0     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18     0  0.458  6.998   45.8  6.0622    3  222.0     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18     0  0.458  7.147   54.2  6.0622    3  222.0     18.7  396.90   5.33  36.2
5  0.02985   0.0   2.18     0  0.458  6.430   58.7  6.0622    3  222.0     18.7  394.12   5.21  28.7
6  0.08829  12.5   7.87     0  0.524  6.012   66.6  5.5605    5  311.0     15.2  395.60  12.43  22.9
7  0.14455  12.5   7.87     0  0.524  6.172   96.1  5.9505    5  311.0     15.2  396.90  19.15  27.1
8  0.21124  12.5   7.87     0  0.524  5.631  100.0  6.0821    5  311.0     15.2  386.63  29.93  16.5
9  0.17004  12.5   7.87     0  0.524  6.004   85.9  6.5921    5  311.0     15.2  386.71  17.10  18.9

接下来查看统计性数据,包括最大值,最小值,中位数,四分位数,加强对数据结构理解:

print (data.describe())
##
             crim          zn       indus        chas         nox          rm         age         dis         rad         tax     prtitio           b       lstat        medv
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000

mean     3.613524   11.363636   11.136779    0.069170    0.554695    6.284634   68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   12.653063   22.532806

std      8.601545   23.322453    6.860353    0.253994    0.115878    0.702617   28.148861    2.105710    8.707259  168.537116    2.164946   91.294864    7.141062    9.197104

min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000    2.900000    1.129600    1.000000  187.000000   12.600000    0.320000    1.730000    5.000000

25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   45.025000    2.100175    4.000000  279.000000   17.400000  375.377500    6.950000   17.025000

50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   77.500000    3.207450    5.000000  330.000000   19.050000  391.440000   11.360000   21.200000

75%      3.677082   12.500000   18.100000    0.000000    0.624000    6.623500   94.075000    5.188425   24.000000  666.000000   20.200000  396.225000   16.955000   25.000000

max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000  100.000000   12.126500   24.000000  711.000000   22.000000  396.900000   37.970000   50.000000

接下来查看数据之间的两两关联关系,这里利用皮尔逊相关系数:

print (data.corr(method='pearson'))
##
             crim        zn     indus      chas       nox        rm       age       dis       rad       tax   prtitio         b     lstat      medv
crim     1.000000 -0.200469  0.406583 -0.055892  0.420972 -0.219247  0.352734 -0.379670  0.625505  0.582764  0.289946 -0.385064  0.455621 -0.388305
zn      -0.200469  1.000000 -0.533828 -0.042697 -0.516604  0.311991 -0.569537  0.664408 -0.311948 -0.314563 -0.391679  0.175520 -0.412995  0.360445
indus    0.406583 -0.533828  1.000000  0.062938  0.763651 -0.391676  0.644779 -0.708027  0.595129  0.720760  0.383248 -0.356977  0.603800 -0.483725
chas    -0.055892 -0.042697  0.062938  1.000000  0.091203  0.091251  0.086518 -0.099176 -0.007368 -0.035587 -0.121515  0.048788 -0.053929  0.175260
nox      0.420972 -0.516604  0.763651  0.091203  1.000000 -0.302188  0.731470 -0.769230  0.611441  0.668023  0.188933 -0.380051  0.590879 -0.427321
rm      -0.219247  0.311991 -0.391676  0.091251 -0.302188  1.000000 -0.240265  0.205246 -0.209847 -0.292048 -0.355501  0.128069 -0.613808  0.695360
age      0.352734 -0.569537  0.644779  0.086518  0.731470 -0.240265  1.000000 -0.747881  0.456022  0.506456  0.261515 -0.273534  0.602339 -0.376955
dis     -0.379670  0.664408 -0.708027 -0.099176 -0.769230  0.205246 -0.747881  1.000000 -0.494588 -0.534432 -0.232471  0.291512 -0.496996  0.249929
rad      0.625505 -0.311948  0.595129 -0.007368  0.611441 -0.209847  0.456022 -0.494588  1.000000  0.910228  0.464741 -0.444413  0.488676 -0.381626
tax      0.582764 -0.314563  0.720760 -0.035587  0.668023 -0.292048  0.506456 -0.534432  0.910228  1.000000  0.460853 -0.441808  0.543993 -0.468536
prtitio  0.289946 -0.391679  0.383248 -0.121515  0.188933 -0.355501  0.261515 -0.232471  0.464741  0.460853  1.000000 -0.177383  0.374044 -0.507787
b       -0.385064  0.175520 -0.356977  0.048788 -0.380051  0.128069 -0.273534  0.291512 -0.444413 -0.441808 -0.177383  1.000000 -0.366087  0.333461
lstat    0.455621 -0.412995  0.603800 -0.053929  0.590879 -0.613808  0.602339 -0.496996  0.488676  0.543993  0.374044 -0.366087  1.000000 -0.737663
medv    -0.388305  0.360445 -0.483725  0.175260 -0.427321  0.695360 -0.376955  0.249929 -0.381626 -0.468536 -0.507787  0.333461 -0.737663  1.000000

print (data.corr(method=‘pearson’))

#数据的可视化
data.hist(sharex=False, sharey=False, xlabelsize=1, ylabelsize=1)
pyplot.show()

data.plot(kind='density', subplots=True, layout=(4,4), sharex=False)
pyplot.show()
scatter_matrix(data)
pyplot.show()

# 分离数据集
array = data.values
X = array[:, 0:13]
Y = array[:, 13]
validation_size = 0.2
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(X,
    Y, test_size=validation_size, random_state=seed)

采用10折交叉验证来分离数据,通过均方误差来比较算法准确度,均方误差越接近0,算法的准确度越高。

#评估算法,评估标准
num_folds = 10
seed = 7 
scoring = 'neg_mean_squared_error'

对原始数据不做处理,先对算法进行评估,得到一个算法的评估基准,这个基准是对后续算法改善优劣比较的基准。下面是待比较的算法:线性算法:线性回归(LR), 套索回归(LASSO), 弹性网络回顾(EN)非线性回归:分类与回归树(CART),支持向量机(SVM)和K近邻算法(KNN)算法模型初始化的代码:

#评估算法-baseline
models = {}
models['LR'] = LinearRegression()
models['LASSO'] = Lasso()
models['EN'] = ElasticNet()
models['KNN'] = KNeighborsRegressor()
models['CART'] = DecisionTreeRegressor()
models['SVM'] = SVR()

这里的算法准确度以均方误差的均值和标准方差衡量:

results = []
for key in models:
   kfold = KFold(n_splits=num_folds, random_state=seed)
   cv_result = cross_val_score(models[key], X_train, Y_train, cv=kfold, scoring=scoring)
   results.append(cv_result)
   
   print('%s: %f (%f)' % (key, cv_result.mean(), cv_result.std()))

从执行结果来看,线性回归(LR)具有最好的MSE,接下来是分类与回归树(CART).

LR: -21.379856 (9.414264)
LASSO: -26.423561 (11.651110)
EN: -27.502259 (12.305022)
KNN: -41.896488 (13.901688)
CART: -25.610510 (11.840165)
SVM: -85.518342 (31.994798)