We tune the random forest's predictive performance mainly through the parameters that influence it:


Environment: Python 3.7.3 (MSC v.1915, 64-bit AMD64), IPython 7.6.1.

                             

[Figure: generalization error as a function of model complexity, with a minimum between the underfitting (left) and overfitting (right) regions]

For tree models, the more luxuriant the tree (the deeper it grows, the more branches and leaves it carries), the more complex the model. Tree models are therefore born in the upper-right corner of this figure, and since a random forest is built from tree models, it too is an inherently high-complexity model. Its parameters all work toward a single goal: reducing model complexity and moving the model toward the left of the figure, to prevent overfitting. Of course, tuning is never absolute; some random forests start out on the left side of the figure, so before tuning we must first determine which side of the figure the model currently sits on.

1) A model that is too complex or too simple both suffer high generalization error; what we want is the balance point in between.
2) A model that is too complex overfits; a model that is too simple underfits.
3) For tree models and tree ensembles, the deeper the tree and the more branches and leaves it has, the more complex the model.
4) For tree models and tree ensembles, tuning aims to reduce model complexity and move the model toward the left of the figure (one way to check which side you are on is sketched below).
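
Before any tuning, it helps to check which side of the curve the model sits on. Below is a minimal sketch (an addition, not part of the original walkthrough) that uses sklearn's validation_curve to compare training and cross-validation accuracy as max_depth grows; the depth range and cv=5 are illustrative choices.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

data = load_breast_cancer()
depths = np.arange(1, 16)

# Score the same forest at every depth, on both the training folds
# and the held-out folds.
train_scores, cv_scores = validation_curve(
    RandomForestClassifier(n_estimators=100, random_state=90),
    data.data, data.target,
    param_name="max_depth", param_range=depths, cv=5)

plt.plot(depths, train_scores.mean(axis=1), label="train")
plt.plot(depths, cv_scores.mean(axis=1), label="cross-validation")
plt.xlabel("max_depth (model complexity)")
plt.ylabel("accuracy")
plt.legend()
plt.show()

# A train/CV gap that widens with depth puts the model on the right
# (variance-dominated) side; both curves still rising together puts it
# on the left (bias-dominated) side.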
 

 

1. Import the required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

 

 

2. Load the dataset and explore it

data = load_breast_cancer()

data.data.shape
Out[2]: (569, 30)

data.target
Out[3]: 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       ......
       1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1])
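
Two more exploration lines one might run here (an addition, not in the original) to check the class names and balance; numpy is already imported above.

data.target_names                             # the two class names
np.unique(data.target, return_counts=True)    # how many samples per class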

 

3. Fit a simple baseline model to see how the model performs on this dataset out of the box
rfc = RandomForestClassifier(n_estimators=100, random_state=90)
score_pre = cross_val_score(rfc, data.data, data.target, cv=10).mean()
score_pre

Out[4]: 0.9666925935528475
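
As a quick cross-check (an addition, not in the original), the forest's out-of-bag estimate gives a comparable accuracy figure without a separate cross-validation loop, because each tree is scored on the bootstrap samples it never saw:

# OOB accuracy as a cheap stand-in for the 10-fold CV mean above;
# requires the default bootstrap=True.
rfc = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=90)
rfc.fit(data.data, data.target)
rfc.oob_score_    # expect a value close to score_pre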

 

 

4. The first step of random forest tuning: whatever else, start with n_estimators, scanning in steps of 10

scorel = []
for i in range(0, 200, 10):
    rfc = RandomForestClassifier(n_estimators=i+1,
                                 n_jobs=-1,
                                 random_state=90)
    score = cross_val_score(rfc, data.data, data.target, cv=10).mean()
    scorel.append(score)

print(max(scorel), (scorel.index(max(scorel))*10)+1)
plt.figure(figsize=[20, 5])
plt.plot(range(1, 201, 10), scorel)
plt.show()

0.9684480598046841 41


[Figure: cross-validation score as n_estimators goes from 1 to 191, peaking near 41]

 

 

5. The curve above puts the optimum near 41, so refine the learning curve around that point

scorel = []
for i in range(35, 45):
    rfc = RandomForestClassifier(n_estimators=i,
                                 n_jobs=-1,
                                 random_state=90)
    score = cross_val_score(rfc, data.data, data.target, cv=10).mean()
    scorel.append(score)

print(max(scorel), [*range(35, 45)][scorel.index(max(scorel))])
plt.figure(figsize=[20, 5])
plt.plot(range(35, 45), scorel)
plt.show()

0.9719568317345088 39

Adjusting n_estimators has a marked effect: the model's accuracy immediately rises by about 0.005.
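
The coarse and fine scans above follow the same pattern, so they can be wrapped in a small helper. This is a hypothetical refactor (the name n_estimators_curve is ours), assuming the imports from step 1:

def n_estimators_curve(values, X, y, cv=10, random_state=90):
    # Mean cross-validation score for each candidate n_estimators.
    scores = []
    for n in values:
        rfc = RandomForestClassifier(n_estimators=n, n_jobs=-1,
                                     random_state=random_state)
        scores.append(cross_val_score(rfc, X, y, cv=cv).mean())
    return scores

# Usage: the coarse scan (1, 11, ..., 191), then the zoomed-in scan.
coarse = n_estimators_curve(range(1, 201, 10), data.data, data.target)
fine = n_estimators_curve(range(35, 45), data.data, data.target)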

6. From here on, we use grid search to adjust the remaining parameters one at a time

7. Tune max_depth

param_grid = {'max_depth': np.arange(1, 20, 1)}
# Choose a probing range based on the size of the data: the breast cancer
# dataset is small, so 1-10 or 1-20 is enough. For large data such as digit
# recognition, try depths of 30-50 (and even that may not be enough);
# better yet, plot a learning curve to observe how depth affects the model.

rfc = RandomForestClassifier(n_estimators=39, random_state=90)
GS = GridSearchCV(rfc, param_grid, cv=10)
GS.fit(data.data, data.target)

GS.best_params_
Out[8]: {'max_depth': 11}

GS.best_score_
Out[9]: 0.9718804920913884

After max_depth is set to a finite value, the model's accuracy drops. Limiting max_depth simplifies the model and pushes it to the left, yet overall accuracy falls, i.e. overall generalization error rises. This tells us the model currently sits on the left side of the figure, to the left of the generalization error minimum (the bias-dominated side).
 

When a model sits on the left of the figure, what we need are options that increase model complexity (raise variance, reduce bias): max_depth should be as large as possible, and min_samples_leaf and min_samples_split as small as possible. This all but says that, apart from max_features, there is nothing left to tune, because max_depth, min_samples_leaf, and min_samples_split are pruning parameters whose only job is to reduce complexity. We can already predict that we are very close to the model's ceiling and that it may be unable to improve any further.
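
One way to corroborate this diagnosis (an addition, not from the original) is a learning curve over training-set size: if the training and cross-validation scores converge toward the same plateau, the model is bias-limited, as argued above, while a stubbornly wide gap would point to variance instead. A sketch, assuming sklearn's learning_curve and the imports from step 1:

from sklearn.model_selection import learning_curve

# Score the tuned forest on growing fractions of the training data.
sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(n_estimators=39, random_state=90),
    data.data, data.target,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=10)

plt.plot(sizes, train_scores.mean(axis=1), label="train")
plt.plot(sizes, cv_scores.mean(axis=1), label="cross-validation")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()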

 

8. Tune max_features

param_grid = {'max_features': np.arange(5, 30, 1)}

rfc = RandomForestClassifier(n_estimators=39, random_state=90)
GS = GridSearchCV(rfc, param_grid, cv=10)
GS.fit(data.data, data.target)

GS.best_params_
Out[12]: {'max_features': 5}

GS.best_score_
Out[13]: 0.9718804920913884

Grid search returned the smallest value of max_features tried, which shows that raising max_features lowered accuracy: pushing the model to the right increased its generalization error. (Note that 5 is effectively the default here: for a classifier, max_features='auto' means sqrt(n_features), and sqrt(30) ≈ 5.48 truncates to 5.) Earlier, pushing left with max_depth raised the error; now pushing right with max_features raises it too. The model is therefore already sitting at the minimum of the generalization error curve, at its predictive ceiling, and there is nothing left for the parameters to shift.

 

9. Tune min_samples_leaf

param_grid = {'min_samples_leaf': np.arange(1, 1+10, 1)}
# For min_samples_split and min_samples_leaf, start from their minimum and
# scan upward by 10 or 20. For high-dimensional, high-sample-count data,
# going up by 50 is also reasonable; very large datasets may need a range
# of 200-300. If accuracy refuses to improve no matter what, feel free to
# set a very large value and heavily restrict the model's complexity.

rfc = RandomForestClassifier(n_estimators=39, random_state=90)
GS = GridSearchCV(rfc, param_grid, cv=10)
GS.fit(data.data, data.target)

GS.best_params_
Out[15]: {'min_samples_leaf': 1}

GS.best_score_
Out[16]: 0.9718804920913884

 

10. Tune min_samples_split

param_grid = {'min_samples_split': np.arange(2, 2+20, 1)}

rfc = RandomForestClassifier(n_estimators=39, random_state=90)
GS = GridSearchCV(rfc, param_grid, cv=10)
GS.fit(data.data, data.target)

GS.best_params_
Out[19]: {'min_samples_split': 2}

GS.best_score_
Out[20]: 0.9718804920913884

 

11. Tune criterion

param_grid = {'criterion': ['gini', 'entropy']}

rfc = RandomForestClassifier(n_estimators=39, random_state=90)
GS = GridSearchCV(rfc, param_grid, cv=10)
GS.fit(data.data, data.target)

GS.best_params_
Out[22]: {'criterion': 'gini'}

GS.best_score_
Out[23]: 0.9718804920913884

 

12. Tuning complete: the model's best parameters

Every grid search either chose the default value or scored no higher than the n_estimators-only baseline, so the final model keeps n_estimators=39 and leaves everything else at its default.

rfc = RandomForestClassifier(n_estimators=39, random_state=90)
score = cross_val_score(rfc, data.data, data.target, cv=10).mean()
score
Out[24]: 0.9719568317345088

 

13. How much the tuning improved the score

score - score_pre
Out[25]: 0.005264238181661218
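
For comparison (an addition, not part of the original walkthrough), the single-parameter searches above can be fused into one joint grid, which also explores interactions between parameters. The candidate values below are illustrative:

# Joint grid: 3 * 3 * 3 = 27 candidates x 10 folds = 270 fits; each extra
# parameter multiplies the cost, unlike the one-at-a-time searches above.
param_grid = {'n_estimators': [35, 39, 45],
              'max_depth': [None, 8, 11],
              'max_features': [3, 5, 8]}
GS = GridSearchCV(RandomForestClassifier(random_state=90), param_grid, cv=10)
GS.fit(data.data, data.target)
GS.best_params_, GS.best_score_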