We tune the random forest's predictive performance mainly through the parameters that influence it:


Environment: Python 3.7.3 (MSC v.1915, 64-bit AMD64), IPython 7.6.1.

                             

[Figure: generalization error as a function of model complexity, with a minimum between the underfitting (left) and overfitting (right) regions]

For tree models, the more luxuriant the tree (the deeper it grows, the more branches and leaves it carries), the more complex the model. Tree models are therefore born in the upper-right corner of this figure, and since a random forest is built from tree models, it too is an inherently high-complexity model. Its parameters all work toward a single goal: reducing model complexity and moving the model toward the left of the figure, to prevent overfitting. Of course, tuning is never absolute; some random forests start out on the left side of the figure, so before tuning we must first determine which side of the figure the model currently sits on.

1) A model that is too complex or too simple both suffer high generalization error; what we want is the balance point in between.
2) A model that is too complex overfits; a model that is too simple underfits.
3) For tree models and tree ensembles, the deeper the tree and the more branches and leaves it has, the more complex the model.
4) For tree models and tree ensembles, tuning aims to reduce model complexity and move the model toward the left of the figure (one way to check which side you are on is sketched below).
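
Before any tuning, it helps to check which side of the curve the model sits on. Below is a minimal sketch (an addition, not part of the original walkthrough) that uses sklearn's validation_curve to compare training and cross-validation accuracy as max_depth grows; the depth range and cv=5 are illustrative choices.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

data = load_breast_cancer()
depths = np.arange(1, 16)

# Score the same forest at every depth, on both the training folds
# and the held-out folds.
train_scores, cv_scores = validation_curve(
    RandomForestClassifier(n_estimators=100, random_state=90),
    data.data, data.target,
    param_name="max_depth", param_range=depths, cv=5)

plt.plot(depths, train_scores.mean(axis=1), label="train")
plt.plot(depths, cv_scores.mean(axis=1), label="cross-validation")
plt.xlabel("max_depth (model complexity)")
plt.ylabel("accuracy")
plt.legend()
plt.show()

# A train/CV gap that widens with depth puts the model on the right
# (variance-dominated) side; both curves still rising together puts it
# on the left (bias-dominated) side.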
 

 

1. Import the required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

 

 

2. Load the dataset and explore it

data = load_breast_cancer()

data.data.shape
Out[2]: (569, 30)

data.target
Out[3]: 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       ......
       1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1])
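
Two more exploration lines one might run here (an addition, not in the original) to check the class names and balance; numpy is already imported above.

data.target_names                             # the two class names
np.unique(data.target, return_counts=True)    # how many samples per class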

 

3. Fit a simple baseline model to see how the model performs on this dataset out of the box
rfc = RandomForestClassifier(n_estimators=100, random_state=90)
score_pre = cross_val_score(rfc, data.data, data.target, cv=10).mean()
score_pre

Out[4]: 0.9666925935528475
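
As a quick cross-check (an addition, not in the original), the forest's out-of-bag estimate gives a comparable accuracy figure without a separate cross-validation loop, because each tree is scored on the bootstrap samples it never saw:

# OOB accuracy as a cheap stand-in for the 10-fold CV mean above;
# requires the default bootstrap=True.
rfc = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=90)
rfc.fit(data.data, data.target)
rfc.oob_score_    # expect a value close to score_pre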

 

 

4. The first step of random forest tuning: whatever else, start with n_estimators, scanning in steps of 10

scorel = []
for i in range(0, 200, 10):
    rfc = RandomForestClassifier(n_estimators=i+1,
                                 n_jobs=-1,
                                 random_state=90)
    score = cross_val_score(rfc, data.data, data.target, cv=10).mean()
    scorel.append(score)

print(max(scorel), (scorel.index(max(scorel))*10)+1)
plt.figure(figsize=[20, 5])
plt.plot(range(1, 201, 10), scorel)
plt.show()

0.9684480598046841 41


[Figure: cross-validation score as n_estimators goes from 1 to 191, peaking near 41]

 

 

5. The curve above puts the optimum near 41, so refine the learning curve around that point

scorel = []
for i in range(35, 45):
    rfc = RandomForestClassifier(n_estimators=i,
                                 n_jobs=-1,
                                 random_state=90)
    score = cross_val_score(rfc, data.data, data.target, cv=10).mean()
    scorel.append(score)

print(max(scorel), [*range(35, 45)][scorel.index(max(scorel))])
plt.figure(figsize=[20, 5])
plt.plot(range(35, 45), scorel)
plt.show()

0.9719568317345088 39

Adjusting n_estimators has a marked effect: the model's accuracy immediately rises by about 0.005.
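
The coarse and fine scans above follow the same pattern, so they can be wrapped in a small helper. This is a hypothetical refactor (the name n_estimators_curve is ours), assuming the imports from step 1:

def n_estimators_curve(values, X, y, cv=10, random_state=90):
    # Mean cross-validation score for each candidate n_estimators.
    scores = []
    for n in values:
        rfc = RandomForestClassifier(n_estimators=n, n_jobs=-1,
                                     random_state=random_state)
        scores.append(cross_val_score(rfc, X, y, cv=cv).mean())
    return scores

# Usage: the coarse scan (1, 11, ..., 191), then the zoomed-in scan.
coarse = n_estimators_curve(range(1, 201, 10), data.data, data.target)
fine = n_estimators_curve(range(35, 45), data.data, data.target)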

6. From here on, we use grid search to adjust the remaining parameters one at a time

7. Tune max_depth

param_grid = {'max_depth': np.arange(1, 20, 1)}
# Choose a probing range based on the size of the data: the breast cancer
# dataset is small, so 1-10 or 1-20 is enough. For large data such as digit
# recognition, try depths of 30-50 (and even that may not be enough);
# better yet, plot a learning curve to observe how depth affects the model.

rfc = RandomForestClassifier(n_estimators=39, random_state=90)
GS = GridSearchCV(rfc, param_grid, cv=10)
GS.fit(data.data, data.target)

GS.best_params_
Out[8]: {'max_depth': 11}

GS.best_score_
Out[9]: 0.9718804920913884

After max_depth is set to a finite value, the model's accuracy drops. Limiting max_depth simplifies the model and pushes it to the left, yet overall accuracy falls, i.e. overall generalization error rises. This tells us the model currently sits on the left side of the figure, to the left of the generalization error minimum (the bias-dominated side).
 

When a model sits on the left of the figure, what we need are options that increase model complexity (raise variance, reduce bias): max_depth should be as large as possible, and min_samples_leaf and min_samples_split as small as possible. This all but says that, apart from max_features, there is nothing left to tune, because max_depth, min_samples_leaf, and min_samples_split are pruning parameters whose only job is to reduce complexity. We can already predict that we are very close to the model's ceiling and that it may be unable to improve any further.
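
One way to corroborate this diagnosis (an addition, not from the original) is a learning curve over training-set size: if the training and cross-validation scores converge toward the same plateau, the model is bias-limited, as argued above, while a stubbornly wide gap would point to variance instead. A sketch, assuming sklearn's learning_curve and the imports from step 1:

from sklearn.model_selection import learning_curve

# Score the tuned forest on growing fractions of the training data.
sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(n_estimators=39, random_state=90),
    data.data, data.target,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=10)

plt.plot(sizes, train_scores.mean(axis=1), label="train")
plt.plot(sizes, cv_scores.mean(axis=1), label="cross-validation")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()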

 

8. Tune max_features

param_grid = {'max_features': np.arange(5, 30, 1)}

rfc = RandomForestClassifier(n_estimators=39, random_state=90)
GS = GridSearchCV(rfc, param_grid, cv=10)
GS.fit(data.data, data.target)

GS.best_params_
Out[12]: {'max_features': 5}

GS.best_score_
Out[13]: 0.9718804920913884

Grid search returned the smallest value of max_features tried, which shows that raising max_features lowered accuracy: pushing the model to the right increased its generalization error. (Note that 5 is effectively the default here: for a classifier, max_features='auto' means sqrt(n_features), and sqrt(30) ≈ 5.48 truncates to 5.) Earlier, pushing left with max_depth raised the error; now pushing right with max_features raises it too. The model is therefore already sitting at the minimum of the generalization error curve, at its predictive ceiling, and there is nothing left for the parameters to shift.

 

9. Tune min_samples_leaf

param_grid = {'min_samples_leaf': np.arange(1, 1+10, 1)}
# For min_samples_split and min_samples_leaf, start from their minimum and
# scan upward by 10 or 20. For high-dimensional, high-sample-count data,
# going up by 50 is also reasonable; very large datasets may need a range
# of 200-300. If accuracy refuses to improve no matter what, feel free to
# set a very large value and heavily restrict the model's complexity.

rfc = RandomForestClassifier(n_estimators=39, random_state=90)
GS = GridSearchCV(rfc, param_grid, cv=10)
GS.fit(data.data, data.target)

GS.best_params_
Out[15]: {'min_samples_leaf': 1}

GS.best_score_
Out[16]: 0.9718804920913884

 

10. Tune min_samples_split

param_grid = {'min_samples_split': np.arange(2, 2+20, 1)}

rfc = RandomForestClassifier(n_estimators=39, random_state=90)
GS = GridSearchCV(rfc, param_grid, cv=10)
GS.fit(data.data, data.target)

GS.best_params_
Out[19]: {'min_samples_split': 2}

GS.best_score_
Out[20]: 0.9718804920913884

 

11. Tune criterion

param_grid = {'criterion': ['gini', 'entropy']}

rfc = RandomForestClassifier(n_estimators=39, random_state=90)
GS = GridSearchCV(rfc, param_grid, cv=10)
GS.fit(data.data, data.target)

GS.best_params_
Out[22]: {'criterion': 'gini'}

GS.best_score_
Out[23]: 0.9718804920913884

 

12. Tuning complete: the model's best parameters

Every grid search either chose the default value or scored no higher than the n_estimators-only baseline, so the final model keeps n_estimators=39 and leaves everything else at its default.

rfc = RandomForestClassifier(n_estimators=39, random_state=90)
score = cross_val_score(rfc, data.data, data.target, cv=10).mean()
score
Out[24]: 0.9719568317345088

 

13. How much the tuning improved the score

score - score_pre
Out[25]: 0.005264238181661218
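
For comparison (an addition, not part of the original walkthrough), the single-parameter searches above can be fused into one joint grid, which also explores interactions between parameters. The candidate values below are illustrative:

# Joint grid: 3 * 3 * 3 = 27 candidates x 10 folds = 270 fits; each extra
# parameter multiplies the cost, unlike the one-at-a-time searches above.
param_grid = {'n_estimators': [35, 39, 45],
              'max_depth': [None, 8, 11],
              'max_features': [3, 5, 8]}
GS = GridSearchCV(RandomForestClassifier(random_state=90), param_grid, cv=10)
GS.fit(data.data, data.target)
GS.best_params_, GS.best_score_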