Python pruning: decision tree pruning with scikit-learn

Reposted

mob6454cc798a0c 2023-12-26 19:40:16


Eight parameters: criterion; two randomness-related parameters (random_state, splitter); and five pruning parameters (max_depth, min_samples_split, min_samples_leaf, max_features, min_impurity_decrease)
One attribute: feature_importances_
Four methods: fit, score, apply, predict
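
As a quick orientation, the sketch below exercises the four methods and the one attribute listed above. The four one-feature samples are a made-up toy dataset for illustration, not data from this walkthrough.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): one feature, two classes separated around x = 1.5
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)                             # fit: learn the tree

accuracy = clf.score(X, y)                # score: mean accuracy on the given data
predicted = clf.predict([[0.5], [2.5]])   # predict: one class label per sample
leaves = clf.apply(X)                     # apply: index of the leaf each sample lands in
importances = clf.feature_importances_    # attribute: one weight per feature

print(accuracy, predicted, leaves, importances)
```

The rest of the walkthrough uses the same calls on the wine and Titanic data.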

Python 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 7.6.1 -- An enhanced Interactive Python.

########################## Classification tree ################

from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
wine.data.shape
Out[3]: (178, 13)

wine.target
 Out[4]: 
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2])

import pandas as pd
pd.concat([pd.DataFrame(wine.data),pd.DataFrame(wine.target)],axis=1)
 Out[5]: 
         0     1     2     3      4     5   ...    8      9     10    11      12  0 
 0    14.23  1.71  2.43  15.6  127.0  2.80  ...  2.29   5.64  1.04  3.92  1065.0   0
 1    13.20  1.78  2.14  11.2  100.0  2.65  ...  1.28   4.38  1.05  3.40  1050.0   0
 2    13.16  2.36  2.67  18.6  101.0  2.80  ...  2.81   5.68  1.03  3.17  1185.0   0
 3    14.37  1.95  2.50  16.8  113.0  3.85  ...  2.18   7.80  0.86  3.45  1480.0   0
 4    13.24  2.59  2.87  21.0  118.0  2.80  ...  1.82   4.32  1.04  2.93   735.0   0
 ..     ...   ...   ...   ...    ...   ...  ...   ...    ...   ...   ...     ...  ..
 173  13.71  5.65  2.45  20.5   95.0  1.68  ...  1.06   7.70  0.64  1.74   740.0   2
 174  13.40  3.91  2.48  23.0  102.0  1.80  ...  1.41   7.30  0.70  1.56   750.0   2
 175  13.27  4.28  2.26  20.0  120.0  1.59  ...  1.35  10.20  0.59  1.56   835.0   2
 176  13.17  2.59  2.37  20.0  120.0  1.65  ...  1.46   9.30  0.60  1.62   840.0   2
 177  14.13  4.10  2.74  24.5   96.0  2.05  ...  1.35   9.20  0.61  1.60   560.0   2

[178 rows x 14 columns]

# Feature names

wine.feature_names
 Out[6]: 
 ['alcohol',
  'malic_acid',
  'ash',
  'alcalinity_of_ash',
  'magnesium',
  'total_phenols',
  'flavanoids',
  'nonflavanoid_phenols',
  'proanthocyanins',
  'color_intensity',
  'hue',
  'od280/od315_of_diluted_wines',
  'proline']

# Class labels

wine.target_names
 Out[7]: array(['class_0', 'class_1', 'class_2'], dtype='<U7')

# Split into training and test sets

Xtrain, Xtest, Ytrain, Ytest = train_test_split(wine.data,wine.target,test_size=0.3)
Xtrain.shape
Out[9]: (124, 13)

Xtest.shape
Out[10]: (54, 13)
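
Because train_test_split shuffles the rows at random, the scores in this walkthrough will differ from run to run. A minimal sketch, assuming a reproducible split is wanted: passing random_state (420 is an arbitrary value) pins the split. The stratify argument is an optional extra that the original does not use; it keeps the class proportions equal in both parts.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()

# Same seed -> same split on every run (420 is an arbitrary choice)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=420, stratify=wine.target
)
Xtrain2, _, _, _ = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=420, stratify=wine.target
)

print(Xtrain.shape, Xtest.shape)
print((Xtrain == Xtrain2).all())  # the two splits are identical
```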

# Fit the model and score it on the test set

clf = tree.DecisionTreeClassifier(criterion="entropy")
clf = clf.fit(Xtrain, Ytrain)
score = clf.score(Xtest, Ytest)  # returns the prediction accuracy
score

Out[11]: 0.9629629629629629
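
The criterion argument selects the impurity measure the tree tries to reduce at each split: "entropy" (information entropy) or the default "gini" (Gini impurity). As a worked example, the sketch below computes both for the root node of the full wine dataset, whose class counts (59, 71, 48) can be read off wine.target above. The helper names are illustrative, not part of scikit-learn.

```python
from math import log2

def entropy(counts):
    """Information entropy in bits: sum(p * log2(1 / p))."""
    total = sum(counts)
    return sum(c / total * log2(total / c) for c in counts if c > 0)

def gini(counts):
    """Gini impurity: 1 - sum(p ** 2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

root = [59, 71, 48]   # samples per class in the full wine dataset
print(round(entropy(root), 3), round(gini(root), 3))

pure = [10, 0, 0]     # a node holding a single class has zero impurity
print(entropy(pure), gini(pure))
```

Both measures are largest when the classes are evenly mixed and fall to zero for a pure node, which is why a split that lowers them is considered a good split.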

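# Visualize the decision tree

# Chinese display names for the 13 features, in the same order as wine.feature_names
feature_name = ['酒精','苹果酸','灰','灰的碱性','镁','总酚','类黄酮','非黄烷类酚类','花青素','颜色强度','色调','od280/od315稀释葡萄酒','脯氨酸']
import graphviz
# class_names are display labels for class_0, class_1 and class_2
dot_data = tree.export_graphviz(clf
                                ,out_file = None
                                ,feature_names= feature_name
                                ,class_names=["琴酒","雪莉","贝尔摩德"]
                                ,filled=True
                                ,rounded=True
                                )
graph = graphviz.Source(dot_data)
graph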
 Out[14]:

[Figure: the fitted decision tree rendered by graphviz]
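
If graphviz is not installed, the same tree can be inspected as text. This is a sketch of an alternative the original does not use: sklearn.tree.export_text (available since scikit-learn 0.21) prints one line per node. The tree is fitted on the full wine data with max_depth=2 here only to keep the printout short.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()
clf = DecisionTreeClassifier(criterion="entropy", random_state=30, max_depth=2)
clf.fit(wine.data, wine.target)

# Plain-text rendering of the tree: one line per node, indented by depth
text = export_text(clf, feature_names=list(wine.feature_names))
print(text)
```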

# Inspect the feature importances

clf.feature_importances_
 Out[15]: 
 array([0.01424694, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.41643217, 0.        , 0.        , 0.23051758,
        0.        , 0.        , 0.33880331])

# Pair each importance with its feature name

[*zip(feature_name,clf.feature_importances_)]
 Out[17]: 
 [('酒精', 0.01424693586672224),
  ('苹果酸', 0.0),
  ('灰', 0.0),
  ('灰的碱性', 0.0),
  ('镁', 0.0),
  ('总酚', 0.0),
  ('类黄酮', 0.41643217174608416),
  ('非黄烷类酚类', 0.0),
  ('花青素', 0.0),
  ('颜色强度', 0.23051757969779874),
  ('色调', 0.0),
  ('od280/od315稀释葡萄酒', 0.0),
  ('脯氨酸', 0.33880331268939484)]
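
With thirteen features the paired list is easier to read once it is sorted and the unused features are dropped. A minimal sketch, assuming a tree fitted on the full wine data with the English feature names (the exact values will differ from the output above, which came from a random split):

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
clf = DecisionTreeClassifier(criterion="entropy", random_state=30)
clf.fit(wine.data, wine.target)

# Keep only the features the tree actually split on, most important first
ranked = sorted(
    ((name, imp) for name, imp in zip(wine.feature_names, clf.feature_importances_) if imp > 0),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, imp in ranked:
    print(f"{name:30s} {imp:.3f}")

total = clf.feature_importances_.sum()  # importances are normalized to sum to 1
```

A value of 0 means the feature was never chosen for a split in this particular tree, not that it carries no information.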

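random_state: at each split, scikit-learn considers the features in a random order (and, when max_features is below the number of features, only a random subset of them), then picks the candidate with the best impurity criterion as the split. Repeated fits can therefore produce different trees. Setting random_state fixes this randomness so the same tree is built every time.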

clf = tree.DecisionTreeClassifier(criterion="entropy",random_state=30)
clf = clf.fit(Xtrain, Ytrain)
score = clf.score(Xtest, Ytest)  # returns the prediction accuracy
score
 Out[18]: 0.9629629629629629
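
The effect of random_state can be checked directly. In this sketch (the helper name fit_importances is illustrative), two trees fitted on the same data with the same seed end up with identical feature importances:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()

def fit_importances(seed):
    """Fit a fresh tree with the given seed and return its feature importances."""
    clf = DecisionTreeClassifier(criterion="entropy", random_state=seed)
    return clf.fit(wine.data, wine.target).feature_importances_

# Two independent fits with the same seed build the same tree
same_seed = np.array_equal(fit_importances(30), fit_importances(30))
print(same_seed)
```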

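splitter also controls randomness in the tree and accepts two values. With "best", the tree still involves some randomness but chooses the best split among the candidate features, so the more important features are preferred. With "random", split points are drawn at random and the best of those random splits is used, so the tree is built more randomly.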

clf = tree.DecisionTreeClassifier(criterion="entropy"
                                  ,random_state=30
                                  ,splitter="random"
                                  )
clf = clf.fit(Xtrain, Ytrain)
score = clf.score(Xtest, Ytest)
score
Out[19]: 0.9629629629629629

import graphviz
dot_data = tree.export_graphviz(clf
                                ,feature_names= feature_name
                                ,class_names=["琴酒","雪莉","贝尔摩德"]
                                ,filled=True
                                ,rounded=True
                                )
graph = graphviz.Source(dot_data)
graph
 Out[20]:

[Figure: the decision tree fitted with splitter="random"]
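
To see what splitter changes without drawing the trees, their sizes can be compared. A rough sketch, fitted on the full wine data: get_depth and get_n_leaves (available since scikit-learn 0.21) report how large each tree grew. Random splits often, though not always, need a larger tree to separate the classes.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
shapes = {}
for splitter in ("best", "random"):
    clf = DecisionTreeClassifier(criterion="entropy", random_state=30, splitter=splitter)
    clf.fit(wine.data, wine.target)
    # (depth, number of leaves, accuracy on the data the tree was fitted on)
    shapes[splitter] = (clf.get_depth(), clf.get_n_leaves(), clf.score(wine.data, wine.target))
    print(splitter, shapes[splitter])
```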

Score on the training set

score_train = clf.score(Xtrain, Ytrain)
score_train
Out[21]: 1.0
 
####################### Pruning parameters

max_depth
Limits the maximum depth of the tree; every branch beyond the specified depth is cut off.

clf = tree.DecisionTreeClassifier(criterion="entropy"
                                  ,random_state=30
                                  ,splitter="random"
                                  ,max_depth=3)
dot_data = tree.export_graphviz(clf
                                ,feature_names= feature_name
                                ,class_names=["琴酒","雪莉","贝尔摩德"]
                                ,filled=True
                                ,rounded=True
                                )
graph = graphviz.Source(dot_data)
graph
Traceback (most recent call last):

  File "<ipython-input-22-078a2a9dc9b9>", line 12, in <module>
    ,rounded=True

  File "H:\Anaconda3\lib\site-packages\sklearn\tree\export.py", line 756, in export_graphviz
    check_is_fitted(decision_tree, 'tree_')

  File "H:\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 914, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})

NotFittedError: This DecisionTreeClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
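
The error is easy to reproduce in isolation: any method that needs a trained tree raises NotFittedError when the estimator has only been constructed. A small sketch using predict instead of export_graphviz (the input row is arbitrary):

```python
from sklearn.exceptions import NotFittedError
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=3)   # constructed, but fit was never called

try:
    clf.predict([[0.0, 1.0]])
    raised = False
except NotFittedError as err:
    raised = True
    print("caught:", type(err).__name__)
```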

Solution:

Call fit before exporting the tree:
clf = clf.fit(Xtrain, Ytrain)
Then rerun the script above:
dot_data = tree.export_graphviz(clf
                                ,feature_names= feature_name
                                ,class_names=["琴酒","雪莉","贝尔摩德"]
                                ,filled=True
                                ,rounded=True
                                )
graph = graphviz.Source(dot_data)
graph
 Out[26]:

[Figure: the decision tree limited to max_depth=3]
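
The depth limit can also be confirmed numerically. A minimal sketch, fitted on the full wine data: get_depth reports the depth the tree actually reached, with and without the limit.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
depths = {}
for max_depth in (None, 3):   # None lets the tree grow until its leaves are pure
    clf = DecisionTreeClassifier(criterion="entropy", random_state=30, max_depth=max_depth)
    clf.fit(wine.data, wine.target)
    depths[max_depth] = clf.get_depth()
    print(max_depth, depths[max_depth])
```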

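min_samples_leaf requires every child node produced by a split to contain at least min_samples_leaf training samples. If a split cannot satisfy this, it does not happen, or the split is placed so that every child node does contain at least min_samples_leaf samples.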

clf = tree.DecisionTreeClassifier(criterion="entropy"
                                   ,random_state=30
                                   ,splitter="random"
                                   ,max_depth=3
                                   ,min_samples_leaf=10)
 clf = clf.fit(Xtrain, Ytrain)
 dot_data = tree.export_graphviz(clf
                                 ,feature_names= feature_name
                                 ,class_names=["琴酒","雪莉","贝尔摩德"]
                                 ,filled=True
                                 ,rounded=True
 )
 graph = graphviz.Source(dot_data)
 graph
 Out[27]:

[Figure: the decision tree with max_depth=3 and min_samples_leaf=10]

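Every samples value is now at least 10: the splits that produced nodes with fewer than 10 samples in the previous figure no longer occur.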
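
The same observation can be made without reading the figure. This sketch, fitted on the full wine data, looks at the fitted tree's internal arrays: in clf.tree_, leaves are the nodes whose children_left entry is -1, and n_node_samples holds the sample count of each node.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
clf = DecisionTreeClassifier(criterion="entropy", random_state=30, min_samples_leaf=10)
clf.fit(wine.data, wine.target)

tree_ = clf.tree_
is_leaf = tree_.children_left == -1          # leaves have no children
leaf_sizes = tree_.n_node_samples[is_leaf]   # training samples in each leaf
print(sorted(leaf_sizes))

smallest_leaf = int(leaf_sizes.min())        # never below min_samples_leaf
```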

min_samples_split requires a node to contain at least min_samples_split training samples before it is allowed to split; otherwise the split does not happen.

clf = tree.DecisionTreeClassifier(criterion="entropy"
                                   ,random_state=30
                                   ,splitter="random"
                                   
                                   ,min_samples_split=60)
 clf = clf.fit(Xtrain, Ytrain)
 dot_data = tree.export_graphviz(clf
                                 ,feature_names= feature_name
                                 ,class_names=["琴酒","雪莉","贝尔摩德"]
                                 ,filled=True
                                 ,rounded=True
 )
 graph = graphviz.Source(dot_data)
 graph
 Out[28]:

[Figure: the decision tree with min_samples_split=60]
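
Where min_samples_leaf constrains the children of a split, min_samples_split constrains the node being split. A sketch of that check on the full wine data: every internal (non-leaf) node of the fitted tree should hold at least 60 samples.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
clf = DecisionTreeClassifier(criterion="entropy", random_state=30, min_samples_split=60)
clf.fit(wine.data, wine.target)

tree_ = clf.tree_
is_internal = tree_.children_left != -1            # nodes that were split
split_sizes = tree_.n_node_samples[is_internal]    # samples in each split node
print(sorted(split_sizes, reverse=True))

smallest_split_node = int(split_sizes.min())       # never below min_samples_split
```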

Finding the best pruning parameter

import matplotlib.pyplot as plt

test = []
for i in range(10):
    clf = tree.DecisionTreeClassifier(max_depth=i+1
                                      ,criterion="entropy"
                                      ,random_state=30
                                      ,splitter="random"
                                      )
    clf = clf.fit(Xtrain, Ytrain)
    score = clf.score(Xtest, Ytest)
    test.append(score)

plt.plot(range(1,11),test,color="red",label="max_depth")
plt.legend()
plt.show()

[Figure: test accuracy plotted against max_depth from 1 to 10]
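
The best depth can be read off the curve programmatically instead of by eye. A minimal sketch, assuming a fixed split (random_state=0 is arbitrary) so that the result is repeatable; when several depths tie, it keeps the shallowest one.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=0
)

depths = list(range(1, 11))
test = []
for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth, criterion="entropy",
                                 random_state=30, splitter="random")
    clf.fit(Xtrain, Ytrain)
    test.append(clf.score(Xtest, Ytest))

best_score = max(test)
best_depth = depths[test.index(best_score)]   # index() returns the first, i.e. shallowest, match
print(best_depth, best_score)
```

Choosing a parameter by its score on the test set makes that score an optimistic estimate; the cross-validation used later in the Titanic example is the more careful approach.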

The leaf node each sample falls into

clf.apply(Xtest)
 Out[36]: 
 array([22, 14, 30, 22, 30, 14, 14, 11, 22, 14, 14, 26, 22, 22, 30, 18, 30,
         9, 14,  4,  4, 14, 30, 26, 30, 14, 11, 30,  4, 30,  4,  4, 26,  8,
         4,  4, 14,  6, 30, 22, 30, 30, 26, 30, 24,  8,  9,  4, 30,  4, 30,
         4,  4, 30], dtype=int64)

The predicted class of each sample

clf.predict(Xtest)
 Out[37]: 
 array([1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 2, 0, 1, 1, 2, 2, 1,
        0, 0, 0, 1, 1, 0, 2, 0, 2, 2, 0, 2, 2, 2, 1, 2, 0, 1, 0, 0, 0, 0,
        1, 2, 1, 2, 0, 2, 0, 2, 2, 0])
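
apply and predict are two views of the same traversal: apply reports which leaf a sample ends in, and predict reports the class stored in that leaf. The sketch below checks both facts on a wine split (random_state=0 is an arbitrary choice for repeatability).

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=0
)
clf = DecisionTreeClassifier(criterion="entropy", random_state=30).fit(Xtrain, Ytrain)

leaves = clf.apply(Xtest)     # node index of the leaf reached by each sample
preds = clf.predict(Xtest)    # class label assigned to each sample

# Every index returned by apply is a leaf (it has no left child)
all_leaves = bool((clf.tree_.children_left[leaves] == -1).all())
# Samples that share a leaf always share a prediction
consistent = all(len(set(preds[leaves == leaf])) == 1 for leaf in np.unique(leaves))
print(all_leaves, consistent)
```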

############################# Titanic data #########################

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

data = pd.read_csv(r"H:\程志伟\python\菜菜的机器学习skleaen课堂\01 决策树课件数据源码\Taitanic data\data.csv",index_col= 0)
data.head()
 Out[40]: 
              Survived  Pclass  ... Cabin Embarked
 PassengerId                    ...               
 1                   0       3  ...   NaN        S
 2                   1       1  ...   C85        C
 3                   1       3  ...   NaN        S
 4                   1       1  ...  C123        S
 5                   0       3  ...   NaN        S

[5 rows x 11 columns]
data.info()
 <class 'pandas.core.frame.DataFrame'>
 Int64Index: 891 entries, 1 to 891
 Data columns (total 11 columns):
  #   Column    Non-Null Count  Dtype  
 ---  ------    --------------  -----  
  0   Survived  891 non-null    int64  
  1   Pclass    891 non-null    int64  
  2   Name      891 non-null    object 
  3   Sex       891 non-null    object 
  4   Age       714 non-null    float64
  5   SibSp     891 non-null    int64  
  6   Parch     891 non-null    int64  
  7   Ticket    891 non-null    object 
  8   Fare      891 non-null    float64
  9   Cabin     204 non-null    object 
  10  Embarked  889 non-null    object 
 dtypes: float64(2), int64(4), object(5)
 memory usage: 83.5+ KB

# Drop the columns with too many missing values, and the columns that on inspection are unrelated to the target y

data.drop(["Cabin","Name","Ticket"],inplace=True,axis=1)

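# Handle missing values: fill the column with many missing values; for features missing only one or two values, simply drop those rows

data["Age"] = data["Age"].fillna(data["Age"].mean())
data = data.dropna()

# Convert the categorical variables to numeric ones
# Binary variable: the comparison yields a boolean Series, and astype("int") turns True/False into 1/0,
# which is a convenient way to encode a two-class feature as 0 and 1
data["Sex"] = (data["Sex"]== "male").astype("int")

# Three-class variable: replace each value with its position in the list of unique values
labels = data["Embarked"].unique().tolist()
data["Embarked"] = data["Embarked"].apply(lambda x: labels.index(x))

# Inspect the processed dataset
data.head()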
 Out[47]: 
              Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
 PassengerId                                                              
 1                   0       3    1  22.0      1      0   7.2500         0
 2                   1       1    0  38.0      1      0  71.2833         1
 3                   1       3    0  26.0      0      0   7.9250         0
 4                   1       1    0  35.0      1      0  53.1000         0
 5                   0       3    1  35.0      0      0   8.0500         0
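
The preprocessing steps are easier to follow on a frame small enough to check by hand. The three-row DataFrame below is hypothetical, built only to mimic the relevant Titanic columns; the operations are the same ones used above.

```python
import pandas as pd

# Hypothetical three-passenger frame with one missing Age and one missing Embarked
data = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", None],
})

data["Age"] = data["Age"].fillna(data["Age"].mean())   # fill with the mean of 22 and 26
data = data.dropna()                                   # drop the row that has no Embarked

data["Sex"] = (data["Sex"] == "male").astype("int")    # True/False -> 1/0
labels = data["Embarked"].unique().tolist()            # values in order of first appearance
data["Embarked"] = data["Embarked"].apply(lambda x: labels.index(x))

print(data)
```

The integer codes for Embarked depend only on the order in which the values first appear, so they are identifiers rather than quantities.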

Extract the labels and the feature matrix, then split into training and test sets

X = data.iloc[:,data.columns != "Survived"]
y = data.iloc[:,data.columns == "Survived"]

from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,y,test_size=0.3)

# Reset the shuffled indices to 0..n-1
for i in [Xtrain, Xtest, Ytrain, Ytest]:
    i.index = range(i.shape[0])

Xtrain.head()
 Out[52]: 
    Pclass  Sex        Age  SibSp  Parch     Fare  Embarked
 0       2    1   2.000000      1      1  26.0000         0
 1       3    1  26.000000      0      0   7.8958         0
 2       1    0  35.000000      1      0  52.0000         0
 3       1    0  52.000000      1      1  93.5000         0
 4       3    1  29.699118      0      0   8.0500         0

# Fit the model and score it on the test set

clf = DecisionTreeClassifier(random_state=25)
clf = clf.fit(Xtrain, Ytrain)
score_ = clf.score(Xtest, Ytest)
score_
Out[53]: 0.7865168539325843

from sklearn.model_selection import cross_val_score

# Cross-validation

score = cross_val_score(clf,X,y,cv=10).mean()
 score
 Out[55]: 0.7739274770173645
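
cross_val_score returns one score per fold, and the transcript keeps only their mean. A short sketch of what sits behind that number; the bundled wine data stands in for the Titanic CSV here because that file is not available outside the author's machine.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()   # stand-in dataset (assumption), not the Titanic data
clf = DecisionTreeClassifier(random_state=25)

# cv=10: fit on nine folds, score on the held-out tenth, and rotate ten times
scores = cross_val_score(clf, wine.data, wine.target, cv=10)
print(scores.round(3))
print(scores.mean(), scores.std())
```

The spread between folds shows how much a single train/test split can mislead, which is why the mean is the more stable figure to compare models on.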

Observe how well the model fits at different max_depth values

tr = []
te = []
for i in range(10):
    clf = DecisionTreeClassifier(random_state=25
                                 ,max_depth=i+1
                                 ,criterion="entropy"
                                 )
    clf = clf.fit(Xtrain, Ytrain)
    score_tr = clf.score(Xtrain,Ytrain)
    score_te = cross_val_score(clf,X,y,cv=10).mean()
    tr.append(score_tr)
    te.append(score_te)

print(max(te))
0.8177860061287026

plt.plot(range(1,11),tr,color="red",label="train")
plt.plot(range(1,11),te,color="blue",label="test")
plt.xticks(range(1,11))
plt.legend()
plt.show()

[Figure: training accuracy (red) and cross-validated accuracy (blue) plotted against max_depth from 1 to 10]

Tune the parameters with grid search

import numpy as np
gini_thresholds = np.linspace(0,0.5,20)
parameters = {'splitter':('best','random')
    ,'criterion':("gini","entropy")
    ,"max_depth":[*range(1,10)]
    ,'min_samples_leaf':[*range(1,50,5)]
    ,'min_impurity_decrease':[*gini_thresholds]
    }

clf = DecisionTreeClassifier(random_state=25)
GS = GridSearchCV(clf, parameters, cv=10)
GS.fit(Xtrain,Ytrain)
 Out[62]: 
 GridSearchCV(cv=10, error_score='raise-deprecating',
              estimator=DecisionTreeClassifier(class_weight=None,
                                               criterion='gini', max_depth=None,
                                               max_features=None,
                                               max_leaf_nodes=None,
                                               min_impurity_decrease=0.0,
                                               min_impurity_split=None,
                                               min_samples_leaf=1,
                                               min_samples_split=2,
                                               min_weight_fraction_leaf=0.0,
                                               presort=False, random_state=25,
                                               splitter='best'),
              iid='warn', n_...
                                                    0.23684210526315788,
                                                    0.2631578947368421,
                                                    0.2894736842105263,
                                                    0.3157894736842105,
                                                    0.3421052631578947,
                                                    0.3684210526315789,
                                                    0.39473684210526316,
                                                    0.42105263157894735,
                                                    0.4473684210526315,
                                                    0.47368421052631576, 0.5],
                          'min_samples_leaf': [1, 6, 11, 16, 21, 26, 31, 36, 41,
                                               46],
                          'splitter': ('best', 'random')},
              pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
              scoring=None, verbose=0)

# Best parameters

GS.best_params_
 Out[63]: 
 {'criterion': 'gini',
  'max_depth': 3,
  'min_impurity_decrease': 0.0,
  'min_samples_leaf': 6,
  'splitter': 'best'}

# Best cross-validated score

GS.best_score_
 Out[64]: 0.8183279742765274
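
Grid search tries every combination of the listed values, so its cost grows multiplicatively. The sketch below counts the candidates in the grid above with sklearn.model_selection.ParameterGrid; nothing is fitted.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

parameters = {
    "splitter": ("best", "random"),                      # 2 values
    "criterion": ("gini", "entropy"),                    # 2 values
    "max_depth": [*range(1, 10)],                        # 9 values
    "min_samples_leaf": [*range(1, 50, 5)],              # 10 values
    "min_impurity_decrease": [*np.linspace(0, 0.5, 20)], # 20 values
}

n_candidates = len(ParameterGrid(parameters))   # 2 * 2 * 9 * 10 * 20
n_fits = n_candidates * 10                      # each candidate is fitted once per fold (cv=10)
print(n_candidates, n_fits)                     # 7200 72000
```

Note that best_score_ is the mean cross-validated accuracy on the training data. Because GridSearchCV refits the best model by default, GS.score(Xtest, Ytest) gives an estimate on data the search never saw.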
