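This article covers the following:
- a brief introduction to GBDT;
- the parameters of sklearn's GBDT implementation (GradientBoostingClassifier);
- how to explore the training data with pandas;
- how to tune GBDT with grid search.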
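Introduction to GBDT
GBDT stands for Gradient Boosting Decision Tree. It can be used for classification (it usually does reasonably well on binary problems and, in the author's experience, less well on multi-class ones), for regression (where it is a good fit), and for feature selection. Here GBDT is used for classification and for ranking feature importance.
GBDT = Gradient Boosting + Decision Tree
Gradient Boosting = Gradient Descent + Boosting
Boosting builds an additive model by stacking simple models one after another, with each new model fitted to reduce the residual left by the previous ones, until the ensemble classifies or regresses the data well. Decision Tree is the familiar decision tree, and Gradient Descent is the usual gradient descent algorithm.
GBDT uses CART trees as its weak learner by default. Other weak learners can in principle be used, as long as they have low variance and high bias and fit into the boosting framework.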
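To make the "additive model fitted on residuals" idea concrete, here is a minimal sketch in plain Python. It assumes squared loss (for which the negative gradient is just the residual), a single numeric feature, and depth-1 trees (stumps) as the weak learner. The helper names fit_stump, boost and mse are made up for this illustration, and this is not how sklearn implements GBDT internally.

```python
# Minimal gradient boosting for squared loss with decision stumps (illustrative only).

def fit_stump(xs, residuals):
    """Find the threshold on a 1-D feature that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest value cannot separate anything
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Additive model: F_m(x) = F_{m-1}(x) + learning_rate * h_m(x)."""
    init = sum(ys) / len(ys)  # F_0 is the mean of the targets
    stumps = []
    preds = [init] * len(xs)
    for _ in range(n_estimators):
        # For squared loss the negative gradient is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)
        stumps.append(h)
        preds = [p + learning_rate * h(x) for x, p in zip(xs, preds)]
    return lambda x: init + learning_rate * sum(h(x) for h in stumps)


def mse(model, xs, ys):
    """Mean squared error of a model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up toy data with three rough plateaus.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 5.1, 4.8]

few = boost(xs, ys, n_estimators=5)
many = boost(xs, ys, n_estimators=100)
print(mse(few, xs, ys), mse(many, xs, ys))
```

Each round shrinks the training error a little; how many rounds are worth running, and with what step size, is exactly what learning_rate and n_estimators control.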
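GBDT parameters in sklearn
The constructor signature below comes from an older scikit-learn release. Some of these parameters (for example presort and min_impurity_split) have since been removed, so check the documentation for the version you use.
def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
             subsample=1.0, criterion='friedman_mse', min_samples_split=2,
             min_samples_leaf=1, min_weight_fraction_leaf=0.,
             max_depth=3, min_impurity_decrease=0.,
             min_impurity_split=None, init=None,
             random_state=None, max_features=None, verbose=0,
             max_leaf_nodes=None, warm_start=False,
             presort='auto', validation_fraction=0.1,
             n_iter_no_change=None, tol=1e-4):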
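1. Parameters related to tree structure
- min_samples_split: the minimum number of samples a node must contain to be split further; a node with fewer samples than this threshold is not split. It helps control overfitting: a larger value keeps the tree from splitting on unimportant features that matter only for a handful of samples. Setting it too high can cause underfitting, so choose it with the sample size and class balance in mind and confirm it by cross-validation (CV);
- min_samples_leaf: the minimum number of samples in a leaf; a split that would leave a leaf with fewer samples is not made, which acts like pruning and also reduces the risk of overfitting. Take particular care with this parameter on imbalanced data, because it determines whether a small minority class can still be separated out;
- min_weight_fraction_leaf: like min_samples_leaf, but expressed as a fraction of the total sample weight rather than as a sample count;
- max_depth: the maximum depth of each tree; it limits overfitting, since a single tree that is too deep can pick up irrelevant features;
- max_leaf_nodes: the maximum number of leaf nodes per tree, also used to limit overfitting. If it is set, max_depth is reportedly ignored, although this may depend on the scikit-learn version, so verify it for your release;
- max_features: the number of features considered when looking for a split. GBM borrows this from random forests: each split looks at only a subset of the features, which lowers the correlation between trees and reduces overfitting. Common candidate values are the log or the square root of the number of features;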
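2. Parameters related to boosting
- learning_rate: the learning rate, i.e. the weight applied to each base model's contribution. A lower learning rate usually generalizes better, because smaller steps make it less likely that the loss keeps oscillating at a high value. It also needs more base learners, which makes training slower;
- n_estimators: the number of base learners, here the number of trees. With the learning rate held fixed, more base learners raise the risk of overfitting, so this parameter is normally tuned together with learning_rate;
- subsample: the fraction of samples used to fit each tree. When it is below 1.0 (the default), each tree is built on a random subset of the samples rather than on all of them. The idea again comes from random forests, except that the sampling here is without replacement. It also reduces the risk of overfitting; a value around 0.8 is typical;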
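3. Other parameters
- loss: the loss function; classification and regression use different losses, and the default is usually fine;
- init: an estimator object that provides the initial predictions, i.e. the model GBM starts from;
- random_state: the random seed. Fix it while tuning, otherwise run-to-run randomness will affect the CV-based comparison between parameter settings;
- verbose: whether to print the training log; off by default;
- warm_start: if you stopped training a GBM at some point and want to continue from there, set this to True so that the existing trees are reused and less training is repeated.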
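A short sketch of how warm_start behaves in practice. The data comes from make_classification and is synthetic, chosen only so the snippet runs on its own. The parameter values are arbitrary and not tuned for the article's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

clf = GradientBoostingClassifier(
    n_estimators=50,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,
    warm_start=True,   # keep the fitted trees between calls to fit()
    random_state=0,
)
clf.fit(X_demo, y_demo)
trees_after_first_fit = clf.estimators_.shape[0]

# Ask for 50 more trees; with warm_start=True only the new ones are trained.
clf.set_params(n_estimators=100)
clf.fit(X_demo, y_demo)
trees_after_second_fit = clf.estimators_.shape[0]

print(trees_after_first_fit, trees_after_second_fit)
```

Without warm_start=True the second call to fit() would discard the first 50 trees and train all 100 from scratch.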
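Overall approach
The training data for this experiment is the same as in 《【sklearn】SVM(支持向量机) - 预测在网具有单卡转合约倾向的客户》 (an SVM article on predicting which in-network customers are likely to move from a SIM-only plan to a contract), so it is not described in detail here.
The overall procedure is:
- examine how the training data is distributed;
- preprocess the data (feature encoding, normalization, and so on);
- train a simple model with cross-validation, evaluate it, and rank the features by importance;
- set a relatively high learning_rate and tune the number of iterations, n_estimators;
- with learning_rate fixed and the best n_estimators chosen, tune max_depth and min_samples_split;
- with learning_rate fixed and the best n_estimators, max_depth and min_samples_split chosen, tune min_samples_leaf;
- with learning_rate fixed and the best n_estimators, max_depth, min_samples_split and min_samples_leaf chosen, tune max_features;
- with learning_rate fixed and the best n_estimators, max_depth, min_samples_split, min_samples_leaf and max_features chosen, tune subsample and pick its best value;
- tune learning_rate and n_estimators together: lower learning_rate and raise n_estimators in proportion;
- train the model with the best hyperparameters and evaluate it.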
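The tuning steps in this list amount to a staged search: tune one group of parameters, freeze the winners, then move on to the next group. The sketch below shows only that control flow. staged_search is a made-up helper, and toy_score is a hypothetical stand-in for a cross-validated score; in the real workflow each stage is a GridSearchCV run.

```python
from itertools import product


def staged_search(score, start, stages):
    """Tune one group of parameters at a time, keeping earlier winners fixed."""
    best = dict(start)
    for grid in stages:
        names = list(grid)
        candidates = []
        for values in product(*(grid[n] for n in names)):
            params = {**best, **dict(zip(names, values))}
            candidates.append((score(params), params))
        best = max(candidates, key=lambda c: c[0])[1]
    return best


def toy_score(p):
    """Hypothetical score that peaks at n_estimators=60, max_depth=5, subsample=0.8."""
    return -(
        ((p["n_estimators"] - 60) / 10) ** 2
        + (p["max_depth"] - 5) ** 2
        + ((p["subsample"] - 0.8) * 10) ** 2
    )


start = {"n_estimators": 20, "max_depth": 8, "subsample": 1.0}
stages = [
    {"n_estimators": range(20, 101, 10)},
    {"max_depth": range(1, 9)},
    {"subsample": [0.6, 0.7, 0.8, 0.9]},
]
best = staged_search(toy_score, start, stages)
print(best)
```

A staged search evaluates far fewer combinations than one joint grid, at the cost of possibly missing interactions between parameters tuned in different stages. The toy score has no such interactions, so the staged result here coincides with the joint optimum.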
Implementation
Import the required modules:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import OneHotEncoder
%matplotlib inline
from pylab import mpl
mpl.rcParams['font.sans-serif'] = ['FangSong']  # default font, needed to render the Chinese column names
mpl.rcParams['axes.unicode_minus'] = False  # stop the minus sign '-' from showing as a box in saved figures
Read the training data from a local file:
# Read the training data from a local file
data = pd.read_csv(r'./data_carrier_svm.csv', encoding='utf-8')
data.head()
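Compare the distribution of outgoing call duration between the two groups of users:
# Data exploration
# Outgoing call duration ('主叫时长(分)', in minutes), split by the label
# '是否潜在合约用户' (1 = potential contract user)
cond = data['是否潜在合约用户'] == 1
data[cond]['主叫时长(分)'].hist(alpha=0.5, label='potential contract user')
data[~cond]['主叫时长(分)'].hist(color='r', alpha=0.5, label='non-potential contract user')
plt.legend()
Compare the distribution of incoming call duration between the two groups of users:
# Incoming call duration ('被叫时长(分)', in minutes), split by the same label
cond = data['是否潜在合约用户'] == 1
data[cond]['被叫时长(分)'].hist(alpha=0.5, label='potential contract user')
data[~cond]['被叫时长(分)'].hist(color='r', alpha=0.5, label='non-potential contract user')
plt.legend()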
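Compare service types between the two groups of users:
# Count users per label and service type ('业务类型'); '用户标识' is the user ID column
grouped = data.groupby(['是否潜在合约用户', '业务类型'])['用户标识'].count().unstack()
grouped.plot(kind='bar', alpha=1.0, rot=0)
Count the samples in each class:
# Count the samples in each class
data['是否潜在合约用户'].value_counts()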
Visualize the two classes:
# Scatter plot of outgoing call duration against free data allowance ('免费流量'),
# coloured by class
y = data.loc[:, '是否潜在合约用户']
plt.scatter(data.loc[:, '主叫时长(分)'], data.loc[:, '免费流量'], c=y, alpha=0.5)
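Split the data into features and labels:
# Data preprocessing
# Separate the feature columns from the label column
X = data.loc[:, '业务类型': '余额']  # every column from service type through balance ('余额')
y = data.loc[:, '是否潜在合约用户']
print('The shape of X is {0}'.format(X.shape))
print('The shape of y is {0}'.format(y.shape))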
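Encode the categorical feature:
# Encode the categorical feature
# Custom mapping function
def service_mapping(cell):
    if cell == '2G':
        return 2
    elif cell == '3G':
        return 3
    elif cell == '4G':
        return 4
# Map the string values of the service type to integers
service_map = X['业务类型'].map(service_mapping)
service = pd.DataFrame(service_map)
# Use OneHotEncoder to turn the categorical feature into 0/1 indicator columns
enc = OneHotEncoder()
service_enc = enc.fit_transform(service).toarray()
service_enc
# Names of the 0/1 indicator columns
# (active_features_ exists only in older scikit-learn; newer releases expose enc.categories_[0] instead)
service_names = enc.active_features_.tolist()
service_newname = [str(x) + 'G' for x in service_names]
service_df = pd.DataFrame(service_enc, columns=service_newname)
service_df.head()
# Append the indicator columns and drop the original string column
X_enc = pd.concat([X, service_df], axis=1).drop('业务类型', axis=1)
X_enc.head()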
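For a single string column like this one, pandas can produce the same kind of 0/1 columns directly. The sketch below uses a small hypothetical frame (X_demo) rather than the article's data; only the two column names are taken from the article.

```python
import pandas as pd

# Hypothetical miniature of the feature table; only the column names follow the article.
X_demo = pd.DataFrame({
    "业务类型": ["2G", "3G", "4G", "3G"],   # service type
    "余额": [10.5, 3.2, 88.0, 41.7],        # balance
})

# One indicator column per category, named after the category itself.
dummies = pd.get_dummies(X_demo["业务类型"], dtype=int)
X_demo_enc = pd.concat([X_demo.drop("业务类型", axis=1), dummies], axis=1)
print(X_demo_enc)
```

This skips the intermediate integer mapping. OneHotEncoder remains the better choice when the same encoding has to be reapplied to new data later, because the fitted encoder remembers the categories it saw.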
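Normalize the data:
# Normalize the data
# Note: normalize() rescales each sample (row) to unit L2 norm; it does not rescale each feature (column)
from sklearn.preprocessing import normalize
X_normalized = normalize(X_enc)
X_normalized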
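The difference between row-wise and column-wise scaling is easy to miss, so here is a small comparison on a made-up matrix. If per-feature scaling is what is wanted, a scaler such as MinMaxScaler is one option; whether that suits this dataset is an assumption, not something the article tests.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

# Hypothetical two-feature matrix with very different column ranges.
X_demo = np.array([[3.0, 400.0],
                   [6.0, 800.0],
                   [9.0, 100.0]])

row_scaled = normalize(X_demo)                      # each row divided by its own L2 norm
col_scaled = MinMaxScaler().fit_transform(X_demo)   # each column mapped to [0, 1]

print(np.linalg.norm(row_scaled, axis=1))                # norm of every row
print(col_scaled.min(axis=0), col_scaled.max(axis=0))    # range of every column
```

Tree-based models split on thresholds within one feature at a time, so they are largely insensitive to per-feature scaling; row-wise normalization, by contrast, changes the relationships between features within a sample.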
Split the data into a training set and a test set:
# Split the dataset
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
# Hold out 20% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.2, random_state=112)
print('The shape of X_train is {0}'.format(X_train.shape))
print('The shape of X_test is {0}'.format(X_test.shape))
X_train
Visualize the training data:
# Scatter plot of the first two features of the training set, coloured by class
# plt.scatter(X_train.iloc[:, 0], X_train.iloc[:, 1], c=y_train, alpha=0.5)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, alpha=0.5)
Define a helper that fits a model and reports its performance:
# Fit a model, report training and CV metrics, and plot feature importances
def modelfit(alg, X_train, y_train, performCV=True, printFeatureImportance=True, cv_folds=5):
    alg.fit(X_train, y_train)
    # Predict on the training set
    train_predictions = alg.predict(X_train)  # predicted labels
    train_predprob = alg.predict_proba(X_train)[:, 1]  # predicted probability of the positive class
    # Perform cross-validation: compute the cross-validated AUC
    if performCV:
        cv_score = cross_val_score(alg, X_train, y_train, cv=cv_folds, scoring='roc_auc')
    # Print the model report
    print('\nModel Report')
    print('Accuracy (Train): %3.4f' % metrics.accuracy_score(y_train.values, train_predictions))
    # IMPORTANT: first argument is true values, second argument is predicted probabilities
    print('AUC Score (Train): %f' % metrics.roc_auc_score(y_train, train_predprob))
    if performCV:
        print('CV Score: Mean - %.7g | Std - %.7g | Min - %.7g | Max - %.7g' % (
            np.mean(cv_score), np.std(cv_score), np.min(cv_score), np.max(cv_score)))
    # Plot feature importances
    if printFeatureImportance:
        # X_train is a NumPy array here, so the feature names come from X_enc
        # feat_imp = pd.Series(alg.feature_importances_, X_train.columns.tolist()).sort_values(ascending=True)
        feat_imp = pd.Series(alg.feature_importances_, X_enc.columns.tolist()).sort_values(ascending=True)
        feat_imp.plot(kind='barh', title='Feature Importances')
        # In a horizontal bar chart the features are on the y axis
        plt.ylabel('Feature')
        _ = plt.xlabel('Relative importance')
Train a simple model as the baseline:
# Train a simple model as the baseline
# Instantiate the model
clf0 = GradientBoostingClassifier(random_state=110)
# Fit on the training set
clf0.fit(X_train, y_train)
# Predict on the test set
y_pred = clf0.predict(X_test)
# Compute the accuracy
score = metrics.accuracy_score(y_test, y_pred)
print('The accuracy score of the model is: {0}'.format(score))
# Inspect the confusion matrix
metrics.confusion_matrix(y_test, y_pred)
Evaluate the simple model and rank the features by importance:
# Evaluate the simple model
# Instantiate the model
clf0 = GradientBoostingClassifier(random_state=110)
modelfit(clf0, X_train, y_train)
Set a relatively high learning_rate and tune the number of iterations, n_estimators:
# GBDT tuning
# Range of iteration counts to try
param_test1 = {'n_estimators': range(20, 251, 10)}
estimator = GradientBoostingClassifier(learning_rate=0.2, min_samples_split=50, min_samples_leaf=5, max_depth=8,
                                       max_features='sqrt', subsample=0.8, random_state=10)
# iid is accepted only by older scikit-learn; drop it on releases where GridSearchCV no longer takes it
gsearch1 = GridSearchCV(estimator, param_grid=param_test1, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch1.fit(X_train, y_train)
# Examine the best model
# print(gsearch1.grid_scores_)
print(gsearch1.best_score_)
print(gsearch1.best_params_)
print(gsearch1.best_estimator_)
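Besides best_score_ and best_params_, it is often useful to look at the score of every candidate. The sketch below reads them from cv_results_, the attribute current scikit-learn provides in place of the grid_scores_ seen in the commented-out line. The data is synthetic and the grid is deliberately tiny so that the snippet runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the article's training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

search = GridSearchCV(
    GradientBoostingClassifier(learning_rate=0.2, max_depth=3, random_state=10),
    param_grid={"n_estimators": [10, 20, 30]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_demo, y_demo)

# One row per candidate, best mean CV score first.
results = pd.DataFrame(search.cv_results_)[
    ["param_n_estimators", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

Comparing mean_test_score against std_test_score shows whether the winning candidate is clearly better or only marginally ahead of its neighbours.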
With learning_rate fixed and the best n_estimators chosen, tune max_depth and min_samples_split:
# Tune the tree parameters max_depth and min_samples_split
# Grid search on max_depth and min_samples_split
param_test2 = {'max_depth': range(1, 9, 1), 'min_samples_split': range(10, 101, 10)}
estimator = GradientBoostingClassifier(learning_rate=0.2, n_estimators=240, max_features='sqrt', subsample=0.8,
                                       random_state=10)
gsearch2 = GridSearchCV(estimator, param_grid=param_test2, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch2.fit(X_train, y_train)
print(gsearch2.best_score_)
print(gsearch2.best_params_)
print(gsearch2.best_estimator_)
With learning_rate fixed and the best n_estimators, max_depth and min_samples_split chosen, tune min_samples_leaf:
# Tune the tree parameter min_samples_leaf
# Grid search on min_samples_leaf (the commented-out grid also varied min_samples_split)
# param_test3 = {'min_samples_split': range(90, 201, 20), 'min_samples_leaf': range(5, 51, 5)}
param_test3 = {'min_samples_leaf': range(5, 51, 5)}
estimator = GradientBoostingClassifier(learning_rate=0.2, n_estimators=240, max_depth=5, min_samples_split=80, max_features='sqrt',
                                       subsample=0.8, random_state=10)
gsearch3 = GridSearchCV(estimator, param_grid=param_test3, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch3.fit(X_train, y_train)
print(gsearch3.best_score_)
print(gsearch3.best_params_)
print(gsearch3.best_estimator_)
With learning_rate fixed and the best n_estimators, max_depth, min_samples_split and min_samples_leaf chosen, tune max_features:
# Tune the tree parameter max_features
# Grid search on max_features
param_test4 = {'max_features': range(1, 8, 1)}
estimator = GradientBoostingClassifier(learning_rate=0.2, n_estimators=240, max_depth=5, min_samples_split=80,
                                       min_samples_leaf=10, subsample=0.8, random_state=10)
gsearch4 = GridSearchCV(estimator, param_grid=param_test4, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch4.fit(X_train, y_train)
print(gsearch4.best_score_)
print(gsearch4.best_params_)
print(gsearch4.best_estimator_)
With learning_rate fixed and the best n_estimators, max_depth, min_samples_split, min_samples_leaf and max_features chosen, tune subsample and pick its best value:
# Tune the boosting parameter subsample
# Grid search on subsample
param_test5 = {'subsample': [0.6, 0.7, 0.75, 0.8, 0.85, 0.9]}
estimator = GradientBoostingClassifier(learning_rate=0.2, n_estimators=240, max_depth=5, min_samples_split=80,
                                       min_samples_leaf=10, max_features=3, random_state=10)
gsearch5 = GridSearchCV(estimator, param_grid=param_test5, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch5.fit(X_train, y_train)
print(gsearch5.best_score_)
print(gsearch5.best_params_)
print(gsearch5.best_estimator_)
Tune learning_rate and n_estimators together: lower learning_rate and raise n_estimators in proportion.
With learning_rate=0.1, n_estimators=480:
# Tune learning_rate and n_estimators
# learning_rate=0.1, n_estimators=480
gbm_tuned_1 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=480, max_depth=5, min_samples_split=80,
                                         min_samples_leaf=10, max_features=3, subsample=0.8, random_state=10)
modelfit(gbm_tuned_1, X_train, y_train)
With learning_rate=0.05, n_estimators=960:
# learning_rate=0.05, n_estimators=960
gbm_tuned_2 = GradientBoostingClassifier(learning_rate=0.05, n_estimators=960, max_depth=5, min_samples_split=80,
                                         min_samples_leaf=10, max_features=3, subsample=0.8, random_state=10)
modelfit(gbm_tuned_2, X_train, y_train)
With learning_rate=0.01, n_estimators=4800:
# learning_rate=0.01, n_estimators=4800
gbm_tuned_3 = GradientBoostingClassifier(learning_rate=0.01, n_estimators=4800, max_depth=5, min_samples_split=80,
                                         min_samples_leaf=10, max_features=3, subsample=0.8, random_state=10)
modelfit(gbm_tuned_3, X_train, y_train)
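The three settings above all keep the product learning_rate × n_estimators equal to the tuned baseline (0.2 × 240 = 48). A small helper makes the arithmetic explicit; scaled_settings is a made-up name for this illustration.

```python
def scaled_settings(base_lr, base_n, new_lrs):
    """Keep learning_rate * n_estimators constant while lowering the learning rate."""
    return [(lr, round(base_n * base_lr / lr)) for lr in new_lrs]


settings = scaled_settings(base_lr=0.2, base_n=240, new_lrs=[0.1, 0.05, 0.01])
print(settings)
```

Holding the product constant is a rule of thumb rather than a guarantee; the cross-validated scores reported by modelfit are what decide which setting to keep.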
Train the model with the best hyperparameters:
# Train the model with the best hyperparameters
# Instantiate the model
# learning_rate=0.01, n_estimators=4800
gbm_tuned_3 = GradientBoostingClassifier(learning_rate=0.01, n_estimators=4800, max_depth=5, min_samples_split=80,
                                         min_samples_leaf=10, max_features=3, subsample=0.8, random_state=10)
# Fit on the training set
gbm_tuned_3.fit(X_train, y_train)
# Predict on the test set
y_pred = gbm_tuned_3.predict(X_test)
y_pred_proba = gbm_tuned_3.predict_proba(X_test)[:, 1]
Evaluate the model:
# Model evaluation
# Compute the accuracy
score = metrics.accuracy_score(y_test, y_pred)
print('The accuracy score of the model for test data is: {0}'.format(score))
# Compute the AUC
auc_score = metrics.roc_auc_score(y_test, y_pred_proba)
print('The AUC score of the model for test data is: {0}'.format(auc_score))
# Inspect the confusion matrix
metrics.confusion_matrix(y_test, y_pred)
# IMPORTANT: first argument is true values, second argument is predicted probabilities
# fpr: false positive rate (= 1 - specificity), tpr: true positive rate
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for the potential contract user classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
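The rates named on the ROC axes can be read straight off a confusion matrix. For binary labels 0/1, sklearn's confusion_matrix returns [[tn, fp], [fn, tp]], with rows for the true class and columns for the predicted class. The sketch below recomputes the counts by hand on made-up labels so that the definitions are visible; confusion_counts is a made-up helper.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


# Made-up labels, for illustration only.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true_demo, y_pred_demo)
accuracy = (tp + tn) / len(y_true_demo)
recall = tp / (tp + fn)        # true positive rate (sensitivity)
specificity = tn / (tn + fp)   # 1 - false positive rate
precision = tp / (tp + fp)
print(tn, fp, fn, tp, accuracy, recall, specificity, precision)
```

On imbalanced data, recall and precision for the minority class usually say more about the model than accuracy alone.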
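Summary
To speed up tuning with grid search:
- start with a relatively large learning_rate so that each run finishes faster;
- tune the hyperparameters in separate groups rather than all at once.
Also keep the following in mind while tuning:
- if the training data is imbalanced, subsampling can help;
- if the best value of a parameter falls on the smallest or largest value in its search range, adjust the range and tune that parameter again. For example, if the range for n_estimators is param_test1 = {'n_estimators': range(20, 81, 10)} and print(gsearch1.best_params_) gives {'n_estimators': 80}, extend the range upward and search again, for instance with {'n_estimators': range(70, 151, 10)}.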
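The boundary check in the last point is easy to automate. A minimal sketch follows; on_grid_boundary is a made-up helper, and the second call uses a hypothetical best value of 100 purely to show the other outcome.

```python
def on_grid_boundary(grid, best_value):
    """True if the best value sits at either end of the searched range."""
    values = sorted(grid)
    return best_value in (values[0], values[-1])


first_range = range(20, 81, 10)
print(on_grid_boundary(first_range, 80))     # best value at the upper end: widen and rerun

second_range = range(70, 151, 10)
print(on_grid_boundary(second_range, 100))   # hypothetical interior result: range looks adequate
```

With a fitted GridSearchCV, the value to pass in is gsearch1.best_params_['n_estimators'].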