ML之XGBoost:XGBoost参数调优的优秀外文翻译—《XGBoost中的参数调优完整指南(带python中的代码)》(四)

目录

Step 3: Tune gamma / 步骤3:调整gamma

Step 4: Tune subsample and colsample_bytree / 第4步:调整subsample和colsample_bytree

Step 5: Tuning Regularization Parameters / 步骤5:调整正则化参数

Step 6: Reducing Learning Rate / 第6步:降低学习率

End Notes / 尾注

原文题目:《Complete Guide to Parameter Tuning in XGBoost with codes in Python》
原文地址:https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
所有权为原文所有,本文只负责翻译。

相关文章
ML之XGBoost:XGBoost算法模型(相关配图)的简介(XGBoost并行处理)、关键思路、代码实现(目标函数/评价函数)、安装、使用方法、案例应用之详细攻略
ML之XGBoost:Kaggle神器XGBoost算法模型的简介(资源)、安装、使用方法、案例应用之详细攻略
ML之XGBoost:XGBoost参数调优的优秀外文翻译—《XGBoost中的参数调优完整指南(带python中的代码)》(一)
ML之XGBoost:XGBoost参数调优的优秀外文翻译—《XGBoost中的参数调优完整指南(带python中的代码)》(二)
ML之XGBoost:XGBoost参数调优的优秀外文翻译—《XGBoost中的参数调优完整指南(带python中的代码)》(三)
ML之XGBoost:XGBoost参数调优的优秀外文翻译—《XGBoost中的参数调优完整指南(带python中的代码)》(四)

 

Step 3: Tune gamma
步骤3:调整gamma

Now let's tune the gamma value using the parameters already tuned above. Gamma can take various values, but I'll check 5 values here. You can go into more precise values later.
现在让我们使用上面已经调好的参数来调整gamma值。gamma可以取多种不同的值,但我在这里只检验5个值。之后您可以再尝试更精细的取值。

param_test3 = {
 'gamma':[i/10.0 for i in range(0,5)]
}
gsearch3 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=4,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test3, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch3.fit(train[predictors],train[target])
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_

This shows that our original value of gamma, i.e. 0, is the optimum one. Before proceeding, a good idea would be to re-calibrate the number of boosting rounds for the updated parameters.
这表明gamma的原始值(即0)就是最佳值。在继续之前,最好针对更新后的参数重新校准boosting的迭代轮数。
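The modelfit call below (the helper defined earlier in this series) handles this re-calibration internally. As a rough sketch of that step only, assuming xgboost is imported as xgb and that predictors and target hold the feature and label column names used earlier in this guide, it boils down to an xgb.cv run with early stopping:

import xgboost as xgb

# Rough sketch of how the number of boosting rounds is re-calibrated:
# run XGBoost's built-in cross-validation with early stopping and keep
# the number of rounds at which the CV metric stops improving.
def recalibrate_n_estimators(alg, dtrain, predictors, target,
                             cv_folds=5, early_stopping_rounds=50):
    xgb_param = alg.get_xgb_params()
    dmatrix = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
    cvresult = xgb.cv(xgb_param, dmatrix,
                      num_boost_round=alg.get_params()['n_estimators'],
                      nfold=cv_folds, metrics='auc',
                      early_stopping_rounds=early_stopping_rounds)
    alg.set_params(n_estimators=cvresult.shape[0])  # rows returned = rounds kept
    return alg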

xgb2 = XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=4,
 min_child_weight=6,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb2, train, predictors)

Here, we can see the improvement in score. So the final parameters are:
在这里,我们可以看到分数的提高。所以,最终的参数为:

  • max_depth: 4
  • min_child_weight: 6
  • gamma: 0

 

Step 4: Tune subsample and colsample_bytree
第4步:调整subsample和colsample_bytree

The next step would be to try different subsample and colsample_bytree values. Let's do this in 2 stages as well and take values 0.6, 0.7, 0.8, 0.9 for both to start with.
下一步是尝试不同的subsample和colsample_bytree取值。我们同样分两个阶段来做,先从0.6、0.7、0.8、0.9这几个取值开始。

param_test4 = {
 'subsample':[i/10.0 for i in range(6,10)],
 'colsample_bytree':[i/10.0 for i in range(6,10)]
}
gsearch4 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test4, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch4.fit(train[predictors],train[target])
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_

Here, we found 0.8 as the optimum value for both subsample and colsample_bytree. Now we should try values in 0.05 intervals around these.
在这里,我们发现0.8是subsample和colsample_bytree的最佳值。现在我们应该在这些值附近以0.05为间隔尝试取值。

param_test5 = {
 'subsample':[i/100.0 for i in range(75,90,5)],
 'colsample_bytree':[i/100.0 for i in range(75,90,5)]
}
gsearch5 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test5, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch5.fit(train[predictors],train[target])

Again we got the same values as before. Thus the optimum values are:
我们又得到了和以前一样的值。因此,最佳值为:

  • subsample: 0.8
  • colsample_bytree: 0.8

 

Step 5: Tuning Regularization Parameters
步骤5:调整正则化参数

The next step is to apply regularization to reduce overfitting. Many people don't use these parameters much, since gamma already provides a substantial way of controlling model complexity, but we should still try them. I'll tune the 'reg_alpha' value here and leave it up to you to try different values of 'reg_lambda'.
下一步是应用正则化来减少过拟合。虽然由于gamma已经提供了一种控制模型复杂度的有效手段,许多人不太使用这些参数,但我们仍然应该尝试一下。我将在这里调整'reg_alpha'的值,并把尝试不同'reg_lambda'取值的工作留给您。

param_test6 = {
 'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
 min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test6, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch6.fit(train[predictors],train[target])
gsearch6.grid_scores_, gsearch6.best_params_, gsearch6.best_score_

We can see that the CV score is lower than in the previous case. But the values tried are very widely spaced, so we should try values closer to the optimum here (0.01) to see if we get something better.
我们可以看到,这里的CV分数低于前一种情况。但这些尝试值的间隔太大,我们应该在此处的最佳值(0.01)附近再尝试一些取值,看看能否得到更好的结果。

param_test7 = {
 'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05]
}
gsearch7 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
 min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test7, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch7.fit(train[predictors],train[target])
gsearch7.grid_scores_, gsearch7.best_params_, gsearch7.best_score_

You can see that we got a better CV. Now we can apply this regularization in the model and look at the impact:
可以看到,我们得到了更好的CV分数。现在,我们可以在模型中应用这一正则化,并查看其影响:

xgb3 = XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=4,
 min_child_weight=6,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 reg_alpha=0.005,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb3, train, predictors)

Again we can see slight improvement in the score.
我们可以再次看到分数略有提高。
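If you also want to experiment with reg_lambda, as suggested at the start of this step, the same grid-search pattern applies. The sketch below is only an illustration: the candidate values are arbitrary starting points rather than tuned results, and best_params_/best_score_ are used instead of the older grid_scores_ attribute so it also runs on recent scikit-learn versions.

param_test_lambda = {
 'reg_lambda':[0.01, 0.1, 1, 10, 100]   # illustrative grid; widen or narrow as needed
}
gsearch_lambda = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8, reg_alpha=0.005,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
 param_grid = param_test_lambda, scoring='roc_auc', n_jobs=4, cv=5)
gsearch_lambda.fit(train[predictors], train[target])
gsearch_lambda.best_params_, gsearch_lambda.best_score_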
 

Step 6: Reducing Learning Rate
第6步:降低学习率

Lastly, we should lower the learning rate and add more trees. Let's use the cv function of XGBoost to do the job again.
最后,我们应该降低学习率并增加更多的树。让我们再次使用XGBoost的cv函数来完成这项工作。

xgb4 = XGBClassifier(
 learning_rate =0.01,
 n_estimators=5000,
 max_depth=4,
 min_child_weight=6,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 reg_alpha=0.005,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb4, train, predictors)

Now we can see a significant boost in performance and the effect of parameter tuning is clearer.
现在我们可以看到性能的显著提高,参数调整的效果也更加明显。

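As an aside (this is not part of the original article): when the learning rate is as low as 0.01 and n_estimators is set as high as 5000, re-running cross-validation can get expensive. One common alternative is to hold out a validation set and let early stopping choose the number of rounds. A minimal sketch, assuming an xgboost version before 2.0 (where early_stopping_rounds is still a fit() argument rather than a constructor argument):

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hold out 20% of the training data purely for early stopping.
X_tr, X_val, y_tr, y_val = train_test_split(
    train[predictors], train[target], test_size=0.2, random_state=27)

xgb_es = XGBClassifier(
    learning_rate=0.01, n_estimators=5000, max_depth=4, min_child_weight=6,
    gamma=0, subsample=0.8, colsample_bytree=0.8, reg_alpha=0.005,
    objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27)

# Stop adding trees once validation AUC has not improved for 50 rounds.
xgb_es.fit(X_tr, y_tr,
           eval_set=[(X_val, y_val)],
           eval_metric='auc',
           early_stopping_rounds=50,
           verbose=False)

print(xgb_es.best_iteration)  # number of boosting rounds that were actually useful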
As we come to the end, I would like to share 2 key thoughts:
最后,我想分享2个关键思想:

  1. It is difficult to get a very big leap in performance by just using parameter tuning or slightly better models. The max score for GBM was 0.8487 while XGBoost gave 0.8494. This is a decent improvement but not something very substantial.
    仅仅通过参数调优或稍好一些的模型,很难在性能上获得很大的飞跃。GBM的最高得分为0.8487,而XGBoost为0.8494。这是一个不错的提升,但并不算非常显著。
  2. A significant jump can be obtained by other methods like feature engineering, creating an ensemble of models, stacking, etc. (see the sketch after this list).
    通过特征工程、创建模型集成、堆叠(stacking)等其他方法,可以获得显著的提升(示例见下)。
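As a toy illustration of the second point (not from the original article), the simplest kind of ensemble just averages the predicted class probabilities of two models. The hyperparameters below are purely illustrative, and test is assumed to be a DataFrame holding the hackathon's test data with the same predictor columns:

from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

# Two illustrative base models; in practice you would use your tuned versions.
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                 max_depth=4, random_state=27)
xgb_model = XGBClassifier(learning_rate=0.01, n_estimators=1500, max_depth=4,
                          min_child_weight=6, gamma=0, subsample=0.8,
                          colsample_bytree=0.8, reg_alpha=0.005,
                          objective='binary:logistic', seed=27)

gbm.fit(train[predictors], train[target])
xgb_model.fit(train[predictors], train[target])

# Simple average of the class-1 probabilities; the weights could also be
# tuned on a held-out validation set.
blend = (gbm.predict_proba(test[predictors])[:, 1] +
         xgb_model.predict_proba(test[predictors])[:, 1]) / 2.0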

You can also download the iPython notebook with all these model codes from my GitHub account. For codes in R, you can refer to this article.
您也可以从我的GitHub帐户下载包含所有这些模型代码的iPython notebook。有关R语言的代码,请参阅这篇文章。

 

End Notes
尾注

This article was based on developing an XGBoost model end-to-end. We started by discussing why XGBoost has superior performance over GBM, followed by a detailed discussion of the various parameters involved. We also defined a generic function which you can re-use for building models.
本文围绕端到端地开发一个XGBoost模型展开。我们首先讨论了XGBoost为什么比GBM性能更优,然后详细讨论了所涉及的各种参数。我们还定义了一个可以复用的通用建模函数。

Finally, we discussed the general approach towards tackling a problem with XGBoost and also worked out the AV Data Hackathon 3.x problem through that approach.
最后,我们讨论了使用XGBoost解决问题的一般方法,并按照这一方法实际处理了AV Data Hackathon 3.x的问题。

I hope you found this useful and now feel more confident about applying XGBoost to solve a data science problem. You can try this out in our upcoming hackathons.
我希望这篇文章对您有所帮助,现在您在应用XGBoost解决数据科学问题时会更有信心。您可以在即将举行的黑客马拉松(hackathon)中尝试一下。

Did you like this article? Would you like to share some other hacks which you implement while making XGBoost models? Please feel free to drop a note in the comments below and I'll be glad to discuss.
你喜欢这篇文章吗?您是否愿意分享一些您在构建XGBoost模型时使用的其他技巧?欢迎在下面的评论中留言,我很乐意与您讨论。

You want to apply your analytical skills and test your potential? Then participate in our Hackathons and compete with Top Data Scientists from all over the world.
你想运用你的分析技能、检验你的潜力吗?那就参加我们的黑客马拉松,与来自世界各地的顶尖数据科学家同场竞技吧。