Stacking models in Python efficiently
Ensembles have rapidly become one of the hottest and most popular methods in applied machine learning. Virtually every winning Kaggle solution features them, and many data science pipelines have ensembles in them.
Put simply, ensembles combine predictions from different models to generate a final prediction, and as a rule of thumb, the more models we include the better the ensemble performs. Better still, because ensembles combine baseline predictions, they usually perform at least as well as the best baseline model. Ensembles give us a performance boost almost for free!
Example schematics of an ensemble. An input array $X$ is fed through two preprocessing pipelines and then to a set of base learners $f^{(i)}$. The ensemble combines all base learner predictions into a final prediction array $P$.
Source
In this post, we’ll take you through the basics of ensembles — what they are and why they work so well — and provide a hands-on tutorial for building basic ensembles. By the end of this post, you will:
- understand the fundamentals of ensembles
- know how to code them
- understand the main pitfalls and drawbacks of ensembles
Predicting Republican and Democratic donations
To illustrate how ensembles work, we’ll use a data set on U.S. political contributions. The original data set was prepared by Ben Wieder at FiveThirtyEight, who dug around the U.S. government’s political contribution registry and found that when scientists donate to politicians, it’s usually to Democrats.
This claim is based on the observed share of donations made to Republicans and Democrats. However, there’s plenty more that can be said: for instance, which scientific discipline is most likely to make a Republican donation, and which state is most likely to make Democratic donations? We will go one step further and predict whether a donation is most likely to go to a Republican or a Democrat.
The data we use here is slightly adapted. We remove any donations to party affiliations other than Democrat or Republican to make our exposition a little clearer and drop some duplicate and less interesting features. The data script can be found here. Here’s the data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### Import data
# Always good to set a seed for reproducibility
SEED = 222
np.random.seed(SEED)
df = pd.read_csv('input.csv')

### Training and test set
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def get_train_test(test_size=0.95):
    """Split Data into train and test sets."""
    # Binary target: 1 for Republican, 0 for Democrat
    y = 1 * (df.cand_pty_affiliation == "REP")
    X = df.drop(["cand_pty_affiliation"], axis=1)
    # One-hot encode the categorical features
    X = pd.get_dummies(X, sparse=True)
    # Drop constant columns, which carry no information
    X.drop(X.columns[X.std() == 0], axis=1, inplace=True)
    return train_test_split(X, y, test_size=test_size, random_state=SEED)

xtrain, xtest, ytrain, ytest = get_train_test()

# A look at the data
print("\nExample data:")
df.head()
 | cand_pty_affiliation | cand_office_st | cand_office | cand_status | rpt_tp | transaction_tp | entity_tp | state | classification | cycle | transaction_amt |
0 | REP | US | P | C | Q3 | 15 | IND | NY | Engineer | 2016.0 | 500.0 |
1 | DEM | US | P | C | M5 | 15E | IND | OR | Math-Stat | 2016.0 | 50.0 |
2 | DEM | US | P | C | M3 | 15 | IND | TX | Scientist | 2008.0 | 250.0 |
3 | DEM | US | P | C | Q2 | 15E | IND | IN | Math-Stat | 2016.0 | 250.0 |
4 | REP | US | P | C | 12G | 15 | IND | MA | Engineer | 2016.0 | 184.0 |
The figure above is the data underlying Ben’s claim. Indeed, between Democrats and Republicans, about 75% of all contributions are made to Democrats. Let’s go through the features at our disposal. We have data about the donor, the transaction, and the recipient:
To measure how well our models perform, we use the ROC-AUC score, which trades off having high precision and high recall (if these concepts are new to you, see the Wikipedia entry on precision and recall for a quick introduction). If you haven’t used this metric before, a random guess has a score of 0.5 and perfect recall and precision yields 1.0.
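To make the two reference points concrete, here is a minimal sketch using scikit-learn's `roc_auc_score` on a made-up label vector. The arrays are illustrative assumptions, not values from the donations data. A model that gives every observation the same score is uninformative and lands at 0.5, while scores that rank every positive above every negative reach 1.0.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels: 1 = Republican, 0 = Democrat
y_true = np.array([0, 0, 1, 1, 0, 1])

# A model that cannot separate the classes: the same score for everyone
constant_scores = np.full(y_true.shape, 0.5)

# A model that ranks every positive above every negative
perfect_scores = np.array([0.1, 0.2, 0.8, 0.9, 0.3, 0.7])

auc_constant = roc_auc_score(y_true, constant_scores)
auc_perfect = roc_auc_score(y_true, perfect_scores)

print("Uninformative model ROC-AUC: %.3f" % auc_constant)
print("Perfect ranking ROC-AUC: %.3f" % auc_perfect)
```

Note that the score depends only on how the observations are ranked, not on the probability values themselves.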
What is an ensemble?
Imagine that you are playing Trivial Pursuit. When you play alone, there might be some topics you are good at, and some that you know next to nothing about. If we want to maximize our Trivial Pursuit score, we need to build a team that covers all topics. This is the basic idea of an ensemble: combining predictions from several models averages out idiosyncratic errors and yields better overall predictions.
An important question is how to combine predictions. In our Trivial Pursuit example, it is easy to imagine that team members might make their case and a majority vote decides which answer to pick. Machine learning is remarkably similar in classification problems: taking the most common class label prediction is equivalent to a majority voting rule. But there are many other ways to combine predictions, and more generally we can use a model to learn how to best combine predictions.
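The majority voting rule can be sketched in a few lines of NumPy. The prediction matrix and the `majority_vote` helper below are hypothetical and serve only to illustrate the rule for binary labels.

```python
import numpy as np

# Hypothetical class-label predictions from three models (rows)
# for five observations (columns); 1 = Republican, 0 = Democrat
label_preds = np.array([
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1],
])


def majority_vote(preds):
    """Return the most common binary label per observation (column)."""
    n_models = preds.shape[0]
    votes_for_one = preds.sum(axis=0)
    # Label 1 wins when strictly more than half of the models vote for it
    return (2 * votes_for_one > n_models).astype(int)


ensemble_labels = majority_vote(label_preds)
print(ensemble_labels)
```

With an odd number of models there can be no ties. With an even number, this sketch resolves ties in favor of label 0, which is an arbitrary choice.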
Basic ensemble structure. Data is fed to a set of models, and a meta learner combines model predictions. Source
Understanding ensembles by combining decision trees
To illustrate the machinery of ensembles, we’ll start off with a simple interpretable model: a decision tree, which is a tree of if-then rules. If you’re unfamiliar with decision trees or would like to dive deeper, check out the decision trees course on Dataquest. The deeper the tree, the more complex the patterns it can capture, but the more prone to overfitting it will be. Because of this, we will need an alternative way of building complex models of decision trees, and an ensemble of different decision trees is one such way.
We’ll use the below helper function to visualize our decision rules:
import pydotplus  # you can install pydotplus with: pip install pydotplus
from IPython.display import Image
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz

def print_graph(clf, feature_names):
    """Print decision tree."""
    graph = export_graphviz(
        clf,
        label="root",
        proportion=True,
        impurity=False,
        out_file=None,
        feature_names=feature_names,
        # Class names in ascending label order: 0 = Democrat, 1 = Republican
        class_names=["D", "R"],
        filled=True,
        rounded=True
    )
    graph = pydotplus.graph_from_dot_data(graph)
    return Image(graph.create_png())
Let’s fit a decision tree with a single node (decision rule) on our training data and see how it performs on the test set:
Decision tree ROC-AUC score: 0.672
Each of the two leaves registers its share of training samples, the class distribution within that share, and the class label prediction. Our decision tree bases its prediction on whether the size of the contribution is above 101.5, but it makes the same prediction regardless! This is not too surprising given that 75% of all donations are to Democrats. But it’s not making use of the data we have. Let’s use three levels of decision rules and see what we can get:
t2 = DecisionTreeClassifier(max_depth=3, random_state=SEED)
t2.fit(xtrain, ytrain)
p = t2.predict_proba(xtest)[:, 1]
print("Decision tree ROC-AUC score: %.3f" % roc_auc_score(ytest, p))
print_graph(t2, xtrain.columns)
Decision tree ROC-AUC score: 0.751
This model is not much better than the simple decision tree: a measly 5% of all donations are predicted to go to Republicans, far short of the 25% we would expect. A closer look tells us that the decision tree uses some dubious splitting rules. A whopping 47.3% of all observations end up in the left-most leaf, while another 35.9% end up in the leaf second to the right. The vast majority of leaves are therefore irrelevant. Making the model deeper just causes it to overfit.
Fixing depth, a decision tree can be made more complex by increasing “width”, that is, creating several decision trees and combining them. In other words, an ensemble of decision trees. To see why such a model would help, consider how we may force a decision tree to investigate other patterns than those in the above tree. The simplest solution is to remove features that appear early in the tree. Suppose for instance that we remove the transaction amount feature (transaction_amt), the root of the tree. Our new decision tree would look like this:
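As a rough sketch of this step, the snippet below fits one tree on all features, looks up which feature sits at the root, drops it, and fits a second tree on the remaining columns. It runs on synthetic data, and the column names, the `X_demo`/`y_demo` variables and the label rule are made up for illustration. On the donations data the dropped column would be `transaction_amt`.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(222)

# Synthetic stand-in for the donations data (hypothetical columns and values)
n = 500
X_demo = pd.DataFrame({
    "transaction_amt": rng.exponential(200.0, size=n),
    "cycle": rng.choice([2008.0, 2012.0, 2016.0], size=n),
    "state_TX": rng.randint(0, 2, size=n),
})
y_demo = ((X_demo["transaction_amt"] > 150) & (X_demo["state_TX"] == 1)).astype(int)

# Tree fitted on all features
tree_full = DecisionTreeClassifier(max_depth=3, random_state=222)
tree_full.fit(X_demo, y_demo)

# Remove the feature used at the root and fit a second tree on the rest
root_feature = X_demo.columns[tree_full.tree_.feature[0]]
X_slim = X_demo.drop([root_feature], axis=1)
tree_slim = DecisionTreeClassifier(max_depth=3, random_state=222)
tree_slim.fit(X_slim, y_demo)

print("Root feature of the full tree:", root_feature)
print("Features left for the second tree:", list(X_slim.columns))
```

The second tree is forced to build its rules from the remaining features, which is what makes its errors differ from those of the first tree.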
Decision tree ROC-AUC score: 0.740
The ROC-AUC score is similar, but the share of Republican donations increased to 7.3%. Still too low, but higher than before. Importantly, in contrast to the first tree, where most of the rules related to the transaction itself, this tree is more focused on the residency of the candidate. We now have two models that by themselves have similar predictive power, but operate on different rules. Because of this, they are likely to make different prediction errors, which we can average out with an ensemble.
Interlude: why averaging predictions works
Why would we expect averaging predictions to work? Consider a toy example with two observations that we want to generate predictions for. The true label for the first observation is Republican, and the true label for the second observation is Democrat. In this toy example, suppose model 1 is prone to predicting Democrat while model 2 is prone to predicting Republican, as in the below table:
Model | Observation 1 | Observation 2 |
True label | R | D |
Model prediction: $P(R)$ | | |
Model 1 | 0.4 | 0.2 |
Model 2 | 0.8 | 0.6 |
If we use the standard 50% cutoff rule for making a class prediction, each model gets one observation right and one wrong. We create an ensemble by averaging the models’ class probabilities, which is a majority vote weighted by the strength (probability) of each model’s prediction. In our toy example, model 2 is confident in its prediction for observation 1, while model 1 is relatively uncertain. Weighting their predictions, the ensemble favors model 2 and correctly predicts Republican. For the second observation, the tables are turned and the ensemble correctly predicts Democrat:
Model | Observation 1 | Observation 2 |
True label | R | D |
Ensemble | 0.6 | 0.4 |
With more than two models, the ensemble predicts in accordance with the majority. For that reason, an ensemble that averages classifier predictions is known as a majority voting classifier. When an ensemble averages class probabilities (as above), we refer to it as soft voting; averaging final class label predictions is known as hard voting.
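The toy example can be reproduced directly. The sketch below uses the probabilities from the table above and compares soft voting with hard voting. The variable names are illustrative.

```python
import numpy as np

# P(R) from the toy table: rows are models, columns are observations
probs = np.array([
    [0.4, 0.2],   # model 1
    [0.8, 0.6],   # model 2
])
true_labels = np.array([1, 0])  # 1 = Republican, 0 = Democrat

# Soft voting: average the probabilities, then apply the 50% cutoff
soft_probs = probs.mean(axis=0)
soft_labels = (soft_probs >= 0.5).astype(int)

# Hard voting: apply the cutoff per model, then count votes
hard_votes = (probs >= 0.5).astype(int).sum(axis=0)

print("Soft-voting probabilities:", soft_probs)
print("Soft-voting labels:", soft_labels)
print("Votes for Republican per observation:", hard_votes)
```

With only two models, hard voting ends in a one-to-one tie on both observations. Soft voting breaks the tie by using how confident each model is.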
Of course, ensembles are no silver bullet. You might have noticed in our toy example that for averaging to work, prediction errors must be uncorrelated. If both models made incorrect predictions, the ensemble would not be able to make any corrections. Moreover, in the soft voting scheme, if one model makes an incorrect prediction with a high probability value, the ensemble would be overwhelmed. Generally, ensembles don’t get every observation right, but in expectation they will do better than the underlying models.
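One standard way to see why correlation matters is a textbook identity, which is not derived in this post. It rests on a simplifying assumption: each of the $n$ models has a prediction error $\varepsilon_i$ with the same variance $\sigma^2$, and every pair of errors has the same correlation $\rho$. The variance of the averaged error is then

$$\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n \varepsilon_i\right) = \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2.$$

With uncorrelated errors ($\rho = 0$) the variance shrinks as $\sigma^2 / n$, while with perfectly correlated errors ($\rho = 1$) averaging gives no reduction at all.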
A forest is an ensemble of trees
Returning to our prediction problem, let’s see if we can build an ensemble out of our two decision trees. We first check error correlation: highly correlated errors make for poor ensembles.
p1 = t2.predict_proba(xtest)[:, 1]
p2 = t3.predict_proba(xtest_slim)[:, 1]
pd.DataFrame({"full_data": p1,
              "red_data": p2}).corr()
 | full_data | red_data |
full_data | 1.000000 | 0.669128 |
red_data | 0.669128 | 1.000000 |
There is some correlation, but not overly so: there’s still a good deal of prediction variance to exploit. To build our first ensemble, we simply average the two models’ predictions.
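The averaging step itself is a one-liner. The sketch below uses small made-up arrays (`y_demo`, `p1_demo`, `p2_demo` are hypothetical) rather than the tree predictions, so that it runs on its own.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and P(R) predictions from two models
y_demo = np.array([1, 0, 1, 0, 1, 0])
p1_demo = np.array([0.9, 0.2, 0.4, 0.3, 0.8, 0.6])
p2_demo = np.array([0.6, 0.5, 0.7, 0.1, 0.3, 0.2])

# Simple average of the two models' probabilities
p_avg = np.mean([p1_demo, p2_demo], axis=0)

for name, p in [("model 1", p1_demo), ("model 2", p2_demo), ("average", p_avg)]:
    print("%s ROC-AUC score: %.3f" % (name, roc_auc_score(y_demo, p)))
```

In this constructed example each model ranks one Republican/Democrat pair in the wrong order, but not the same pair, so the averaged prediction ranks every pair correctly.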
Average of decision tree ROC-AUC score: 0.783
Indeed, the ensemble procedure leads to an increased score. But maybe if we had more diverse trees, we could get an even greater gain. How should we choose which features to exclude when designing the decision trees?
A fast approach that works well in practice is to randomly select a subset of features, fit one decision tree on each draw and average their predictions. When each tree is also fitted on a bootstrap sample of the training data, a process known as bootstrap aggregating (often abbreviated bagging), the resultant model is a Random Forest. Let’s see what a random forest can do for us. We use the Scikit-learn implementation and build an ensemble of 10 decision trees, each considering a random subset of 3 features at every split.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=10,
    max_features=3,
    random_state=SEED
)
rf.fit(xtrain, ytrain)
p = rf.predict_proba(xtest)[:, 1]
print("Average of decision tree ROC-AUC score: %.3f" % roc_auc_score(ytest, p))
Average of decision tree ROC-AUC score: 0.844
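To make the feature-subsetting idea concrete, the sketch below builds such an ensemble by hand: each tree sees its own random draw of 3 features, and the trees' probabilities are averaged. It is a simplified illustration on synthetic data and not what `RandomForestClassifier` does internally. The Scikit-learn implementation draws features at every split and also resamples the training rows. All names ending in `_demo` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED_DEMO = 222
rng = np.random.RandomState(SEED_DEMO)

# Synthetic binary classification data (a stand-in, not the donations data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=SEED_DEMO
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=SEED_DEMO
)

n_trees, n_sub_features = 10, 3
tree_preds = []
for _ in range(n_trees):
    # Draw a random subset of feature indices for this tree
    cols = rng.choice(X_tr.shape[1], size=n_sub_features, replace=False)
    tree = DecisionTreeClassifier(max_depth=3, random_state=SEED_DEMO)
    tree.fit(X_tr[:, cols], y_tr)
    tree_preds.append(tree.predict_proba(X_te[:, cols])[:, 1])

# Average the per-tree probabilities into one ensemble prediction
p_forest = np.mean(tree_preds, axis=0)
print("Ensemble prediction shape:", p_forest.shape)
```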
The Random Forest yields a significant improvement upon our previous models. We’re on to something! But there is only so much you can do with decision trees. It’s time we expand our horizon.
Ensembles as averaged predictions
Our foray into ensembles so far has shown us two important aspects of ensembles:
- The less correlation in prediction errors, the better
- The more models, the better
For this reason, it’s a good idea to use models that are as different as possible (as long as they perform decently). So far, we have relied on simple averaging, but later we will see how to use more complex combinations. To keep track of our progress, it is helpful to formalize our ensemble as $n$ models $f_i$ averaged into an ensemble $e$:
$$e(x) = \frac{1}{n} \sum_{i=1}^n f_i(x).$$
There’s no limitation on what models to include: decision trees, linear models, kernel-based models, non-parametric models, neural networks or even other ensembles! Keep in mind though that the more models we include, the slower the ensemble becomes.
To build an ensemble of various models, we begin by benchmarking a set of Scikit-learn classifiers on the dataset. To avoid repeating code, we use the below helper functions:
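The helper functions themselves are not reproduced in this text. The sketch below shows what `get_models`, `train_predict` and `score_models` might look like. The choice of models, their hyperparameters, and the explicit data arguments to `train_predict` are assumptions made so the example runs on its own with synthetic data. They are not the original definitions, which are called as `train_predict(models)`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

SEED_DEMO = 222


def get_models():
    """Return a dict of base learners (hypothetical choices and settings)."""
    return {
        "knn": KNeighborsClassifier(n_neighbors=3),
        "naive bayes": GaussianNB(),
        "random forest": RandomForestClassifier(
            n_estimators=10, max_features=3, random_state=SEED_DEMO
        ),
        "logistic": LogisticRegression(max_iter=1000, random_state=SEED_DEMO),
    }


def train_predict(models, xtrain, ytrain, xtest):
    """Fit every model and collect its P(class 1) in one column per model."""
    preds = {}
    for name, model in models.items():
        model.fit(xtrain, ytrain)
        preds[name] = model.predict_proba(xtest)[:, 1]
    return pd.DataFrame(preds)


def score_models(P, y):
    """Print and return the ROC-AUC score of every column in P."""
    scores = {name: roc_auc_score(y, P[name]) for name in P.columns}
    for name, score in scores.items():
        print("%-15s: %.3f" % (name, score))
    return scores


# Usage on synthetic data (a stand-in for the donations data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=SEED_DEMO
)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=SEED_DEMO
)
models_demo = get_models()
P_demo = train_predict(models_demo, x_tr, y_tr, x_te)
scores_demo = score_models(P_demo, y_te)
```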
We’re now ready to create a prediction matrix $P$, where each feature corresponds to the predictions made by a given model, and score each model against the test set:
models = get_models()
P = train_predict(models)
score_models(P, ytest)
Model | Score |
svm | 0.850 |
knn | 0.779 |
naive bayes | 0.803 |
mlp-nn | 0.851 |
random forest | 0.844 |
gbm | 0.878 |
logistic | 0.854 |
That’s our baseline. The Gradient Boosting Machine (GBM) does best, followed by a simple logistic regression. For our ensemble strategy to work, prediction errors must be relatively uncorrelated. Checking that this holds is our first order of business:
Errors are significantly correlated, which is to be expected for models that perform well, since it’s typically the outliers that are hard to get right. Yet most correlations are in the 50-80% span, so there is decent room for improvement. In fact, if we look at error correlations on a class prediction basis things look a bit more promising:
# corrmat is assumed to be the plotting helper from the ML-Ensemble library (pip install mlens)
from mlens.visualization import corrmat

corrmat(P.apply(lambda pred: 1*(pred >= 0.5) - ytest.values).corr(), inflate=False)
plt.show()
To create an ensemble, we proceed as before and average predictions, and as we might expect the ensemble outperforms the baseline. Averaging is a simple process, and if we store model predictions, we can start with a simple ensemble and increase its size on the fly as we train new models.
Ensemble ROC-AUC score: 0.884
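Growing the ensemble on the fly can be sketched with a running mean over stored predictions. The `RunningAverageEnsemble` class below is a hypothetical helper written for this illustration, and the prediction vectors are made up.

```python
import numpy as np


class RunningAverageEnsemble:
    """Keep a running mean of stored model predictions (illustrative helper)."""

    def __init__(self):
        self.n_models = 0
        self.mean_pred = None

    def add(self, preds):
        """Fold one more model's predictions into the ensemble average."""
        preds = np.asarray(preds, dtype=float)
        if self.mean_pred is None:
            self.mean_pred = preds.copy()
        else:
            # Incremental mean: new_mean = old_mean + (x - old_mean) / (n + 1)
            self.mean_pred += (preds - self.mean_pred) / (self.n_models + 1)
        self.n_models += 1
        return self.mean_pred


ens = RunningAverageEnsemble()
ens.add([0.2, 0.9, 0.4])
ens.add([0.4, 0.7, 0.6])
running = ens.add([0.6, 0.8, 0.2])

# The running mean matches averaging all stored predictions at once
batch = np.mean([[0.2, 0.9, 0.4], [0.4, 0.7, 0.6], [0.6, 0.8, 0.2]], axis=0)
print(running, batch)
```

Because only the current mean and the model count are kept, adding a newly trained model does not require revisiting earlier predictions.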
Visualizing how ensembles work
We’ve understood the power of ensembles as an error correction mechanism. This means that ensembles smooth out decision boundaries by averaging out irregularities. A decision boundary shows us how an estimator carves up feature space into neighborhoods within which all observations are predicted to have the same class label. By averaging out base learner decision boundaries, the ensemble is endowed with a smoother boundary that generalizes more naturally.
The figure below shows this in action. Here, the example is the Iris data set, where the estimators try to classify three types of flowers. The base learners all have some undesirable properties in their boundaries, but the ensemble has a relatively smooth decision boundary that aligns with observations. Amazingly, ensembles both increase model complexity and act as a regularizer!
Example decision boundaries for three models and an ensemble of the three. Source
Another way to understand what is going on in an ensemble when the task is classification is to inspect the Receiver Operating Characteristic (ROC) curve. This curve shows us how an estimator trades off precision and recall. Typically, different base learners make different trade-offs: some have higher precision by sacrificing recall, and others have higher recall by sacrificing precision.
A non-linear meta learner, on the other hand, is able to, for each training point, adjust which models it relies on. This means that it can significantly reduce necessary sacrifices and retain high precision while increasing recall (or vice versa). In the figure below, the ensemble is making a much smaller sacrifice in precision to increase recall (its ROC curve lies closer to the top-left, “northwest”, corner).
from sklearn.metrics import roc_curve

def plot_roc_curve(ytest, P_base_learners, P_ensemble, labels, ens_label):
    """Plot the roc curve for base learners and ensemble."""
    plt.figure(figsize=(10, 8))
    # Diagonal reference line: the ROC curve of a random guess
    plt.plot([0, 1], [0, 1], 'k--')

    cm = [plt.cm.rainbow(i)
          for i in np.linspace(0, 1.0, P_base_learners.shape[1] + 1)]

    # One curve per base learner
    for i in range(P_base_learners.shape[1]):
        p = P_base_learners[:, i]
        fpr, tpr, _ = roc_curve(ytest, p)
        plt.plot(fpr, tpr, label=labels[i], c=cm[i + 1])

    # Curve for the ensemble
    fpr, tpr, _ = roc_curve(ytest, P_ensemble)
    plt.plot(fpr, tpr, label=ens_label, c=cm[0])

    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC curve')
    plt.legend(frameon=False)
    plt.show()

plot_roc_curve(ytest, P.values, P.mean(axis=1), list(P.columns), "ensemble")
Beyond ensembles as a simple average
But wouldn’t you expect more of a boost given the variation in prediction errors? Well, one thing is a bit nagging. Some of the models perform considerably worse than others, yet their influence is just as large as that of better performing models. This can be quite devastating with unbalanced data sets: recall that with soft voting, if a model makes an extreme prediction (i.e., close to 0 or 1), that prediction has a strong pull on the prediction average.
An important factor for us is whether models are able to capture the full share of Republican donations. A simple check shows that all models underrepresent Republican donations, but some are considerably worse than others.
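Such a check could look like the sketch below, which compares the share of observations each model labels Republican at the 50% cutoff with the true share. The prediction matrix, the model names and the labels are made up for illustration.

```python
import numpy as np

# Hypothetical prediction matrix: rows are donations, columns are models
model_names = ["model_a", "model_b", "model_c"]
P_demo = np.array([
    [0.9, 0.6, 0.4],
    [0.2, 0.1, 0.1],
    [0.7, 0.4, 0.3],
    [0.1, 0.2, 0.2],
])
y_demo = np.array([1, 0, 1, 0])  # 1 = Republican

# Share of observations each model labels Republican at the 50% cutoff
predicted_share = (P_demo >= 0.5).mean(axis=0)
true_share = y_demo.mean()

for name, share in zip(model_names, predicted_share):
    print("%s predicts %.0f%% Republican (true share: %.0f%%)"
          % (name, 100 * share, 100 * true_share))
```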
We can try to improve the ensemble by removing the worst offender, say the Multi-Layer Perceptron (MLP):
include = [c for c in P.columns if c not in ["mlp-nn"]]
print("Truncated ensemble ROC-AUC score: %.3f" % roc_auc_score(ytest, P.loc[:, include].mean(axis=1)))
Truncated ensemble ROC-AUC score: 0.883
Not really an improvement: we need a smarter way of prioritizing between models. Clearly, removing models from an ensemble is rather drastic as there may be instances where the removed model carried important information. What we really want is to learn a sensible set of weights to use when averaging predictions. This turns the ensemble into a parametric model that needs to be trained.
Learning to combine predictions
Learning a weighted average means that for each model $f_i$, we have a weight parameter $\omega_i \in (0, 1)$ that assigns a weight to that model’s predictions. Weighted averaging requires all weights to sum to 1. The ensemble is now defined as
$$e(x) = \sum_{i=1}^n \omega_i f_i(x).$$
This is a minor change from our previous definition, but is interesting since, once the models have generated predictions $p_i = f_i(x)$, learning the weights is the same as fitting a linear regression on those predictions:
$$e(p_1, \dots, p_n) = \omega_1 p_1 + \dots + \omega_n p_n,$$
with some constraints on the weights. Then again, there is no reason to restrict ourselves to fitting just a linear model. Suppose instead that we fit a nearest neighbor model. The ensemble would then take local averages based on the nearest neighbors of a given observation, empowering the ensemble to adapt to changes in model performance as the input varies.
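For the linear case, one possible way to learn the weights is constrained least squares. The sketch below uses SciPy's `minimize` with the SLSQP solver, which is a choice made for this illustration and not a method prescribed by the text. It keeps each weight between 0 and 1 and forces the weights to sum to 1. The two models and the targets are synthetic.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(222)

# Synthetic targets and two hypothetical models: one accurate, one noisy
y_demo = rng.randint(0, 2, size=500).astype(float)
p_good = np.clip(y_demo + rng.normal(0.0, 0.2, size=500), 0.0, 1.0)
p_noisy = np.clip(y_demo + rng.normal(0.0, 0.6, size=500), 0.0, 1.0)
P_demo = np.column_stack([p_good, p_noisy])


def squared_error(w):
    """Mean squared error of the weighted-average ensemble."""
    return np.mean((P_demo @ w - y_demo) ** 2)


n_models = P_demo.shape[1]
result = minimize(
    squared_error,
    x0=np.full(n_models, 1.0 / n_models),   # start from the simple average
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n_models,          # each weight between 0 and 1
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # sum to 1
)
weights = result.x
print("Learned weights:", np.round(weights, 3))
```

The more accurate model should receive the larger weight, which is the behavior the simple average lacks.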
Implementing an ensemble
To build this type of ensemble, we need three things:
- a library of base learners that generate predictions
- a meta learner that learns how to best combine these predictions
- a method for splitting the training data between the base learners and the meta learner.
Base learners are the ingoing models that take the original input and generate a set of predictions. If we have an original data set ordered as a matrix $X$ of shape (n_samples, n_features), the library of base learners outputs a new prediction matrix $P_{\text{base}}$ of size (n_samples, n_base_learners), where each column represents the predictions made by one of the base learners. The meta learner is trained on $P_{\text{base}}$.
This means that it is absolutely crucial to handle the training set $X$ in an appropriate way. In particular, if we both train the base learners on $X$ and have them predict $X$, the meta learner will be training on the base learners’ training errors, but at test time it will face their test errors.
We need a strategy for generating a prediction matrix $P$ that reflects test errors. The simplest strategy is to split the full data set $X$ in two: train the base learners on one half and have them predict the other half, which then becomes the input to the meta learner. While this approach is simple and relatively fast, we lose quite a bit of data. For small and medium sized data sets, the loss of information can be severe, causing the base learners and the meta learner to perform poorly.
To ensure the full data set is covered, we can use cross-validation, a method initially developed for validating test-set performance during model selection. There are many ways to perform cross-validation, and before we delve into that, let’s get a feel for this type of ensemble by implementing one ourselves, step by step.
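As a preview of the cross-validation route, scikit-learn's `cross_val_predict` can produce out-of-fold predictions for every training row. The sketch below is an illustration on synthetic data, not the step-by-step implementation that follows. The two base learners and the 5-fold setting are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

SEED_DEMO = 222

# Synthetic training data (a stand-in for the donations data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=SEED_DEMO
)

base_learners_demo = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=SEED_DEMO),
}

# Each row is predicted by a model that did not see that row during fitting
P_oof = np.column_stack([
    cross_val_predict(model, X_demo, y_demo, cv=5, method="predict_proba")[:, 1]
    for model in base_learners_demo.values()
])
print("Out-of-fold prediction matrix shape:", P_oof.shape)
```

Every training row ends up with a prediction, so the meta learner can be trained on the full data set without seeing the base learners' training errors.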
Step 1: define a library of base learners
These are the models that take the raw input data and generate predictions, and they can be anything from linear regression to a neural network to another ensemble. As always, there’s strength in diversity! The only thing to consider is that the more models we add, the slower the ensemble will be. Here, we’ll use our set of models from before:
步骤2:定义元学习器 (Step 2: define a meta learner)
Which meta learner to use is not obvious, but popular choices are linear models, kernel-based models (SVMs and KNNs) and decision-tree-based models. But you could also use another ensemble as “meta learner”: in this special case, you end up with a two-layer ensemble, akin to a feed-forward neural network.
Here, we’ll use a Gradient Boosting Machine. To ensure the GBM explores local patterns, we restrict each of the 1000 decision trees to train on a random 50% of the input data and to consider a random subset of 4 base learners at each split. This way, the GBM will be exposed to each base learner’s strength in different neighborhoods of the input space.
meta_learner = GradientBoostingClassifier(
    n_estimators=1000,
    loss="exponential",
    max_features=4,
    max_depth=3,
    subsample=0.5,
    learning_rate=0.005,
    random_state=SEED
)
Step 3: define a procedure for generating train and test sets
To keep things simple, we split the full training set into a training set and a prediction set for the base learners. This method is sometimes referred to as Blending. Unfortunately, the terminology differs between communities, so it’s not always easy to know what type of cross-validation the ensemble is using.
We now have one training set for the base learners $(X_{\text{train\_base}}, y_{\text{train\_base}})$ and one prediction set $(X_{\text{pred\_base}}, y_{\text{pred\_base}})$ and are ready to generate the prediction matrix for the meta learner.
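One way to produce such a split is scikit-learn's `train_test_split`. The sketch below uses synthetic data in place of the contributions data. The names `xtrain_base` and `xpred_base` are assumptions chosen to match `ypred_base`, which appears later in this post, and the `SEED` value is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

SEED = 222  # hypothetical seed value

# Synthetic stand-in for the full training set (an assumption)
xtrain, ytrain = make_classification(
    n_samples=400, n_features=10, random_state=SEED)

# Half of the rows train the base learners; the other half is predicted
# by them and later used to train the meta learner
xtrain_base, xpred_base, ytrain_base, ypred_base = train_test_split(
    xtrain, ytrain, test_size=0.5, random_state=SEED)

print(xtrain_base.shape, xpred_base.shape)
```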
Step 4: train the base learners on a training set
To train the library of base learners on the base-learner training data, we proceed as usual:
def train_base_learners(base_learners, inp, out, verbose=True):
    """Train all base learners in the library."""
    if verbose: print("Fitting models.")
    for i, (name, m) in enumerate(base_learners.items()):
        if verbose: print("%s..." % name, end=" ", flush=False)
        m.fit(inp, out)
        if verbose: print("done")
To train the base learners, execute
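Assuming the split from Step 3 is stored in `xtrain_base` and `ytrain_base` (names inferred from `ypred_base` used later, so treat them as assumptions), the call might look like the sketch below. The two-model library and the synthetic data are stand-ins so that the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def train_base_learners(base_learners, inp, out, verbose=True):
    """Same routine as above, repeated so this snippet is self-contained."""
    if verbose: print("Fitting models.")
    for name, m in base_learners.items():
        if verbose: print("%s..." % name, end=" ")
        m.fit(inp, out)
        if verbose: print("done")


# Stand-in library and data (assumptions for illustration)
base_learners = {
    "naive bayes": GaussianNB(),
    "logistic": LogisticRegression(max_iter=1000),
}
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
xtrain_base, xpred_base, ytrain_base, ypred_base = train_test_split(
    X, y, test_size=0.5, random_state=0)

# The base learners only ever see the base training half
train_base_learners(base_learners, xtrain_base, ytrain_base)
```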
Step 5: generate base learner predictions
With the base learners fitted, we can now generate a set of predictions for the meta learner to train on. Note that we generate predictions for observations not used to train the base learners. For each observation $x_{\text{pred}}^{(i)} \in X_{\text{pred\_base}}$ in the base learner prediction set, we generate a set of base learner predictions:
$$p_{\text{base}}^{(i)} = \left( f_1(x_{\text{pred}}^{(i)}), \ldots, f_n(x_{\text{pred}}^{(i)}) \right).$$
If you implement your own ensemble, pay special attention to how you index the rows and columns of the prediction matrix. When we split the data in two, this is not so hard, but with cross-validation things are more challenging.
def predict_base_learners(pred_base_learners, inp, verbose=True):
    """Generate a prediction matrix."""
    P = np.zeros((inp.shape[0], len(pred_base_learners)))
    if verbose: print("Generating base learner predictions.")
    for i, (name, m) in enumerate(pred_base_learners.items()):
        if verbose: print("%s..." % name, end=" ", flush=False)
        p = m.predict_proba(inp)
        # With two classes, need only predictions for one class
        P[:, i] = p[:, 1]
        if verbose: print("done")
    return P
To generate predictions, execute
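Assuming the prediction half of the split is stored in `xpred_base` (an inferred name), the prediction matrix `P_base` used in the next step could be built as sketched below. The compact helper, the two-model library and the synthetic data are stand-ins so that the snippet runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def predict_base_learners(pred_base_learners, inp):
    """Compact, non-verbose variant of the function above."""
    P = np.zeros((inp.shape[0], len(pred_base_learners)))
    for i, m in enumerate(pred_base_learners.values()):
        P[:, i] = m.predict_proba(inp)[:, 1]
    return P


# Stand-in data and library (assumptions for illustration)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
xtrain_base, xpred_base, ytrain_base, ypred_base = train_test_split(
    X, y, test_size=0.5, random_state=0)
base_learners = {
    "naive bayes": GaussianNB().fit(xtrain_base, ytrain_base),
    "logistic": LogisticRegression(max_iter=1000).fit(
        xtrain_base, ytrain_base),
}

# One row per held-out observation, one column per base learner
P_base = predict_base_learners(base_learners, xpred_base)
print(P_base.shape)
```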
Step 6: train the meta learner
The prediction matrix $P_{\text{base}}$ reflects test-time performance and can be used to train the meta learner:
meta_learner.fit(P_base, ypred_base)
That’s it! We now have a fully trained ensemble that can be used to predict new data. To generate a prediction for some observation $x^{(j)}$, we first feed it to the base learners. These output a set of predictions
$$p_{\text{base}}^{(j)} = \left( f_1(x^{(j)}), \ldots, f_n(x^{(j)}) \right)$$
that we feed to the meta learner. The meta learner then gives us the ensemble’s final prediction
$$p^{(j)} = m\left(p_{\text{base}}^{(j)} \right).$$
Now that we have a firm understanding of ensemble learning, it’s time to see what it can do to improve our prediction performance on the political contributions data set:
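The two equations above translate into a short helper. The sketch below is one plausible shape for `ensemble_predict`, inferred from how it is called in this post (it returns the base learner prediction matrix and the final prediction). The body, the small library and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def predict_base_learners(base_learners, inp):
    """Compact variant of the prediction-matrix function from Step 5."""
    P = np.zeros((inp.shape[0], len(base_learners)))
    for i, m in enumerate(base_learners.values()):
        P[:, i] = m.predict_proba(inp)[:, 1]
    return P


def ensemble_predict(base_learners, meta_learner, inp, verbose=True):
    """Return the base learner predictions and the ensemble prediction."""
    if verbose: print("Generating ensemble predictions.")
    P_pred = predict_base_learners(base_learners, inp)  # p_base^(j)
    return P_pred, meta_learner.predict_proba(P_pred)[:, 1]  # m(p_base^(j))


# Blend ensemble on synthetic stand-in data
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_fit, X_new, y_fit, y_new = train_test_split(
    X, y, test_size=0.33, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(
    X_fit, y_fit, test_size=0.5, random_state=0)

demo_learners = {
    "naive bayes": GaussianNB().fit(X_a, y_a),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_a, y_a),
}
demo_meta = LogisticRegression().fit(
    predict_base_learners(demo_learners, X_b), y_b)

P_demo, p_demo = ensemble_predict(
    demo_learners, demo_meta, X_new, verbose=False)
```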
To generate predictions, execute
P_pred, p = ensemble_predict(base_learners, meta_learner, xtest)
print("\nEnsemble ROC-AUC score: %.3f" % roc_auc_score(ytest, p))
Ensemble ROC-AUC score: 0.881
As expected, the ensemble beats the best estimator from our previous benchmark, but it doesn’t beat the simple average ensemble. That’s because we trained the base learners and the meta learner on only half the data, so a lot of information is lost. To prevent this, we need to use a cross-validation strategy.
Training with cross-validation
During cross-validated training of the base learners, a copy of each base learner is fitted on $K-1$ folds and predicts the left-out fold. This process is iterated until every fold has been predicted. The more folds we specify, the less data is left out in each training pass. This makes cross-validated predictions less noisy and a better reflection of performance during test time. The cost is significantly increased training time. Fitting an ensemble with cross-validation is often referred to as stacking, while the ensemble itself is known as the Super Learner.
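For a single base learner, scikit-learn's `cross_val_predict` performs exactly this fit-on-$K-1$-folds, predict-the-left-out-fold loop. The sketch below uses synthetic data and a logistic regression as stand-ins; it illustrates the idea and is not the routine used to produce the scores in this post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Every row is predicted by a copy of the model that never saw that row
p_cv = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y,
    cv=KFold(5), method="predict_proba")[:, 1]

print(p_cv.shape)
```

Repeating this for every base learner and placing the results side by side gives the meta learner's training matrix.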
To understand how cross-validation works, we can think of it as an outer loop over our previous ensemble. The outer loop iterates over $K$ distinct test folds, with the remaining data used for training. The inner loop trains the base learners and generates predictions for the held-out data. Here’s a simple stacking implementation:
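A minimal sketch of such a routine follows. The signature matches how `stacking` is called later in this post, but the body is an assumption: in particular, it writes each fold's predictions directly into the matching rows of the prediction matrix so that rows stay aligned with the labels. The two-model library and synthetic data are stand-ins.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def stacking(base_learners, meta_learner, X, y, generator):
    """Fit base learners on all data and the meta learner on CV predictions."""
    # Final base learners, used at test time, see all data
    for m in base_learners.values():
        m.fit(X, y)

    # One row per observation, one column per base learner
    P = np.zeros((X.shape[0], len(base_learners)))

    # Outer loop: iterate over the K held-out folds
    for train_idx, test_idx in generator.split(X):
        # Inner loop: fit a copy of each base learner on the other folds
        # and predict the held-out fold
        for j, m in enumerate(base_learners.values()):
            fold_model = clone(m).fit(X[train_idx], y[train_idx])
            P[test_idx, j] = fold_model.predict_proba(X[test_idx])[:, 1]

    # The meta learner trains on cross-validated predictions for every row
    meta_learner.fit(P, y)
    return base_learners, meta_learner


# Stand-in data and library (assumptions for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, random_state=0)
library = {
    "naive bayes": GaussianNB(),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}
fitted_learners, fitted_meta = stacking(
    library, LogisticRegression(), X_demo, y_demo, KFold(2))
```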
Let’s go over the steps involved here. First, we fit our final base learners on all data: in contrast with our previous blend ensemble, base learners used at test time are trained on all available data. We then loop over all folds, then loop over all base learners to generate cross-validated predictions. These predictions are stacked to build the training set for the meta learner, which also sees all data.
The basic difference between blending and stacking is therefore that stacking allows both base learners and the meta learner to train on the full data set. Using 2-fold cross-validation, we can measure the difference this makes in our case:
from sklearn.base import clone
from sklearn.model_selection import KFold

# Train with stacking
cv_base_learners, cv_meta_learner = stacking(
    get_models(), clone(meta_learner), xtrain.values, ytrain.values, KFold(2))

P_pred, p = ensemble_predict(cv_base_learners, cv_meta_learner, xtest, verbose=False)
print("\nEnsemble ROC-AUC score: %.3f" % roc_auc_score(ytest, p))
Ensemble ROC-AUC score: 0.889
Stacking yields a sizeable increase in performance: in fact, it gives us our best score so far. This outcome is typical for small and medium-sized data sets, where the data lost to blending can hurt considerably. As the data set size increases, blending and stacking perform similarly.
Stacking comes with its own set of shortcomings, particularly speed. In general, we need to be aware of three important issues when it comes to implementing ensembles with cross-validation:
- Computational complexity
- Structural complexity (risk of information leakage)
- Memory consumption
It’s important to understand these in order to work with ensembles efficiently, so let’s go through each in turn.
1. Computational complexity
Suppose we want to use 10 folds for stacking. This would require training all base learners 10 times on 90% of the data, and once on all data. With 4 base learners, the ensemble would roughly be 40 times slower than using the best base learner.
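The arithmetic behind that estimate can be written out as a small helper. It simply counts fits and, as a simplifying assumption, ignores that cross-validated fits see slightly less data and that base learners differ in cost. The function name is hypothetical.

```python
def n_fits(n_base_learners, n_folds):
    """Count the base learner fits needed to train a stacked ensemble."""
    cv_fits = n_base_learners * n_folds  # one fit per learner per fold
    final_fits = n_base_learners         # one fit per learner on all data
    return cv_fits + final_fits


print(n_fits(4, 10))  # 44
```

With 4 base learners and 10 folds this gives 44 fits, which is where the figure of roughly 40 times the cost of a single base learner comes from.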
But each cross-validation fit is independent, so we don’t need to fit models sequentially. If we could fit all folds in parallel, the ensemble would only be roughly 4 times slower than the best base learner, a dramatic improvement. Ensembles are prime candidates for parallelization, and it is critical to leverage this capability to the greatest extent possible. If we fit all folds for all models in parallel, the time penalty for the ensemble would be negligible. To drive this point home, below is a benchmark from ML-Ensemble that shows the time it takes to fit an ensemble via stacking or blending either sequentially or in parallel on 4 threads.
Even with this moderate degree of parallelism, we can realize a sizeable reduction in computation time. But parallelization is associated with a whole host of potentially thorny issues such as race conditions, deadlocks and memory explosion.
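As a sketch of what fitting folds in parallel can look like, the snippet below dispatches one job per (fold, model) pair with `joblib`. The thread-based backend, the two stand-in models, the helper name and the synthetic data are all assumptions made for illustration; this is not how ML-Ensemble is implemented.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB


def fit_and_predict(model, X, y, train_idx, test_idx):
    """Fit a fresh copy on the training folds and predict the held-out fold."""
    fitted = clone(model).fit(X[train_idx], y[train_idx])
    return fitted.predict_proba(X[test_idx])[:, 1]


X, y = make_classification(n_samples=400, n_features=10, random_state=0)
models = [GaussianNB(), LogisticRegression(max_iter=1000)]
folds = list(KFold(4).split(X))

# Remember which column and which rows each job is responsible for
slots = [(j, test_idx) for _, test_idx in folds for j in range(len(models))]

# Each (fold, model) fit is independent, so the jobs can run concurrently
results = Parallel(n_jobs=4, prefer="threads")(
    delayed(fit_and_predict)(m, X, y, train_idx, test_idx)
    for train_idx, test_idx in folds
    for m in models)

# Results come back in submission order; write them into the right slots
P = np.zeros((X.shape[0], len(models)))
for (j, test_idx), preds in zip(slots, results):
    P[test_idx, j] = preds
```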
2. Structural complexity
Once we decide to use the entire training set to train the meta learner, we must worry about information leakage. This phenomenon arises when we mistakenly predict samples that were used during training, for instance by mixing up our folds or using a model trained on the wrong subset. When there’s information leakage in the training set of the meta learner, it will not learn to properly correct for base learner prediction errors: garbage in, garbage out. Spotting such bugs, though, is extremely difficult.
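Some of these bugs can be caught with cheap sanity checks on the fold indices. The sketch below checks two properties on a hypothetical 100-row data set: no row is both trained on and predicted within a fold, and every row is predicted exactly once overall. It only guards against index mix-ups; it cannot detect a prediction made with the wrong fitted model.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 100  # hypothetical data set size
times_predicted = np.zeros(n_rows, dtype=int)
placeholder_X = np.zeros((n_rows, 1))  # only the number of rows matters here

generator = KFold(5, shuffle=True, random_state=0)
for train_idx, test_idx in generator.split(placeholder_X):
    # A row used for fitting must never be predicted in the same fold
    assert np.intersect1d(train_idx, test_idx).size == 0
    times_predicted[test_idx] += 1

# Across all folds, each row should be predicted exactly once
assert (times_predicted == 1).all()
print("Fold indices look consistent.")
```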
3. Memory consumption
The final issue arises with parallelization, especially by multi-processing as is often the case in Python. In this case, each sub-process has its own memory and therefore needs to copy all data from the parent process. A naive implementation will therefore copy all data to all processes, eating up memory and wasting time on data serialization. Preventing this requires sharing data memory, which in turn can easily cause data corruption.
Upshot: use packages
The upshot is that you should use a unit-tested package and focus on building your machine learning pipeline. In fact, once you’ve settled on an ensemble package, building ensembles becomes really easy: all you need to do is specify the base learners, the meta learner, and a method for training the ensemble.
Fortunately, there are many packages available in all popular programming languages, though they come in different flavors. At the end of this post, we list some as reference. For now, let’s pick one and see how a stacked ensemble does on our political contributions data set. Here, we use ML-Ensemble and build our previous generalized ensemble, but now using 10-fold cross-validation:
Fitting 2 layers
Processing layer-1 done | 00:02:03
Processing layer-2 done | 00:00:03
Fit complete | 00:02:08
Predicting 2 layers
Processing layer-1 done | 00:00:50
Processing layer-2 done | 00:00:02
Predict complete | 00:00:54
Super Learner ROC-AUC score: 0.890
It’s as simple as that!
Inspecting the ROC-curve of the super learner against the simple average ensemble reveals how leveraging the full data enables the super learner to sacrifice less recall for a given level of precision.
Where to go from here
There are many other types of ensembles than those presented here. However, the basic ingredients are always the same: a library of base learners, a meta learner, and a training procedure. By playing around with these components, various specialized forms of ensembles can be created. A good starting point for more advanced material on ensemble learning is this excellent post by mlware.
When it comes to software, it’s a matter of taste. As the popularity of ensembles has risen, so has the number of packages available. Ensembles were traditionally developed in the statistics community, so R has had a lead on purpose-built libraries. Several packages have recently been developed in Python and other languages, with more on the way. Each package caters to different needs and is at a different stage of maturity, so we recommend shopping around until you find what you are looking for.
Here are a few packages to get you started:
Language | Name | Comment
Python | ML-Ensemble | Purpose-built ensemble learning package: deep ensemble learning
Python | Scikit-learn | Bagging, majority voting classifiers. API for basic stacking in development
Python | mlxtend | Regression and classification ensembles without cross-validation
R | SuperLearner | Super Learner ensembles
R | Subsemble | Subsembles
R | caretEnsemble | Ensembles of Caret estimators
Multiple | H2O | Distributed stacked ensemble learning. Limited to estimators in the H2O library
Java | StackNet | Empowered by H2O
Web-based | xcessiv | Web-based ensemble learning
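The scikit-learn entry in the table describes its stacking API as still in development. To our knowledge that API has since been released as `StackingClassifier` and `StackingRegressor` (from version 0.22 onward). The sketch below shows the same three ingredients, base learners, meta learner and cross-validation scheme, using that class. The models and synthetic data are stand-ins, and this is a different package from the ML-Ensemble example above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_fit, X_new, y_fit, y_new = train_test_split(
    X, y, test_size=0.33, random_state=0)

ensemble = StackingClassifier(
    # Library of base learners
    estimators=[
        ("nb", GaussianNB()),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    # Meta learner
    final_estimator=LogisticRegression(),
    # Training procedure: 5-fold cross-validated predictions
    cv=5,
    stack_method="predict_proba",
)
ensemble.fit(X_fit, y_fit)

p_stack = ensemble.predict_proba(X_new)[:, 1]
print("Stacked ROC-AUC score: %.3f" % roc_auc_score(y_new, p_stack))
```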
Translated from: https://www.pybloggers.com/2018/01/introduction-to-python-ensembles/