Computing Feature Importance with a Random Forest



The feature importance describes which features are relevant. It can help you better understand the problem being solved and sometimes lead to model improvements through feature selection. In this post, I will present 3 ways (with code) to compute feature importance for the Random Forest algorithm from the scikit-learn package (in Python).


Built-in Random Forest Importance

The Random Forest algorithm has built-in feature importance which can be computed in two ways:


  • Gini importance (or mean decrease in impurity), which is computed from the Random Forest structure. Let's look at how the Random Forest is constructed. It is a set of Decision Trees. Each Decision Tree is a set of internal nodes and leaves. In an internal node, the selected feature is used to make a decision on how to divide the data set into two separate sets with similar responses within. The features for internal nodes are selected with some criterion, which for classification tasks can be Gini impurity or information gain, and for regression is variance reduction. We can measure how each feature decreases the impurity of the split (the feature with the highest decrease is selected for the internal node). For each feature, we can collect how much, on average, it decreases the impurity. The average over all trees in the forest is the measure of the feature importance (a short sketch of this per-tree averaging follows this list). This method is available in the scikit-learn implementation of the Random Forest (for both the classifier and the regressor). It is worth mentioning that with this method we should look at the relative values of the computed importances. The biggest advantage of this method is the speed of computation: all needed values are computed during the Random Forest training. The drawback of the method is its tendency to prefer (select as important) numerical features and categorical features with high cardinality. What is more, in the case of correlated features it can select one of the features and neglect the importance of the second one (which can lead to wrong conclusions).
  • Mean Decrease Accuracy, which is a method of computing the feature importance on permuted out-of-bag (OOB) samples based on the mean decrease in accuracy. This method is not implemented in the scikit-learn package. Very similar to it is the permutation-based importance described later in this post.
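
To make the per-tree averaging behind the Gini importance concrete, here is a minimal sketch. It assumes a fitted RandomForestRegressor named rf (one is trained below); the helper variable names are mine, not part of scikit-learn.

import numpy as np
# Each fitted tree exposes its own impurity-based importances, and the
# forest-level value is (up to normalization) their average over all trees.
per_tree_importances = np.array([tree.feature_importances_ for tree in rf.estimators_])
manual_importances = per_tree_importances.mean(axis=0)
# Should closely match the attribute scikit-learn exposes directly:
print(np.allclose(manual_importances, rf.feature_importances_))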

I will show how to compute feature importance for the Random Forest with the scikit-learn package and the Boston dataset (house price regression task).


# Let's load the packages
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
import shap
from matplotlib import pyplot as plt
plt.rcParams.update({'figure.figsize': (12.0, 8.0)})
plt.rcParams.update({'font.size': 14})

Load the data set and split it for training and testing.


boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)

Fit the Random Forest Regressor with 100 Decision Trees:


rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)

To get the feature importances from the Random Forest model, use the feature_importances_ attribute:


rf.feature_importances_
array([0.04054781, 0.00149293, 0.00576977, 0.00071805, 0.02944643,
       0.25261155, 0.01969354, 0.05781783, 0.0050257 , 0.01615872,
       0.01066154, 0.01185997, 0.54819617])

Let's plot the importances (a chart will be easier to interpret than raw values).


plt.barh(boston.feature_names, rf.feature_importances_)

To have an even better chart, let’s sort the features, and plot again:


sorted_idx = rf.feature_importances_.argsort()
plt.barh(boston.feature_names[sorted_idx], rf.feature_importances_[sorted_idx])
plt.xlabel("Random Forest Feature Importance")




[Figure: sorted bar chart of the built-in Random Forest feature importances]

Permutation-based Importance

The permutation-based importance can be used to overcome drawbacks of the default feature importance computed with mean impurity decrease. It is implemented in scikit-learn as the permutation_importance method. As arguments it requires a trained model (which can be any model compatible with the scikit-learn API) and validation (test) data. The method randomly shuffles each feature and computes the change in the model's performance. The features which impact the performance the most are the most important ones.
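
For intuition, here is a hand-rolled sketch of the permutation procedure on the test set (assuming the rf, X_test and y_test objects defined above). Unlike scikit-learn's implementation, it shuffles each feature only once instead of averaging over several repeats.

baseline_score = rf.score(X_test, y_test)  # R^2 on the untouched test set
manual_perm_importance = {}
rng = np.random.RandomState(12)
for col in X_test.columns:
    X_permuted = X_test.copy()
    # Shuffle a single column and measure how much the score drops
    X_permuted[col] = rng.permutation(X_permuted[col].values)
    manual_perm_importance[col] = baseline_score - rf.score(X_permuted, y_test)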


Permutation importance can be easily computed:


perm_importance = permutation_importance(rf, X_test, y_test)

To plot the importance:


sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(boston.feature_names[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")


[Figure: sorted bar chart of the permutation-based feature importances]

The permutation-based method can have problems with highly correlated features: it may report them as unimportant. It is also computationally expensive.
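
That cost can be reduced. The permutation_importance function accepts n_repeats (how many times each feature is shuffled) and n_jobs (parallelism), so a faster but noisier estimate might look like this sketch (the values below are arbitrary choices, not recommendations):

perm_importance_fast = permutation_importance(
    rf, X_test, y_test,
    n_repeats=3,      # fewer shuffles per feature: faster, noisier
    n_jobs=-1,        # use all available CPU cores
    random_state=12,  # reproducible shuffles
)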


Compute Importance from SHAP Values

The SHAP interpretation can be used (it is model-agnostic) to compute the feature importances from the Random Forest. It uses the Shapley values from game theory to estimate how each feature contributes to the prediction. It can be easily installed (pip install shap) and used with a scikit-learn Random Forest:


explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

To plot the feature importance as a horizontal bar plot, we need to use the summary_plot method:


shap.summary_plot(shap_values, X_test, plot_type="bar")


[Figure: SHAP feature importance bar plot]
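
The bar plot above ranks features by the mean absolute SHAP value. If you want those numbers directly (for example, to build a custom chart), a minimal sketch could look like this; the variable names are mine:

# Mean absolute SHAP value per feature - the quantity the bar summary plot is based on
shap_importance = np.abs(shap_values).mean(axis=0)
sorted_idx = shap_importance.argsort()
plt.barh(boston.feature_names[sorted_idx], shap_importance[sorted_idx])
plt.xlabel("mean(|SHAP value|)")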

The feature importances can be plotted in more detail, showing the feature values:


shap.summary_plot(shap_values, X_test)


[Figure: detailed SHAP summary plot showing feature values]

Computing feature importances with SHAP can be computationally expensive. However, it can provide more information, like decision plots or dependence plots.
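
For example, a dependence plot for a single feature can be drawn with shap.dependence_plot; the feature chosen below (LSTAT, one of the Boston columns) is just an illustrative pick:

# How the LSTAT value relates to its SHAP contribution, colored by an
# automatically selected interacting feature.
shap.dependence_plot("LSTAT", shap_values, X_test)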


Summary

Three ways to compute the feature importance for the scikit-learn Random Forest were presented:


  • built-in feature importance
  • permutation-based importance
  • computed with SHAP values

In my opinion, it is always good to check all methods and compare the results. I'm using permutation- and SHAP-based methods in MLJAR's AutoML open-source package mljar-supervised. I'm using them because they are model-agnostic and work well with algorithms that are not from scikit-learn: Xgboost, Neural Networks (keras+tensorflow), LightGBM, CatBoost.
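
One simple way to do that comparison is to put all three measures into a single DataFrame. This is a sketch that reuses the objects computed earlier in this post; normalizing each column so it sums to 1 is my own choice to make the scales comparable:

importances = pd.DataFrame({
    "built_in": rf.feature_importances_,
    "permutation": perm_importance.importances_mean,
    "shap": np.abs(shap_values).mean(axis=0),
}, index=boston.feature_names)
# Normalize each column so the three measures can be compared side by side.
importances = importances / importances.sum(axis=0)
print(importances.sort_values("shap", ascending=False))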


Important Notes

  • The more accurate the model is, the more trustworthy the computed importances are.
  • The computed importances describe how important features are for the machine learning model. They are only an approximation of how important the features are in the data.

mljar-supervised is an open-source Automated Machine Learning (AutoML) Python package that works with tabular data. It is designed to save time for a data scientist. It abstracts the common way to preprocess the data, construct the machine learning models, and perform hyper-parameter tuning to find the best model. It is not a black box, as you can see exactly how the ML pipeline is constructed (with a detailed Markdown report for each ML model).




The example report generated with the mljar-supervised AutoML package.

Originally published at https://mljar.com on June 29, 2020.



Translated from: https://towardsdatascience.com/the-3-ways-to-compute-feature-importance-in-the-random-forest-96c86b49e6d4
