About this article

In this article, we will try to get a deeper understanding of what each of the parameters in the Random Forest algorithm does. This is not an explanation of how the algorithm works. (You might want to start with a simple explanation of how the algorithm works, found here: A pictorial guide to understanding Random Forest Algorithm.)


Packages

The packages we will be looking at are:


sklearn.ensemble.RandomForestClassifier


(for the Random Forest Classifier algorithm found in the sklearn library)


sklearn.ensemble.RandomForestRegressor


(for the Random Forest regressor algorithm)


Random Forest Classifier — parameters

1. n_estimators (default = 100)

Since the Random Forest algorithm is an ensemble modelling technique, it "increases the generalization" by creating a number of trees with different depths and sizes. n_estimators is the number of trees you want the algorithm to create.


Random Forest is an ensemble modelling technique (Image by Author)
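As a minimal sketch of this parameter (the toy dataset and random seed here are illustrative, assuming scikit-learn is installed), the fitted forest holds exactly as many trees as n_estimators requests:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# toy classification dataset; sizes and seed are illustrative
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# ask the algorithm to build 50 trees instead of the default 100
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X, y)

print(len(clf.estimators_))  # the fitted forest holds exactly 50 trees
```

The individual fitted trees are available in the `estimators_` attribute, which is how you can inspect each member of the ensemble.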

2. criterion (default = "gini")

The measure used to decide where/on which feature a tree has to be split can be computed by one of two methods: Gini impurity or entropy.


For example, suppose there are two features, gender and nationality, on which the split could be made. The algorithm evaluates the split on both features and chooses the one that yields the lower entropy or lower Gini impurity after the split.


A split happens if entropy or Gini impurity reduces (Image by Author)

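To make the two measures concrete, here is a small hand-rolled sketch (the helper names and toy labels are illustrative, not scikit-learn internals): both measures are 0 for a perfectly pure node and largest for a 50/50 mixed node, which is why a split that lowers them is preferred.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    counts = Counter(labels)
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in counts.values())

def entropy(labels):
    """Shannon entropy: -sum of p * log2(p) over classes."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

pure = ["yes"] * 4           # a perfectly pure node
mixed = ["yes", "no"] * 2    # a 50/50 mixed node

print(gini(pure), gini(mixed))        # 0.0 vs 0.5
print(entropy(pure), entropy(mixed))  # 0.0 vs 1.0
```

In scikit-learn you simply pass `criterion="gini"` or `criterion="entropy"` to the classifier constructor rather than computing these yourself.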

3. max_depth (default = None)


max_depth limits how far down the tree can be expanded, from the root node until we get to the leaf nodes.


Side note: Generally, in a tree-based algorithm, the greater the depth, the higher the chance that it over-fits the data. Since Random Forest ensembles several different trees together, it is generally accepted to have deep trees.


What is depth (Image by Author)

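A quick sketch of the cap in action (dataset and seed illustrative, assuming scikit-learn): with `max_depth=3` no tree in the forest can grow past 3 levels, while with the default `None` each tree grows until its leaves are pure (or another stopping condition is hit).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# cap every tree at 3 levels below the root
shallow = RandomForestClassifier(max_depth=3, random_state=0).fit(X, y)

# default: no cap, trees grow until leaves are pure
deep = RandomForestClassifier(max_depth=None, random_state=0).fit(X, y)

print(max(t.get_depth() for t in shallow.estimators_))  # never exceeds 3
print(max(t.get_depth() for t in deep.estimators_))     # typically much deeper
```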

4. min_samples_split (default = 2)


We can specify the minimum number of elements/records that must be present in a node for the algorithm to consider splitting it further.


e.g.: If we set min_samples_split to 60, then after 4 splits, if a node still has more than 60 elements or records, it is a potential candidate to be split further, i.e. the splitting continues as long as a node holds more than 60 records.


When does splitting end (Image by Author)

5. max_features (default = "auto")


At every split, the algorithm randomly chooses a subset of features on which to base the split. max_features determines how many features are considered when determining the best split.


max_features (Image by Author)
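A short sketch (dataset and seed illustrative; note that in recent scikit-learn releases the classifier's default changed from "auto" to "sqrt", so pinning an explicit value keeps behaviour predictable): each fitted tree records the resolved feature count in its `max_features_` attribute.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=16, random_state=0)

# consider only 4 randomly chosen features when searching for each split
clf = RandomForestClassifier(max_features=4, random_state=0).fit(X, y)

# each tree in the forest resolves max_features to a concrete count
print(clf.estimators_[0].max_features_)  # 4
```

max_features also accepts "sqrt", "log2", or a float fraction of the total feature count.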

6. bootstrap (default = True)


Once we provide the training data to the RandomForestClassifier model, the algorithm randomly selects a bunch of rows with replacement to build each tree. This process is called bootstrapping (random sampling with replacement). If the bootstrap option is set to False, no random selection happens and the whole dataset is used to create every tree.

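A minimal sketch of both settings (dataset and seed illustrative): a useful side effect of bootstrapping is the out-of-bag score, an accuracy estimate computed on the rows each tree did not see, which is only available when bootstrap=True.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

# default: each tree trains on a bootstrap sample (rows drawn with replacement);
# oob_score evaluates each tree on its left-out rows, so it needs bootstrap=True
boot = RandomForestClassifier(bootstrap=True, oob_score=True,
                              random_state=0).fit(X, y)
print(boot.oob_score_)  # out-of-bag accuracy estimate

# bootstrap=False: every tree sees the full training set
full = RandomForestClassifier(bootstrap=False, random_state=0).fit(X, y)
```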

Other related articles


A pictorial guide to understanding Random Forest Algorithm — explains, using images, how the RandomForestClassifier algorithm works.


Translated from: https://medium.com/swlh/understanding-the-random-forest-function-parameters-in-scikit-learn-9f42fde0101