Does feature collinearity affect the predictive performance of a random forest model?

Why do we care about feature collinearity?

Feature collinearity means that features in a dataset track each other too closely, i.e. they are highly correlated. Think of rainfall and cloud size, or fabric fiber content and water absorbency.

In many machine learning models, collinearity is a problem: it can bias the model toward certain features and obscure the contribution of others, especially in multi-feature regression tasks.
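To see what goes wrong in a linear setting, here is a minimal sketch on synthetic data (everything in it is made up for illustration, not taken from this post's dataset): two nearly identical predictors make the individual linear-regression coefficients unstable, even though their sum, and hence the predictions, stay sensible.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x1 = rng.rand(500)
x2 = x1 + rng.normal(scale=1e-3, size=500)   # nearly collinear copy of x1
y = 2 * x1 + rng.normal(scale=0.1, size=500)

coefs = LinearRegression().fit(np.column_stack([x1, x2]), y).coef_
print(coefs, coefs.sum())   # individual coefficients can be large and offsetting;
                            # their sum stays close to the true value of 2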

For random forests, however, collinearity turns out to have essentially no effect on predictive performance. This post examines that claim empirically.


# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

import warnings
warnings.filterwarnings('ignore')
# Show the current working directory
%pwd
'D:\\python code\\9日常\\--------20210723特征共线对随机森林模型的影响--------\\TheDataVolcano-master'
# Load the data; the dataset is available at the address below:
# https://catalog.data.gov/dataset/state-of-new-york-mortgage-agency-sonyma-loans-purchased-beginning-2004
df=pd.read_csv('./datasets/State_of_New_York_Mortgage_Agency.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28528 entries, 0 to 28527
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Bond Series              28528 non-null  object
 1   Original Loan Amount     28528 non-null  object
 2   Loan Purchase Date       28528 non-null  object
 3   Purchase Year            28528 non-null  int64 
 4   Original Loan To Value   28528 non-null  object
 5   Loan Type                28528 non-null  object
 6   SONYMA DPAL/CCAL Amount  21012 non-null  object
 7   Original Term            28528 non-null  int64 
 8   County                   28528 non-null  object
 9   FIPS Code                28528 non-null  int64 
 10  Number of Units          28528 non-null  object
 11  Property Type            28528 non-null  object
 12  Housing Type             28528 non-null  object
 13  Household Size           28528 non-null  int64 
dtypes: int64(4), object(10)
memory usage: 3.0+ MB
df.head(10)
      Bond Series  Original Loan Amount  Loan Purchase Date  Purchase Year  Original Loan To Value  Loan Type     SONYMA DPAL/CCAL Amount  Original Term  County       FIPS Code  Number of Units  Property Type  Housing Type  Household Size
0  Series 109/110  $32470  01/02/2004  2004  97%  Conventional  $2933  360  Monroe       36055  1 Family  Detached  Existing  1
1  Series 109/110  $48500  01/02/2004  2004  97%  Conventional  $3435  360  Genesee      36037  1 Family  Detached  Existing  4
2  Series 109/110  $49470  01/02/2004  2004  97%  Conventional  $4996  360  Monroe       36055  1 Family  Detached  Existing  3
3  Series 109/110  $58200  01/02/2004  2004  97%  Conventional  $4170  360  Erie         36029  1 Family  Detached  Existing  2
4  Series 109/110  $64990  01/02/2004  2004  97%  Conventional  $4940  360  Erie         36029  1 Family  Detached  Existing  3
5  Series 109/110  $64990  01/02/2004  2004  97%  Conventional  $4772  360  Schenectady  36093  1 Family  Detached  Existing  1
6  Series 109/110  $67900  01/02/2004  2004  97%  Conventional  $5000  360  Orleans      36073  1 Family  Detached  Existing  3
7  Series 109/110  $67900  01/02/2004  2004  97%  Conventional  $4845  360  Wayne        36117  1 Family  Detached  Existing  2
8  Series 109/110  $72775  01/02/2004  2004  97%  Conventional  $5000  360  Monroe       36055  1 Family  Detached  Existing  2
9  Series 109/110  $77115  01/02/2004  2004  97%  Conventional  $5000  360  Broome       36007  1 Family  Detached  Existing  2

df.columns
Index(['Bond Series', 'Original Loan Amount', 'Loan Purchase Date ',
       'Purchase Year', 'Original Loan To Value', 'Loan Type ',
       'SONYMA DPAL/CCAL Amount', 'Original Term', 'County', 'FIPS Code',
       'Number of Units', 'Property Type', 'Housing Type', 'Household Size '],
      dtype='object')
dfmod=df[['Original Loan Amount', 'Purchase Year', 'Original Loan To Value', 'SONYMA DPAL/CCAL Amount', 'Number of Units', \
'Household Size ', 'Property Type', 'County', 'Housing Type', 'Bond Series', 'Original Term']]

# turn off warnings on the slice operation we do below. 
# This is a unique factorize problem because it returns a tuple, sigh
# https://stackoverflow.com/questions/45080400/dealing-with-pandas-settingwithcopywarning-without-indexer
pd.options.mode.chained_assignment = None 

# factorize maps categorical labels like 'condo' and 'house' to integer codes (0, 1, etc.) so our model can handle them
stacked = dfmod[['Property Type', 'County', 'Housing Type', 'Bond Series']].stack()
dfmod[['Property Type', 'County', 'Housing Type', 'Bond Series']] = pd.Series(stacked.factorize()[0], \
                                                                              index=stacked.index).unstack()

# use regex replace to fix some of the columns that have partial numeric, partial text values
dfmod=dfmod.replace(r'[\$,]', '', regex=True)
dfmod=dfmod.replace(r'[%,]', '', regex=True)
dfmod=dfmod.replace('Family', '', regex=True)

# need to convert to float
dfmod=dfmod.astype(float)
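For intuition on what factorize does, here is a toy example (independent of this dataset, values invented for illustration): it returns integer codes plus the unique labels in first-seen order.

import pandas as pd

# factorize returns (integer codes, the unique labels in first-seen order)
codes, uniques = pd.factorize(pd.Series(['condo', 'house', 'condo']))
print(codes)    # [0 1 0]
print(uniques)  # Index(['condo', 'house'], dtype='object')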

Which features in the prepared dataset are correlated?

dfmod.columns
Index(['Original Loan Amount', 'Purchase Year', 'Original Loan To Value',
       'SONYMA DPAL/CCAL Amount', 'Number of Units', 'Household Size ',
       'Property Type', 'County', 'Housing Type', 'Bond Series',
       'Original Term'],
      dtype='object')
dfmod.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28528 entries, 0 to 28527
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Original Loan Amount     28528 non-null  float64
 1   Purchase Year            28528 non-null  float64
 2   Original Loan To Value   28528 non-null  float64
 3   SONYMA DPAL/CCAL Amount  21012 non-null  float64
 4   Number of Units          28528 non-null  float64
 5   Household Size           28528 non-null  float64
 6   Property Type            28528 non-null  float64
 7   County                   28528 non-null  float64
 8   Housing Type             28528 non-null  float64
 9   Bond Series              28528 non-null  float64
 10  Original Term            28528 non-null  float64
dtypes: float64(11)
memory usage: 2.4 MB
dfmod.head(10)
   Original Loan Amount  Purchase Year  Original Loan To Value  SONYMA DPAL/CCAL Amount  Number of Units  Household Size  Property Type  County  Housing Type  Bond Series  Original Term
0               32470.0         2004.0                    97.0                   2933.0              1.0             1.0            0.0     1.0           2.0          3.0          360.0
1               48500.0         2004.0                    97.0                   3435.0              1.0             4.0            0.0     4.0           2.0          3.0          360.0
2               49470.0         2004.0                    97.0                   4996.0              1.0             3.0            0.0     1.0           2.0          3.0          360.0
3               58200.0         2004.0                    97.0                   4170.0              1.0             2.0            0.0     5.0           2.0          3.0          360.0
4               64990.0         2004.0                    97.0                   4940.0              1.0             3.0            0.0     5.0           2.0          3.0          360.0
5               64990.0         2004.0                    97.0                   4772.0              1.0             1.0            0.0     6.0           2.0          3.0          360.0
6               67900.0         2004.0                    97.0                   5000.0              1.0             3.0            0.0     7.0           2.0          3.0          360.0
7               67900.0         2004.0                    97.0                   4845.0              1.0             2.0            0.0     8.0           2.0          3.0          360.0
8               72775.0         2004.0                    97.0                   5000.0              1.0             2.0            0.0     1.0           2.0          3.0          360.0
9               77115.0         2004.0                    97.0                   5000.0              1.0             2.0            0.0     9.0           2.0          3.0          360.0

# Plot a correlation heatmap
from mlxtend.plotting import heatmap

cols = ['Original Loan Amount', 'Purchase Year', 'Original Loan To Value',\
       'SONYMA DPAL/CCAL Amount', 'Number of Units', 'Household Size ',\
       'Property Type', 'County', 'Housing Type', 'Bond Series',\
       'Original Term']
cm = np.corrcoef(dfmod[cols].values.T)
"""
下图中的nan出现的原因是,'SONYMA DPAL/CCAL Amount'含有空值null
"""
hm = heatmap(cm, row_names=cols, column_names=cols, figsize=(12, 12))

# Save the figure
plt.savefig('./heatmaps.png', dpi=300)
plt.show()

[Figure: correlation heatmap of the features in dfmod]

# test for correlations 
corrDF=dfmod.corr()
corrDF
                         Original Loan Amount  Purchase Year  Original Loan To Value  SONYMA DPAL/CCAL Amount  Number of Units  Household Size  Property Type    County  Housing Type  Bond Series  Original Term
Original Loan Amount                 1.000000       0.337831               -0.056902                 0.662054         0.112723        0.238369       0.101085  0.232890      0.124133     0.325947       0.184459
Purchase Year                        0.337831       1.000000               -0.152347                -0.062682        -0.005365        0.073343       0.149763  0.105100      0.113273     0.922574       0.058512
Original Loan To Value              -0.056902      -0.152347                1.000000                -0.091755         0.028671       -0.033281      -0.294189 -0.189966     -0.294904    -0.167013      -0.002098
SONYMA DPAL/CCAL Amount              0.662054      -0.062682               -0.091755                 1.000000         0.050282        0.202176       0.025931  0.160047      0.152185    -0.078872       0.199689
Number of Units                      0.112723      -0.005365                0.028671                 0.050282         1.000000       -0.004223      -0.003898 -0.027416      0.013539    -0.006489      -0.002910
Household Size                       0.238369       0.073343               -0.033281                 0.202176        -0.004223        1.000000      -0.043792  0.107912      0.085088     0.073733       0.076269
Property Type                        0.101085       0.149763               -0.294189                 0.025931        -0.003898       -0.043792       1.000000  0.197686      0.224163     0.158472       0.030414
County                               0.232890       0.105100               -0.189966                 0.160047        -0.027416        0.107912       0.197686  1.000000      0.174262     0.115525       0.050321
Housing Type                         0.124133       0.113273               -0.294904                 0.152185         0.013539        0.085088       0.224163  0.174262      1.000000     0.134284       0.046703
Bond Series                          0.325947       0.922574               -0.167013                -0.078872        -0.006489        0.073733       0.158472  0.115525      0.134284     1.000000       0.054699
Original Term                        0.184459       0.058512               -0.002098                 0.199689        -0.002910        0.076269       0.030414  0.050321      0.046703     0.054699       1.000000

A look at the table above: apart from 'Bond Series' and 'Purchase Year', which track each other closely (0.92, presumably because bond series are issued sequentially over the years), the features are not highly correlated; the strongest remaining relationship is the 0.66 correlation between 'Original Loan Amount' and 'SONYMA DPAL/CCAL Amount'.
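As a quick programmatic check (a convenience sketch, not part of the original notebook), the strongly correlated pairs can be pulled straight out of corrDF:

# keep only the upper triangle (each pair once, no self-correlations), then filter
pairs = corrDF.where(np.triu(np.ones(corrDF.shape, dtype=bool), k=1)).stack()
print(pairs[pairs.abs() > 0.6].sort_values(ascending=False))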

Fabricate some strongly correlated fake data

Generate a new column, 'Grandmas Loan Agency', that is highly correlated with 'SONYMA DPAL/CCAL Amount', and recompute the correlations:

# multipliers between 0.9 and 1.1 (a deterministic ramp, despite the name)
randoms = np.linspace(0.9, 1.1, len(dfmod))
dfmod['Grandmas Loan Agency']=dfmod['SONYMA DPAL/CCAL Amount']*randoms
corrDF=dfmod.corr()
corrDF
                         Original Loan Amount  Purchase Year  Original Loan To Value  SONYMA DPAL/CCAL Amount  Number of Units  Household Size  Property Type    County  Housing Type  Bond Series  Original Term  Grandmas Loan Agency
Original Loan Amount                 1.000000       0.337831               -0.056902                 0.662054         0.112723        0.238369       0.101085  0.232890      0.124133     0.325947       0.184459              0.683658
Purchase Year                        0.337831       1.000000               -0.152347                -0.062682        -0.005365        0.073343       0.149763  0.105100      0.113273     0.922574       0.058512              0.030616
Original Loan To Value              -0.056902      -0.152347                1.000000                -0.091755         0.028671       -0.033281      -0.294189 -0.189966     -0.294904    -0.167013      -0.002098             -0.105809
SONYMA DPAL/CCAL Amount              0.662054      -0.062682               -0.091755                 1.000000         0.050282        0.202176       0.025931  0.160047      0.152185    -0.078872       0.199689              0.993475
Number of Units                      0.112723      -0.005365                0.028671                 0.050282         1.000000       -0.004223      -0.003898 -0.027416      0.013539    -0.006489      -0.002910              0.048050
Household Size                       0.238369       0.073343               -0.033281                 0.202176        -0.004223        1.000000      -0.043792  0.107912      0.085088     0.073733       0.076269              0.207818
Property Type                        0.101085       0.149763               -0.294189                 0.025931        -0.003898       -0.043792       1.000000  0.197686      0.224163     0.158472       0.030414              0.038477
County                               0.232890       0.105100               -0.189966                 0.160047        -0.027416        0.107912       0.197686  1.000000      0.174262     0.115525       0.050321              0.167676
Housing Type                         0.124133       0.113273               -0.294904                 0.152185         0.013539        0.085088       0.224163  0.174262      1.000000     0.134284       0.046703              0.167975
Bond Series                          0.325947       0.922574               -0.167013                -0.078872        -0.006489        0.073733       0.158472  0.115525      0.134284     1.000000       0.054699              0.009896
Original Term                        0.184459       0.058512               -0.002098                 0.199689        -0.002910        0.076269       0.030414  0.050321      0.046703     0.054699       1.000000              0.207590
Grandmas Loan Agency                 0.683658       0.030616               -0.105809                 0.993475         0.048050        0.207818       0.038477  0.167676      0.167975     0.009896       0.207590              1.000000

Build a random forest model

from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.model_selection import train_test_split

def ourModel(data, result):
    # inputs:
    # data   = pandas DataFrame of predictors (X)
    # result = column with the target we want to predict (y)

    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        data, result, test_size=0.25, random_state=1)

    # set up and fit the model
    # (oob_score=True also stores an out-of-bag R^2 in clf.oob_score_, not printed here)
    clf = RandomForestRegressor(n_estimators=100, n_jobs=4, oob_score=True)
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    print('r2: ' + str(metrics.r2_score(y_test, predictions)))   # r2_score expects (y_true, y_pred)
    print('mse: ' + str(metrics.mean_squared_error(y_test, predictions)))

    # feature importances, sorted from least to most important
    importances = clf.feature_importances_
    indices = np.argsort(importances)
    fp = zip(data.columns.values[indices], importances[indices])

    return fp

Run the function above

dfmod=dfmod.dropna()
result=dfmod['Original Loan Amount']
data_fake=dfmod.drop(['Original Loan Amount'], axis=1)

print('WITH MADE UP DATA')
fp_fake=ourModel(data_fake, result)

data_nofake=dfmod.drop(['Original Loan Amount', 'Grandmas Loan Agency'], axis=1)
print("\nWITHOUT MADE UP DATA")
fp_nofake=ourModel(data_nofake, result)
WITH MADE UP DATA
r2: 0.8805216008148813
mse: 637577470.5899562

WITHOUT MADE UP DATA
r2: 0.8784191345199206
mse: 645873903.6564941

Engineer a few more features in an attempt to improve the model further

# let's add some financial data
# our purchase years range from 2004 to 2016. Let's get the average mortgage rate in those years
# googled and found at : http://www.freddiemac.com/pmms/pmms30.html
years=np.linspace(2004., 2016., 13)
mort30=np.array([5.84, 5.87, 6.41, 6.34, 6.03, 5.04, 4.69, 4.45, 3.66, 3.98, 4.17, 3.85, 3.65])
dfmod['mort']=[mort30[years==dfmod['Purchase Year'].iloc[item]][0] for item in range(len(dfmod['Purchase Year']))]


# what else can we add? Maybe how many houses were purchased in that year 
# from https://www.statista.com/statistics/219963/number-of-us-house-sales/
housesBought=np.array([1203, 1283, 1051, 776, 485, 375, 323, 306, 368, 429, 437, 501, 560])*1000.
dfmod['housesBought']=[housesBought[years==dfmod['Purchase Year'].iloc[item]][0] for item in range(len(dfmod['Purchase Year']))]

#  and just because we don't want all our new data depending on year, let's do one about
#  expected wealth by family size in NY
#  source: https://www.justice.gov/ust/eo/bapcpa/20130501/bci_data/median_income_table.htm
#  assume > 4 is = 4

# let's make a "wealthy vs Poor" category
dfmod['income']=[0 if dfmod['Household Size '].iloc[x] < 3 or dfmod['Household Size '].iloc[x]\
    > 4 else 1 for x in range(len(dfmod['Household Size '])) ]
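# (Aside, not in the original notebook: the year-lookup list comprehensions
#  above can be replaced by much faster dict lookups via Series.map, e.g.
#      dfmod['mort'] = dfmod['Purchase Year'].map(dict(zip(years, mort30)))
#  which gives the same result because 'Purchase Year' holds the same float
#  values as `years`.)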

#
# run the model again
dfmod=dfmod.dropna()
result=dfmod['Original Loan Amount']
data_fake=dfmod.drop(['Original Loan Amount'], axis=1)

print('WITH MADE UP DATA, Round 2!')
fp_fake=ourModel(data_fake, result)

data_nofake=dfmod.drop(['Original Loan Amount', 'Grandmas Loan Agency'], axis=1)
print("\nWITHOUT MADE UP DATA, Round 2!")
fp_nofake=ourModel(data_nofake, result)

# feature ranking for importances
print('\nFeatures in order of importance for fake:')
print(list(fp_fake))

print('\nFeatures in order of importance for NOT fake:')
print(list(fp_nofake))
WITH MADE UP DATA, Round 2!
r2: 0.8805119731776188
mse: 636127397.198097

WITHOUT MADE UP DATA, Round 2!
r2: 0.87777395649925
mse: 650974675.1996832

Features in order of importance for fake:
[('Original Term', 0.001193970255686264), ('income', 0.001962303252954665), ('housesBought', 0.0032933784659805597), ('Number of Units', 0.004629807343641515), ('Housing Type', 0.005638302615442727), ('Household Size ', 0.00818667858229475), ('Purchase Year', 0.010329245640254558), ('mort', 0.01719017704279596), ('Bond Series', 0.01786888405078895), ('Property Type', 0.02191737837312151), ('Original Loan To Value', 0.06067438107071076), ('County', 0.060777486720751416), ('SONYMA DPAL/CCAL Amount', 0.08007650880918302), ('Grandmas Loan Agency', 0.7062614977763934)]

Features in order of importance for NOT fake:
[('Original Term', 0.001331898105323796), ('Number of Units', 0.004787708555639128), ('Housing Type', 0.005980303169527446), ('Household Size ', 0.010415186933312198), ('Property Type', 0.022150192124642493), ('Bond Series', 0.02455088805708598), ('Purchase Year', 0.039911871359296906), ('Original Loan To Value', 0.062332516902338694), ('County', 0.064640597446016), ('SONYMA DPAL/CCAL Amount', 0.7638988373468174)]

Notice that when the fabricated 'Grandmas Loan Agency' feature carries a high importance score, the importance of 'SONYMA DPAL/CCAL Amount' collapses to roughly 0.08; with 'Grandmas Loan Agency' removed, 'SONYMA DPAL/CCAL Amount' climbs back to about 0.76. In summary:

The predictive performance of a random forest is essentially unaffected by multicollinearity.

Interpretability, however, is affected. A random forest reports feature importances, and under multicollinearity those importances are distorted: collinear features effectively split each other's importance, which makes it harder to interpret and understand which features truly matter.

A simple intuition: multicollinear features do not affect the predictive power of decision trees or random forests.

The most extreme case of multicollinearity is two identical features, A and B. Once a tree has split on feature A, splitting on feature B gains nothing, because B adds no new information; likewise, if a tree happens to split on B first, A will not be chosen afterwards. Across a whole forest, random feature subsampling means some trees use A and others use B, so the importance that a single unique feature would have earned is split between the two copies, which is exactly the dilution we observed above.
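To make that intuition concrete, here is a minimal self-contained sketch on synthetic data (names and numbers are invented for illustration): duplicating a feature leaves R² essentially unchanged, while the two copies split the importance the single feature earned on its own.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics

rng = np.random.RandomState(0)
X = pd.DataFrame({'a': rng.rand(2000), 'b': rng.rand(2000)})
y = 3 * X['a'] + rng.normal(scale=0.1, size=2000)

for dup in (False, True):
    Xc = X.copy()
    if dup:
        Xc['a_copy'] = Xc['a']   # a perfectly collinear duplicate of 'a'
    X_tr, X_te, y_tr, y_te = train_test_split(Xc, y, random_state=1)
    rf = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_tr, y_tr)
    print('duplicated:', dup,
          '| r2:', round(metrics.r2_score(y_te, rf.predict(X_te)), 3),
          '| importances:', dict(zip(Xc.columns, rf.feature_importances_.round(2))))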