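Does feature collinearity affect the predictive performance of a random forest model?
Why do we care about feature collinearity?
Feature collinearity means that features in a dataset track each other too closely, i.e. they are highly correlated. Examples: rainfall and the size of the cloud mass, or fabric fiber and water absorbency.
In machine learning, feature collinearity is usually considered a bad thing. It can bias a model toward some features and lose information, especially in regression tasks with many features.
In practice, though, feature collinearity has little effect on the predictive performance of a random forest. This post discusses how feature collinearity affects a random forest model.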
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import warnings
warnings.filterwarnings('ignore')
# show the current working directory
%pwd
'D:\\python code\\9日常\\--------20210723特征共线对随机森林模型的影响--------\\TheDataVolcano-master'
# load the data; the dataset is available at the URL below
# https://catalog.data.gov/dataset/state-of-new-york-mortgage-agency-sonyma-loans-purchased-beginning-2004
df=pd.read_csv('./datasets/State_of_New_York_Mortgage_Agency.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28528 entries, 0 to 28527
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Bond Series 28528 non-null object
1 Original Loan Amount 28528 non-null object
2 Loan Purchase Date 28528 non-null object
3 Purchase Year 28528 non-null int64
4 Original Loan To Value 28528 non-null object
5 Loan Type 28528 non-null object
6 SONYMA DPAL/CCAL Amount 21012 non-null object
7 Original Term 28528 non-null int64
8 County 28528 non-null object
9 FIPS Code 28528 non-null int64
10 Number of Units 28528 non-null object
11 Property Type 28528 non-null object
12 Housing Type 28528 non-null object
13 Household Size 28528 non-null int64
dtypes: int64(4), object(10)
memory usage: 3.0+ MB
df.head(10)
| Bond Series | Original Loan Amount | Loan Purchase Date | Purchase Year | Original Loan To Value | Loan Type | SONYMA DPAL/CCAL Amount | Original Term | County | FIPS Code | Number of Units | Property Type | Housing Type | Household Size |
0 | Series 109/110 | $32470 | 01/02/2004 | 2004 | 97% | Conventional | $2933 | 360 | Monroe | 36055 | 1 Family | Detached | Existing | 1 |
1 | Series 109/110 | $48500 | 01/02/2004 | 2004 | 97% | Conventional | $3435 | 360 | Genesee | 36037 | 1 Family | Detached | Existing | 4 |
2 | Series 109/110 | $49470 | 01/02/2004 | 2004 | 97% | Conventional | $4996 | 360 | Monroe | 36055 | 1 Family | Detached | Existing | 3 |
3 | Series 109/110 | $58200 | 01/02/2004 | 2004 | 97% | Conventional | $4170 | 360 | Erie | 36029 | 1 Family | Detached | Existing | 2 |
4 | Series 109/110 | $64990 | 01/02/2004 | 2004 | 97% | Conventional | $4940 | 360 | Erie | 36029 | 1 Family | Detached | Existing | 3 |
5 | Series 109/110 | $64990 | 01/02/2004 | 2004 | 97% | Conventional | $4772 | 360 | Schenectady | 36093 | 1 Family | Detached | Existing | 1 |
6 | Series 109/110 | $67900 | 01/02/2004 | 2004 | 97% | Conventional | $5000 | 360 | Orleans | 36073 | 1 Family | Detached | Existing | 3 |
7 | Series 109/110 | $67900 | 01/02/2004 | 2004 | 97% | Conventional | $4845 | 360 | Wayne | 36117 | 1 Family | Detached | Existing | 2 |
8 | Series 109/110 | $72775 | 01/02/2004 | 2004 | 97% | Conventional | $5000 | 360 | Monroe | 36055 | 1 Family | Detached | Existing | 2 |
9 | Series 109/110 | $77115 | 01/02/2004 | 2004 | 97% | Conventional | $5000 | 360 | Broome | 36007 | 1 Family | Detached | Existing | 2 |
df.columns
Index(['Bond Series', 'Original Loan Amount', 'Loan Purchase Date ',
'Purchase Year', 'Original Loan To Value', 'Loan Type ',
'SONYMA DPAL/CCAL Amount', 'Original Term', 'County', 'FIPS Code',
'Number of Units', 'Property Type', 'Housing Type', 'Household Size '],
dtype='object')
dfmod=df[['Original Loan Amount', 'Purchase Year', 'Original Loan To Value', 'SONYMA DPAL/CCAL Amount', 'Number of Units', \
'Household Size ', 'Property Type', 'County', 'Housing Type', 'Bond Series', 'Original Term']]
# turn off warnings on the slice operation we do below.
# This is a unique factorize problem because it returns a tuple, sigh
# https://stackoverflow.com/questions/45080400/dealing-with-pandas-settingwithcopywarning-without-indexer
pd.options.mode.chained_assignment = None
# factorize changes features from like 'condo' and 'house' to numeric (1, 2, etc.) so our model can handle it
stacked = dfmod[['Property Type', 'County', 'Housing Type', 'Bond Series']].stack()
dfmod[['Property Type', 'County', 'Housing Type', 'Bond Series']] = pd.Series(stacked.factorize()[0], \
index=stacked.index).unstack()
# use regex replace to fix some of the columns that have partial numeric, partial text values
# raw strings avoid invalid-escape warnings for the regex patterns
dfmod=dfmod.replace(r'[$,]', '', regex=True)
dfmod=dfmod.replace(r'[%,]', '', regex=True)
dfmod=dfmod.replace('Family', '', regex=True)
# need to convert to float
dfmod=dfmod.astype(float)
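One detail of the stack-then-factorize step above is easy to miss: because the four columns are stacked into a single Series first, they share one set of integer codes instead of each column starting from 0. The toy frame below is made up for illustration; it contrasts the shared codes with a per-column alternative.

```python
import pandas as pd

# Hypothetical toy data, not taken from the SONYMA dataset.
toy = pd.DataFrame({"A": ["x", "y"], "B": ["y", "z"]})

# Same pattern as above: stack all columns, factorize once, unstack.
stacked = toy.stack()
shared = pd.Series(stacked.factorize()[0], index=stacked.index).unstack()

# Alternative: factorize each column on its own, so every column starts at 0.
per_column = pd.DataFrame(
    {col: pd.factorize(toy[col])[0] for col in toy.columns},
    index=toy.index,
)

print(shared)
print(per_column)
```

This is why the first row of the dfmod.head(10) output further down shows Property Type 0.0, County 1.0, Housing Type 2.0 and Bond Series 3.0: the codes are handed out in stacking order. A tree model only needs distinct values per column, so the shared codes are most likely harmless here.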
Which features in the original dataset are correlated?
dfmod.columns
Index(['Original Loan Amount', 'Purchase Year', 'Original Loan To Value',
'SONYMA DPAL/CCAL Amount', 'Number of Units', 'Household Size ',
'Property Type', 'County', 'Housing Type', 'Bond Series',
'Original Term'],
dtype='object')
dfmod.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28528 entries, 0 to 28527
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Original Loan Amount 28528 non-null float64
1 Purchase Year 28528 non-null float64
2 Original Loan To Value 28528 non-null float64
3 SONYMA DPAL/CCAL Amount 21012 non-null float64
4 Number of Units 28528 non-null float64
5 Household Size 28528 non-null float64
6 Property Type 28528 non-null float64
7 County 28528 non-null float64
8 Housing Type 28528 non-null float64
9 Bond Series 28528 non-null float64
10 Original Term 28528 non-null float64
dtypes: float64(11)
memory usage: 2.4 MB
dfmod.head(10)
| Original Loan Amount | Purchase Year | Original Loan To Value | SONYMA DPAL/CCAL Amount | Number of Units | Household Size | Property Type | County | Housing Type | Bond Series | Original Term |
0 | 32470.0 | 2004.0 | 97.0 | 2933.0 | 1.0 | 1.0 | 0.0 | 1.0 | 2.0 | 3.0 | 360.0 |
1 | 48500.0 | 2004.0 | 97.0 | 3435.0 | 1.0 | 4.0 | 0.0 | 4.0 | 2.0 | 3.0 | 360.0 |
2 | 49470.0 | 2004.0 | 97.0 | 4996.0 | 1.0 | 3.0 | 0.0 | 1.0 | 2.0 | 3.0 | 360.0 |
3 | 58200.0 | 2004.0 | 97.0 | 4170.0 | 1.0 | 2.0 | 0.0 | 5.0 | 2.0 | 3.0 | 360.0 |
4 | 64990.0 | 2004.0 | 97.0 | 4940.0 | 1.0 | 3.0 | 0.0 | 5.0 | 2.0 | 3.0 | 360.0 |
5 | 64990.0 | 2004.0 | 97.0 | 4772.0 | 1.0 | 1.0 | 0.0 | 6.0 | 2.0 | 3.0 | 360.0 |
6 | 67900.0 | 2004.0 | 97.0 | 5000.0 | 1.0 | 3.0 | 0.0 | 7.0 | 2.0 | 3.0 | 360.0 |
7 | 67900.0 | 2004.0 | 97.0 | 4845.0 | 1.0 | 2.0 | 0.0 | 8.0 | 2.0 | 3.0 | 360.0 |
8 | 72775.0 | 2004.0 | 97.0 | 5000.0 | 1.0 | 2.0 | 0.0 | 1.0 | 2.0 | 3.0 | 360.0 |
9 | 77115.0 | 2004.0 | 97.0 | 5000.0 | 1.0 | 2.0 | 0.0 | 9.0 | 2.0 | 3.0 | 360.0 |
# plot the heatmap
from mlxtend.plotting import heatmap
cols = ['Original Loan Amount', 'Purchase Year', 'Original Loan To Value',\
'SONYMA DPAL/CCAL Amount', 'Number of Units', 'Household Size ',\
'Property Type', 'County', 'Housing Type', 'Bond Series',\
'Original Term']
cm = np.corrcoef(dfmod[cols].values.T)
"""
The NaN values in the figure below appear because 'SONYMA DPAL/CCAL Amount' contains null values
"""
hm = heatmap(cm, row_names=cols, column_names=cols, figsize=(12, 12))
# save the figure
plt.savefig('./heatmaps.png', dpi=300)
plt.show()
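The NaN cells come from how np.corrcoef treats missing values: a single NaN in a column propagates into every coefficient that involves that column. A minimal sketch with made-up numbers, comparing that behavior with keeping only the rows where both columns are present (similar in spirit to DataFrame.corr, which excludes missing values):

```python
import numpy as np

# Made-up columns; the second one has a missing value.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, np.nan, 6.5, 8.0, 9.5])

# NaN propagates through np.corrcoef.
r_with_nan = np.corrcoef(a, b)[0, 1]

# Keep only rows where both values are present, then correlate.
mask = ~np.isnan(a) & ~np.isnan(b)
r_complete = np.corrcoef(a[mask], b[mask])[0, 1]

print(r_with_nan, r_complete)
```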
# test for correlations
corrDF=dfmod.corr()
corrDF
| Original Loan Amount | Purchase Year | Original Loan To Value | SONYMA DPAL/CCAL Amount | Number of Units | Household Size | Property Type | County | Housing Type | Bond Series | Original Term |
Original Loan Amount | 1.000000 | 0.337831 | -0.056902 | 0.662054 | 0.112723 | 0.238369 | 0.101085 | 0.232890 | 0.124133 | 0.325947 | 0.184459 |
Purchase Year | 0.337831 | 1.000000 | -0.152347 | -0.062682 | -0.005365 | 0.073343 | 0.149763 | 0.105100 | 0.113273 | 0.922574 | 0.058512 |
Original Loan To Value | -0.056902 | -0.152347 | 1.000000 | -0.091755 | 0.028671 | -0.033281 | -0.294189 | -0.189966 | -0.294904 | -0.167013 | -0.002098 |
SONYMA DPAL/CCAL Amount | 0.662054 | -0.062682 | -0.091755 | 1.000000 | 0.050282 | 0.202176 | 0.025931 | 0.160047 | 0.152185 | -0.078872 | 0.199689 |
Number of Units | 0.112723 | -0.005365 | 0.028671 | 0.050282 | 1.000000 | -0.004223 | -0.003898 | -0.027416 | 0.013539 | -0.006489 | -0.002910 |
Household Size | 0.238369 | 0.073343 | -0.033281 | 0.202176 | -0.004223 | 1.000000 | -0.043792 | 0.107912 | 0.085088 | 0.073733 | 0.076269 |
Property Type | 0.101085 | 0.149763 | -0.294189 | 0.025931 | -0.003898 | -0.043792 | 1.000000 | 0.197686 | 0.224163 | 0.158472 | 0.030414 |
County | 0.232890 | 0.105100 | -0.189966 | 0.160047 | -0.027416 | 0.107912 | 0.197686 | 1.000000 | 0.174262 | 0.115525 | 0.050321 |
Housing Type | 0.124133 | 0.113273 | -0.294904 | 0.152185 | 0.013539 | 0.085088 | 0.224163 | 0.174262 | 1.000000 | 0.134284 | 0.046703 |
Bond Series | 0.325947 | 0.922574 | -0.167013 | -0.078872 | -0.006489 | 0.073733 | 0.158472 | 0.115525 | 0.134284 | 1.000000 | 0.054699 |
Original Term | 0.184459 | 0.058512 | -0.002098 | 0.199689 | -0.002910 | 0.076269 | 0.030414 | 0.050321 | 0.046703 | 0.054699 | 1.000000 |
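The table shows that most features are only weakly correlated. The exceptions are 'Purchase Year' and 'Bond Series' (0.92), and 'Original Loan Amount' and 'SONYMA DPAL/CCAL Amount', whose correlation coefficient reaches 0.66.
Make up some strongly correlated fake data
Create a new column, 'Grandmas Loan Agency', that is highly correlated with 'SONYMA DPAL/CCAL Amount' by multiplying that column by a factor that rises linearly from 0.9 to 1.1 (despite its name, the randoms array below is not random). The result: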
randoms = np.linspace(0.9, 1.1, len(dfmod))
dfmod['Grandmas Loan Agency']=dfmod['SONYMA DPAL/CCAL Amount']*randoms
corrDF=dfmod.corr()
corrDF
| Original Loan Amount | Purchase Year | Original Loan To Value | SONYMA DPAL/CCAL Amount | Number of Units | Household Size | Property Type | County | Housing Type | Bond Series | Original Term | Grandmas Loan Agency |
Original Loan Amount | 1.000000 | 0.337831 | -0.056902 | 0.662054 | 0.112723 | 0.238369 | 0.101085 | 0.232890 | 0.124133 | 0.325947 | 0.184459 | 0.683658 |
Purchase Year | 0.337831 | 1.000000 | -0.152347 | -0.062682 | -0.005365 | 0.073343 | 0.149763 | 0.105100 | 0.113273 | 0.922574 | 0.058512 | 0.030616 |
Original Loan To Value | -0.056902 | -0.152347 | 1.000000 | -0.091755 | 0.028671 | -0.033281 | -0.294189 | -0.189966 | -0.294904 | -0.167013 | -0.002098 | -0.105809 |
SONYMA DPAL/CCAL Amount | 0.662054 | -0.062682 | -0.091755 | 1.000000 | 0.050282 | 0.202176 | 0.025931 | 0.160047 | 0.152185 | -0.078872 | 0.199689 | 0.993475 |
Number of Units | 0.112723 | -0.005365 | 0.028671 | 0.050282 | 1.000000 | -0.004223 | -0.003898 | -0.027416 | 0.013539 | -0.006489 | -0.002910 | 0.048050 |
Household Size | 0.238369 | 0.073343 | -0.033281 | 0.202176 | -0.004223 | 1.000000 | -0.043792 | 0.107912 | 0.085088 | 0.073733 | 0.076269 | 0.207818 |
Property Type | 0.101085 | 0.149763 | -0.294189 | 0.025931 | -0.003898 | -0.043792 | 1.000000 | 0.197686 | 0.224163 | 0.158472 | 0.030414 | 0.038477 |
County | 0.232890 | 0.105100 | -0.189966 | 0.160047 | -0.027416 | 0.107912 | 0.197686 | 1.000000 | 0.174262 | 0.115525 | 0.050321 | 0.167676 |
Housing Type | 0.124133 | 0.113273 | -0.294904 | 0.152185 | 0.013539 | 0.085088 | 0.224163 | 0.174262 | 1.000000 | 0.134284 | 0.046703 | 0.167975 |
Bond Series | 0.325947 | 0.922574 | -0.167013 | -0.078872 | -0.006489 | 0.073733 | 0.158472 | 0.115525 | 0.134284 | 1.000000 | 0.054699 | 0.009896 |
Original Term | 0.184459 | 0.058512 | -0.002098 | 0.199689 | -0.002910 | 0.076269 | 0.030414 | 0.050321 | 0.046703 | 0.054699 | 1.000000 | 0.207590 |
Grandmas Loan Agency | 0.683658 | 0.030616 | -0.105809 | 0.993475 | 0.048050 | 0.207818 | 0.038477 | 0.167676 | 0.167975 | 0.009896 | 0.207590 | 1.000000 |
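To see why this construction yields a correlation close to 1, the same recipe can be applied to made-up numbers. This is only a sketch: the uniform range, the sample size and the seed are assumptions chosen for illustration, not properties of the SONYMA data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the 'SONYMA DPAL/CCAL Amount' column.
base = rng.uniform(1000.0, 5000.0, size=1000)

# Same recipe as above: a factor that ramps linearly from 0.9 to 1.1.
factor = np.linspace(0.9, 1.1, len(base))
derived = base * factor

r = np.corrcoef(base, derived)[0, 1]
print(r)
```

The factor changes each value by at most 10%, so the derived column stays almost proportional to the original one.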
Build a random forest model
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.model_selection import train_test_split
def ourModel(data, result):
    # inputs
    # data = pandas data frame (X)
    # result = column of desired result (y)
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        data, result, test_size=0.25, random_state=1)
    # set up the model (the forest has no random_state, so scores vary slightly between runs)
    clf = RandomForestRegressor(n_estimators=100, n_jobs=4, oob_score=True)
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    # sklearn metrics take (y_true, y_pred); the original call passed them swapped,
    # which changes r2 (not mse), so the r2 values recorded below may differ slightly on a rerun
    print('r2: ' + str(metrics.r2_score(y_test, predictions)))
    print('mse: ' + str(metrics.mean_squared_error(y_test, predictions)))
    # feature importances, sorted in ascending order
    importances = clf.feature_importances_
    indices = np.argsort(importances)
    fp = zip(data.columns.values[indices], importances[indices])
    return fp
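A note on argument order: scikit-learn's r2_score expects (y_true, y_pred). R² compares the residuals against the variance of the true values, so swapping the two arguments changes the score, whereas mean squared error is symmetric. The sketch below re-implements both metrics from their textbook definitions in NumPy (the helper names r2 and mse are made up for illustration) and shows the difference on toy numbers.

```python
import numpy as np

def r2(y_true, y_pred):
    # 1 - (residual sum of squares) / (total sum of squares of y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def mse(y_true, y_pred):
    # mean of squared residuals; the order of the arguments does not matter
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([2.0, 2.0, 3.0, 3.0])  # predictions pulled toward the mean

print(r2(y_true, y_pred), r2(y_pred, y_true))
print(mse(y_true, y_pred), mse(y_pred, y_true))
```

Predictions from an averaging model such as a random forest tend to have less spread than the targets, which is the situation where the swapped call differs most.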
Run the function above
dfmod=dfmod.dropna()
result=dfmod['Original Loan Amount']
data_fake=dfmod.drop(['Original Loan Amount'], axis=1)
print('WITH MADE UP DATA')
fp_fake=ourModel(data_fake, result)
data_nofake=dfmod.drop(['Original Loan Amount', 'Grandmas Loan Agency'], axis=1)
print("\nWITHOUT MADE UP DATA")
fp_nofake=ourModel(data_nofake, result)
WITH MADE UP DATA
r2: 0.8805216008148813
mse: 637577470.5899562
WITHOUT MADE UP DATA
r2: 0.8784191345199206
mse: 645873903.6564941
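The MSE values are hard to read because they are in squared dollars. Taking the square root gives an RMSE in the same unit as the loan amount. A quick calculation from the two numbers printed above:

```python
import math

# MSE values copied from the output above.
mse_with_fake = 637577470.5899562
mse_without_fake = 645873903.6564941

rmse_with_fake = math.sqrt(mse_with_fake)
rmse_without_fake = math.sqrt(mse_without_fake)

print(round(rmse_with_fake), round(rmse_without_fake))
print(round(rmse_without_fake - rmse_with_fake))
```

Both models are off by roughly 25,000 dollars on a typical loan, and the gap between them is small by comparison, which is consistent with the nearly identical r2 values.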
Engineer some extra features to try to improve the model further
# let's add some financial data
# our purchase years range from 2004 to 2016. Let's get the average mortgage rate in those years
# googled and found at : http://www.freddiemac.com/pmms/pmms30.html
years=np.linspace(2004., 2016., 13)
mort30=np.array([5.84, 5.87, 6.41, 6.34, 6.03, 5.04, 4.69, 4.45, 3.66, 3.98, 4.17, 3.85, 3.65])
dfmod['mort']=[mort30[years==dfmod['Purchase Year'].iloc[item]][0] for item in range(len(dfmod['Purchase Year']))]
# what else can we add? Maybe how many houses were purchased in that year
# from https://www.statista.com/statistics/219963/number-of-us-house-sales/
housesBought=np.array([1203, 1283, 1051, 776, 485, 375, 323, 306, 368, 429, 437, 501, 560])*1000.
dfmod['housesBought']=[housesBought[years==dfmod['Purchase Year'].iloc[item]][0] for item in range(len(dfmod['Purchase Year']))]
# and just because we don't want all our new data depending on year, let's do one about
# expected wealth by family size in NY
# source: https://www.justice.gov/ust/eo/bapcpa/20130501/bci_data/median_income_table.htm
# assume > 4 is = 4
# let's make a "wealthy vs poor" category
# (note: as written below, household sizes above 4 get 0, the same as sizes 1 and 2)
dfmod['income']=[0 if dfmod['Household Size '].iloc[x] < 3 or dfmod['Household Size '].iloc[x]\
> 4 else 1 for x in range(len(dfmod['Household Size '])) ]
#
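The per-row list comprehensions above work, but the same year-to-value lookup can also be written with Series.map. A sketch on a three-row toy frame, reusing the mort30 values from above; the frame itself is made up for illustration.

```python
import numpy as np
import pandas as pd

years = np.arange(2004, 2017, dtype=float)  # 2004.0 ... 2016.0
mort30 = np.array([5.84, 5.87, 6.41, 6.34, 6.03, 5.04, 4.69,
                   4.45, 3.66, 3.98, 4.17, 3.85, 3.65])

# Lookup table: year -> average 30-year mortgage rate.
rate_by_year = dict(zip(years.tolist(), mort30.tolist()))

# Hypothetical toy frame standing in for dfmod.
toy = pd.DataFrame({"Purchase Year": [2004.0, 2016.0, 2009.0]})
toy["mort"] = toy["Purchase Year"].map(rate_by_year)

print(toy)
```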
# run the model again
dfmod=dfmod.dropna()
result=dfmod['Original Loan Amount']
data_fake=dfmod.drop(['Original Loan Amount'], axis=1)
print('WITH MADE UP DATA, Round 2!')
fp_fake=ourModel(data_fake, result)
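# bug fix: the original assigned this frame to data_fake, so the "without" run recorded below
# reused the round-1 feature set (no mort, housesBought, or income)
data_nofake=dfmod.drop(['Original Loan Amount', 'Grandmas Loan Agency'], axis=1)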
print("\nWITHOUT MADE UP DATA, Round 2!")
fp_nofake=ourModel(data_nofake, result)
# feature ranking for importances
print('\nFeatures in order of importance for fake:')
print(list(fp_fake))
print('\nFeatures in order of importance for NOT fake:')
print(list(fp_nofake))
WITH MADE UP DATA, Round 2!
r2: 0.8805119731776188
mse: 636127397.198097
WITHOUT MADE UP DATA, Round 2!
r2: 0.87777395649925
mse: 650974675.1996832
Features in order of importance for fake:
[('Original Term', 0.001193970255686264), ('income', 0.001962303252954665), ('housesBought', 0.0032933784659805597), ('Number of Units', 0.004629807343641515), ('Housing Type', 0.005638302615442727), ('Household Size ', 0.00818667858229475), ('Purchase Year', 0.010329245640254558), ('mort', 0.01719017704279596), ('Bond Series', 0.01786888405078895), ('Property Type', 0.02191737837312151), ('Original Loan To Value', 0.06067438107071076), ('County', 0.060777486720751416), ('SONYMA DPAL/CCAL Amount', 0.08007650880918302), ('Grandmas Loan Agency', 0.7062614977763934)]
Features in order of importance for NOT fake:
[('Original Term', 0.001331898105323796), ('Number of Units', 0.004787708555639128), ('Housing Type', 0.005980303169527446), ('Household Size ', 0.010415186933312198), ('Property Type', 0.022150192124642493), ('Bond Series', 0.02455088805708598), ('Purchase Year', 0.039911871359296906), ('Original Loan To Value', 0.062332516902338694), ('County', 0.064640597446016), ('SONYMA DPAL/CCAL Amount', 0.7638988373468174)]
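As the rankings show, when 'Grandmas Loan Agency' is present it takes most of the importance (about 0.71), and the feature it was derived from, 'SONYMA DPAL/CCAL Amount', drops to about 0.08.
Once 'Grandmas Loan Agency' is removed, the importance of 'SONYMA DPAL/CCAL Amount' is 0.7638988373468174. The two collinear features together (about 0.79) carry roughly the importance that the single feature carries on its own. In summary:
The predictive power of a random forest is largely unaffected by multicollinearity.
Interpretability, however, is affected. A random forest can report feature importances, and multicollinearity distorts them: the importance is shared among the collinear features, so each one can look less important than it really is, which makes the features harder to interpret.
A simple way to see it: collinear features do not hurt the predictive power of decision trees or random forests.
The most extreme case of multicollinearity is two identical features, A and B. At any split, A and B offer exactly the same gain, so once the tree has split on A, splitting on B at the same threshold adds no new information. Likewise, if the tree picks B first, A adds nothing.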
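The duplicated-feature argument can be checked on synthetic data. The sketch below is an illustration under assumptions (made-up data, a small forest, fixed seeds), not a rerun of the experiment above. It fits one forest on a signal feature plus an unrelated feature, and a second forest on the same data with the signal column duplicated.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)      # the feature that drives the target
unrelated = rng.normal(size=n)   # a feature with no relation to the target
y = 3.0 * signal + rng.normal(scale=0.1, size=n)

X_single = np.column_stack([signal, unrelated])
X_dup = np.column_stack([signal, signal, unrelated])  # exact duplicate column

def fit_and_report(X, y):
    # Returns the test-set R^2 and the impurity-based feature importances.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=1)
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te), model.feature_importances_

score_single, imp_single = fit_and_report(X_single, y)
score_dup, imp_dup = fit_and_report(X_dup, y)

print('R^2 without / with duplicate:', score_single, score_dup)
print('importances without duplicate:', imp_single)
print('importances with duplicate:', imp_dup)
```

In a run like this the two scores should be close, while the importance that the signal feature had on its own is divided between the two identical columns.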