This post is a short summary, mostly for my own benefit, of feature selection/extraction and the machine-learning workflow, written to deepen my understanding of the steps and the reasoning behind them.
The data comes in two files: train.csv, the training set, and test.csv, the test set.
import pandas as pd
train=pd.read_csv('./train.csv',index_col=0)
test=pd.read_csv('./test.csv',index_col=0)
Step one: check the data quality with some quick exploration.
print(train.shape) # row and column counts of the training and test sets
print(test.shape)
(1176, 35)
(294, 34)
Next, check for missing values. Linear models such as the logistic regression used below cannot be fit on data containing NaNs, so any gaps would have to be imputed or dropped (see the sketch after the output below).
print(train.isnull().sum()) # no missing values
print(test.isnull().sum())  # no missing values
Age 0
Attrition 0
BusinessTravel 0
DailyRate 0
Department 0
DistanceFromHome 0
Education 0
EducationField 0
EmployeeCount 0
EmployeeNumber 0
EnvironmentSatisfaction 0
Gender 0
HourlyRate 0
JobInvolvement 0
JobLevel 0
JobRole 0
JobSatisfaction 0
MaritalStatus 0
MonthlyIncome 0
MonthlyRate 0
NumCompaniesWorked 0
Over18 0
OverTime 0
PercentSalaryHike 0
PerformanceRating 0
RelationshipSatisfaction 0
StandardHours 0
StockOptionLevel 0
TotalWorkingYears 0
TrainingTimesLastYear 0
WorkLifeBalance 0
YearsAtCompany 0
YearsInCurrentRole 0
YearsSinceLastPromotion 0
YearsWithCurrManager 0
dtype: int64
Age 0
BusinessTravel 0
DailyRate 0
Department 0
DistanceFromHome 0
Education 0
EducationField 0
EmployeeCount 0
EmployeeNumber 0
EnvironmentSatisfaction 0
Gender 0
HourlyRate 0
JobInvolvement 0
JobLevel 0
JobRole 0
JobSatisfaction 0
MaritalStatus 0
MonthlyIncome 0
MonthlyRate 0
NumCompaniesWorked 0
Over18 0
OverTime 0
PercentSalaryHike 0
PerformanceRating 0
RelationshipSatisfaction 0
StandardHours 0
StockOptionLevel 0
TotalWorkingYears 0
TrainingTimesLastYear 0
WorkLifeBalance 0
YearsAtCompany 0
YearsInCurrentRole 0
YearsSinceLastPromotion 0
YearsWithCurrManager 0
dtype: int64
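There happen to be no missing values here, but for completeness, a minimal sketch of what the imputation step could look like if there were any. The 50% cutoff and the median/mode strategy are my own assumptions, not anything this dataset requires:

# hypothetical handling, only needed if isnull().sum() reported gaps:
# drop columns that are mostly empty, then impute the rest
sparse_cols = train.columns[train.isnull().mean() > 0.5]  # assumed cutoff
train = train.drop(columns=sparse_cols)

for col in train.columns[train.isnull().any()]:
    if train[col].dtype.kind in 'if':          # numeric -> median
        train[col] = train[col].fillna(train[col].median())
    else:                                      # categorical -> mode
        train[col] = train[col].fillna(train[col].mode()[0])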
Now a more detailed descriptive pass over the data, mainly as preparation for feature selection.
# descriptive statistics
print(train.describe())
Age DailyRate DistanceFromHome Education EmployeeCount \
count 1176.000000 1176.000000 1176.000000 1176.000000 1176.0
mean 36.805272 802.033163 9.159864 2.918367 1.0
std 9.065549 405.946729 8.137224 1.009809 0.0
min 18.000000 104.000000 1.000000 1.000000 1.0
25% 30.000000 463.500000 2.000000 2.000000 1.0
50% 36.000000 805.500000 7.000000 3.000000 1.0
75% 42.250000 1162.000000 14.000000 4.000000 1.0
max 60.000000 1499.000000 29.000000 5.000000 1.0
EmployeeNumber EnvironmentSatisfaction HourlyRate JobInvolvement \
count 1176.000000 1176.000000 1176.000000 1176.000000
mean 1026.960034 2.750850 65.130102 2.724490
std 594.763609 1.096221 20.294326 0.715027
min 1.000000 1.000000 30.000000 1.000000
25% 498.750000 2.000000 48.000000 2.000000
50% 1031.000000 3.000000 65.000000 3.000000
75% 1555.250000 4.000000 82.250000 3.000000
max 2068.000000 4.000000 100.000000 4.000000
JobLevel JobSatisfaction MonthlyIncome MonthlyRate \
count 1176.000000 1176.000000 1176.000000 1176.000000
mean 2.055272 2.732993 6458.690476 14247.159864
std 1.106040 1.102477 4724.845883 7133.767499
min 1.000000 1.000000 1009.000000 2094.000000
25% 1.000000 2.000000 2858.750000 7912.750000
50% 2.000000 3.000000 4850.500000 14225.500000
75% 3.000000 4.000000 8380.250000 20372.500000
max 5.000000 4.000000 19999.000000 26999.000000
NumCompaniesWorked PercentSalaryHike PerformanceRating \
count 1176.000000 1176.000000 1176.000000
mean 2.703231 15.152211 3.150510
std 2.521301 3.652543 0.357723
min 0.000000 11.000000 3.000000
25% 1.000000 12.000000 3.000000
50% 2.000000 14.000000 3.000000
75% 4.000000 18.000000 3.000000
max 9.000000 25.000000 4.000000
RelationshipSatisfaction StandardHours StockOptionLevel \
count 1176.000000 1176.0 1176.000000
mean 2.714286 80.0 0.805272
std 1.080583 0.0 0.865611
min 1.000000 80.0 0.000000
25% 2.000000 80.0 0.000000
50% 3.000000 80.0 1.000000
75% 4.000000 80.0 1.000000
max 4.000000 80.0 3.000000
TotalWorkingYears TrainingTimesLastYear WorkLifeBalance \
count 1176.000000 1176.000000 1176.000000
mean 11.161565 2.767007 2.764456
std 7.747576 1.250756 0.713251
min 0.000000 0.000000 1.000000
25% 6.000000 2.000000 2.000000
50% 10.000000 3.000000 3.000000
75% 15.000000 3.000000 3.000000
max 40.000000 6.000000 4.000000
YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion \
count 1176.000000 1176.00000 1176.000000
mean 6.982143 4.19898 2.160714
std 6.094338 3.63124 3.208052
min 0.000000 0.00000 0.000000
25% 3.000000 2.00000 0.000000
50% 5.000000 3.00000 1.000000
75% 9.000000 7.00000 2.250000
max 40.000000 18.00000 15.000000
YearsWithCurrManager
count 1176.000000
mean 4.098639
std 3.564190
min 0.000000
25% 2.000000
50% 3.000000
75% 7.000000
max 17.000000
The most useful row here is std, the standard deviation. A column with std = 0 is constant across all rows and carries no information for modeling.
# Drop the columns whose std was 0 above (EmployeeCount, StandardHours): a constant column is useless to the model and only adds training time. I also drop YearsInCurrentRole and YearsSinceLastPromotion, which largely duplicate YearsAtCompany.
# Age has a minimum of 18, so the Over18 column is redundant; EmployeeNumber goes too, since business-wise an employee's ID has no bearing on attrition (no numerology here).
cols_to_drop = ['EmployeeNumber', 'StandardHours', 'EmployeeCount',
                'YearsInCurrentRole', 'YearsSinceLastPromotion', 'Over18']
train = train.drop(cols_to_drop, axis=1)
test = test.drop(cols_to_drop, axis=1)
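For reference, instead of reading std = 0 off the describe() output by eye, the constant columns could also have been found programmatically. A small sketch, assuming it runs before the drop above:

# standard deviation of every numeric column; std == 0 means constant
num_std = train.std(numeric_only=True)
print(num_std[num_std == 0].index.tolist())
# expected from the describe() table: ['EmployeeCount', 'StandardHours']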
With the spread-based selection done, we also need to look at correlations between features: when two features are very strongly correlated (Pearson coefficient > 0.8), keeping just one of them is enough.
# check feature correlations
import matplotlib.pyplot as plt
import seaborn as sns
corr = train.corr(numeric_only=True) # pandas computes pairwise Pearson correlations directly; numeric_only skips the string columns (older pandas skipped them implicitly)
sns.heatmap(corr,xticklabels=corr.columns.values, yticklabels=corr.columns.values)
plt.show()
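Reading exact values off a heatmap is error-prone, so as a cross-check, a small sketch that lists every feature pair with |Pearson| > 0.8 directly:

import numpy as np
# keep only the upper triangle so each pair is counted once
upper = corr.abs().where(np.triu(np.ones(corr.shape), k=1).astype(bool))
strong_pairs = upper.stack()   # MultiIndex (feature_a, feature_b) -> |r|
print(strong_pairs[strong_pairs > 0.8].sort_values(ascending=False))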
From the heatmap (and the listing above), JobLevel, MonthlyIncome and TotalWorkingYears are strongly correlated with one another, as are PercentSalaryHike and PerformanceRating, so we drop JobLevel, TotalWorkingYears and PerformanceRating.
# A Pearson coefficient > 0.8 means two variables are in a clear linear relationship and only one needs to stay, hence dropping JobLevel, TotalWorkingYears and PerformanceRating
train.drop(['JobLevel','TotalWorkingYears','PerformanceRating'],axis=1,inplace=True)
test.drop(['JobLevel','TotalWorkingYears','PerformanceRating'],axis=1,inplace=True)
That completes the feature-selection stage. Next, since some features are non-numeric and cannot enter the computation as they are, we convert them to numbers with LabelEncoder.
from sklearn.preprocessing import LabelEncoder

# Columns to encode. Note that Age and Education are already numeric, so
# encoding them only re-maps their values; the truly categorical columns
# are the other seven.
attr = ['Age', 'BusinessTravel', 'Department', 'Education', 'EducationField',
        'Gender', 'JobRole', 'MaritalStatus', 'OverTime']
lbe_list = []
for feature in attr:
    lbe = LabelEncoder()
    train[feature] = lbe.fit_transform(train[feature])
    test[feature] = lbe.transform(test[feature])
    lbe_list.append(lbe)

# map the target: Attrition 'Yes' -> 1, 'No' -> 0
train['Attrition'] = train['Attrition'].map(lambda x: 1 if x == 'Yes' else 0)
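One caveat with fitting the encoders on train only (my own note, not part of the assignment): lbe.transform(test[feature]) raises a ValueError if the test set contains a value the encoder never saw during fit, which is quite possible for a column like Age. A variant that fits each encoder on the union of both files avoids this, at the cost of peeking at the test inputs:

# drop-in replacement for the loop above: fit on train+test values
# so transform() never encounters an unseen label
for feature in attr:
    lbe = LabelEncoder()
    lbe.fit(pd.concat([train[feature], test[feature]]))
    train[feature] = lbe.transform(train[feature])
    test[feature] = lbe.transform(test[feature])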
Once the conversion is done, I strongly recommend exporting the processed features to a file and eyeballing the result for anything you missed. It feels like a good habit.
# export the processed features for inspection
train.to_csv('train_label_encoder.csv')
Finally, the modeling and training stage. Here we use a logistic regression (LR) model.
# modeling: logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train.drop('Attrition',axis=1), train['Attrition'], test_size=0.2, random_state=42)
This is the part I'm least happy with. At first I trained with the default constructor, but it failed to converge. I then raised the iteration cap and set the stopping tolerance, and with more iterations it did converge, but the predictions still don't feel great. I'd appreciate any advice from the teacher on what I'm overlooking. Thanks!
lr = LogisticRegression(max_iter=10000,
                        verbose=True,
                        random_state=33,
                        tol=1e-4)
lr.fit(X_train,y_train)
print('lr train score:{:.3f}'.format(lr.score(X_train,y_train)))
print('lr test score:{:.3f}'.format(lr.score(X_test,y_test)))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
lr train score:0.869
lr test score:0.818
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.7s finished
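On the convergence question above: the usual first suspect is unscaled features. MonthlyIncome runs into the tens of thousands while most satisfaction scores sit between 1 and 4, and the lbfgs solver struggles with such lopsided scales. A sketch (my suggestion, not part of the original solution) that standardizes the inputs in a Pipeline, which normally lets the solver converge with the default max_iter:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scale every feature to zero mean / unit variance before the LR step
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(random_state=33))
pipe.fit(X_train, y_train)
print('scaled lr test score:{:.3f}'.format(pipe.score(X_test, y_test)))

A side benefit: coefficients fitted on standardized features are directly comparable in magnitude, which would make the ranking in the next step more trustworthy.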
Finally, for interpretability, we pull out the five features with the largest coefficient magnitudes to see what drives attrition most.
# Top 5 coefficients by absolute value: the biggest drivers here turn out to be overtime, marital status, job involvement, department and gender
import numpy as np
# coef_ has shape (1, n_features); transpose it to one row per feature
df_coef = pd.DataFrame(index=X_train.columns, data=np.transpose(lr.coef_))
df_coef['abs'] = df_coef.iloc[:, 0].abs()
df_coef = df_coef.sort_values(by='abs', ascending=False)
print(df_coef.head())
0 abs
OverTime 1.087580 1.087580
MaritalStatus 0.615751 0.615751
JobInvolvement -0.390932 0.390932
Department 0.338802 0.338802
Gender 0.311672 0.311672
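One loose end: test.csv was loaded at the start but never scored. A minimal sketch to round off the workflow, assuming its columns match X_train after the same preprocessing (the output filename is my own choice):

# predicted attrition probability for each employee in test.csv
test_pred = lr.predict_proba(test)[:, 1]
pd.DataFrame({'Attrition': test_pred}, index=test.index).to_csv('submission.csv')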