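This post is a brief summary of my own work on feature selection and machine learning, written partly to deepen my understanding of the steps and principles involved.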

The data comes in two files: train.csv (the training set) and test.csv (the test set).

import pandas as pd

train=pd.read_csv('./train.csv',index_col=0)
test=pd.read_csv('./test.csv',index_col=0)

The first step is to check data quality with some simple data exploration.

print(train.shape)   # quick look at the number of rows and columns in the training and test sets
print(test.shape)



(1176, 35)
(294, 34)

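Next, check for missing values. For logistic regression, any missing values have to be imputed or removed first; otherwise the model cannot be fitted.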

print(train.isnull().sum())     # no missing values
print(test.isnull().sum())      # no missing values



Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
dtype: int64
Age                         0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
dtype: int64
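This dataset has no missing values, so nothing needs fixing here. For reference, a minimal sketch of the two usual options (dropping or imputing) could look like the following. The `toy` frame and its values are made up for illustration and are not part of the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real train.csv has no missing values.
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 41.0, 25.0],
    "OverTime": ["Yes", "No", None, "No"],
})

# Count missing values per column, same call as above.
missing_counts = toy.isnull().sum()

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill the numeric column with its median
# and the categorical column with its most frequent value.
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["OverTime"] = filled["OverTime"].fillna(filled["OverTime"].mode()[0])

print(missing_counts)
print(dropped.shape)
print(filled)
```

Which option is appropriate depends on how many rows are affected; dropping is simplest when only a handful of rows have gaps.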

Next comes a more detailed descriptive exploration of the data, mainly in preparation for feature selection.

# Descriptive statistics
print(train.describe())


               Age    DailyRate  DistanceFromHome    Education  EmployeeCount  \
count  1176.000000  1176.000000       1176.000000  1176.000000         1176.0   
mean     36.805272   802.033163          9.159864     2.918367            1.0   
std       9.065549   405.946729          8.137224     1.009809            0.0   
min      18.000000   104.000000          1.000000     1.000000            1.0   
25%      30.000000   463.500000          2.000000     2.000000            1.0   
50%      36.000000   805.500000          7.000000     3.000000            1.0   
75%      42.250000  1162.000000         14.000000     4.000000            1.0   
max      60.000000  1499.000000         29.000000     5.000000            1.0   

       EmployeeNumber  EnvironmentSatisfaction   HourlyRate  JobInvolvement  \
count     1176.000000              1176.000000  1176.000000     1176.000000   
mean      1026.960034                 2.750850    65.130102        2.724490   
std        594.763609                 1.096221    20.294326        0.715027   
min          1.000000                 1.000000    30.000000        1.000000   
25%        498.750000                 2.000000    48.000000        2.000000   
50%       1031.000000                 3.000000    65.000000        3.000000   
75%       1555.250000                 4.000000    82.250000        3.000000   
max       2068.000000                 4.000000   100.000000        4.000000   

          JobLevel  JobSatisfaction  MonthlyIncome   MonthlyRate  \
count  1176.000000      1176.000000    1176.000000   1176.000000   
mean      2.055272         2.732993    6458.690476  14247.159864   
std       1.106040         1.102477    4724.845883   7133.767499   
min       1.000000         1.000000    1009.000000   2094.000000   
25%       1.000000         2.000000    2858.750000   7912.750000   
50%       2.000000         3.000000    4850.500000  14225.500000   
75%       3.000000         4.000000    8380.250000  20372.500000   
max       5.000000         4.000000   19999.000000  26999.000000   

       NumCompaniesWorked  PercentSalaryHike  PerformanceRating  \
count         1176.000000        1176.000000        1176.000000   
mean             2.703231          15.152211           3.150510   
std              2.521301           3.652543           0.357723   
min              0.000000          11.000000           3.000000   
25%              1.000000          12.000000           3.000000   
50%              2.000000          14.000000           3.000000   
75%              4.000000          18.000000           3.000000   
max              9.000000          25.000000           4.000000   

       RelationshipSatisfaction  StandardHours  StockOptionLevel  \
count               1176.000000         1176.0       1176.000000   
mean                   2.714286           80.0          0.805272   
std                    1.080583            0.0          0.865611   
min                    1.000000           80.0          0.000000   
25%                    2.000000           80.0          0.000000   
50%                    3.000000           80.0          1.000000   
75%                    4.000000           80.0          1.000000   
max                    4.000000           80.0          3.000000   

       TotalWorkingYears  TrainingTimesLastYear  WorkLifeBalance  \
count        1176.000000            1176.000000      1176.000000   
mean           11.161565               2.767007         2.764456   
std             7.747576               1.250756         0.713251   
min             0.000000               0.000000         1.000000   
25%             6.000000               2.000000         2.000000   
50%            10.000000               3.000000         3.000000   
75%            15.000000               3.000000         3.000000   
max            40.000000               6.000000         4.000000   

       YearsAtCompany  YearsInCurrentRole  YearsSinceLastPromotion  \
count     1176.000000          1176.00000              1176.000000   
mean         6.982143             4.19898                 2.160714   
std          6.094338             3.63124                 3.208052   
min          0.000000             0.00000                 0.000000   
25%          3.000000             2.00000                 0.000000   
50%          5.000000             3.00000                 1.000000   
75%          9.000000             7.00000                 2.250000   
max         40.000000            18.00000                15.000000   

       YearsWithCurrManager  
count           1176.000000  
mean               4.098639  
std                3.564190  
min                0.000000  
25%                2.000000  
50%                3.000000  
75%                7.000000  
max               17.000000

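The most important statistic here is std, the standard deviation (the square root of the variance). A column whose std is 0 holds the same value in every row, so it has no value for modeling.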

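# Start with variance: drop the columns whose std is 0 in the output above (StandardHours, EmployeeCount). Their values are identical in every row, so they carry no information and only add training time.
# The minimum Age is 18, so the Over18 column can go. The employee number can go as well: in business terms it has no necessary connection to attrition, and I am not going to appeal to numerology here.
# Note: YearsInCurrentRole and YearsSinceLastPromotion are also dropped in this step, although their std is not 0.
cols_to_drop = ['EmployeeNumber', 'StandardHours', 'EmployeeCount', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'Over18']
train = train.drop(cols_to_drop, axis=1)
test = test.drop(cols_to_drop, axis=1)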
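Instead of reading std = 0 off the describe() output by eye, constant columns can also be found programmatically. The sketch below is one way to do it, using a made-up `toy` frame rather than the real data. It relies on `nunique()`, which also catches constant string columns such as Over18 that describe() does not show by default.

```python
import pandas as pd

# Hypothetical toy frame mimicking a few columns of the dataset.
toy = pd.DataFrame({
    "Age": [30, 45, 28, 39],
    "EmployeeCount": [1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80],
    "Over18": ["Y", "Y", "Y", "Y"],
})

# A column with at most one distinct value is constant,
# whether it is numeric or a string column.
constant_cols = [col for col in toy.columns if toy[col].nunique() <= 1]
print(constant_cols)

# Drop the constant columns in one call.
reduced = toy.drop(columns=constant_cols)
print(reduced.columns.tolist())
```

The same list can then be applied to both the training and the test frame so that their columns stay aligned.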

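After the variance-based selection, we also check the correlations between features. When two features are very strongly correlated, that is, their Pearson coefficient is above 0.8, keeping one of them is enough.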

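# Inspect correlations between variables
import matplotlib.pyplot as plt
import seaborn as sns
# corr() computes pairwise Pearson coefficients; restrict it to numeric columns,
# since recent pandas versions raise an error on string columns
corr = train.select_dtypes(include='number').corr()
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)
plt.show()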

[Figure: heatmap of the Pearson correlation matrix produced by the code above]

From the figure, JobLevel, MonthlyIncome and TotalWorkingYears are strongly correlated with one another, and so are PercentSalaryHike and PerformanceRating. We therefore drop JobLevel, TotalWorkingYears and PerformanceRating.

# Based on the Pearson heatmap: a coefficient above 0.8 means two variables have a clear linear relationship, so only one is kept. Here we drop JobLevel, TotalWorkingYears and PerformanceRating.
train.drop(['JobLevel','TotalWorkingYears','PerformanceRating'],axis=1,inplace=True)
test.drop(['JobLevel','TotalWorkingYears','PerformanceRating'],axis=1,inplace=True)
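Reading pairs off a heatmap is easy to get wrong, so it can help to list the pairs above the threshold explicitly. The following is a small sketch on synthetic data (the columns `a`, `b`, `c` are invented for the example); it keeps only the upper triangle of the correlation matrix so that each pair appears once.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is almost a linear function of "a", "c" is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2.0 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

# Absolute Pearson coefficients between all numeric columns.
corr = toy.corr().abs()

# Mask everything except the strict upper triangle (no diagonal, no mirrored pairs).
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) -> coefficient and keep the strong pairs.
pairs = upper.stack()
high_pairs = pairs[pairs > 0.8]
print(high_pairs)
```

On the real data the same steps would be applied to the numeric columns of `train`; which member of each pair to drop remains a judgment call.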

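That completes the feature selection stage. Some of the remaining features are non-numeric, and non-numeric values cannot take part in the computation, so we use LabelEncoder to convert them into numeric values.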

 

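from sklearn.preprocessing import LabelEncoder
# Columns to encode (Age and Education are already numeric but are kept in the list as in the original run)
attr = ['Age', 'BusinessTravel', 'Department', 'Education', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'OverTime']
lbe_list = []  # keep the fitted encoders so the mapping can be inspected or inverted later
for feature in attr:
    lbe = LabelEncoder()
    # Fit on the training set only, then reuse the same mapping on the test set
    # (transform raises ValueError if the test set contains a value not seen in training)
    train[feature] = lbe.fit_transform(train[feature])
    test[feature] = lbe.transform(test[feature])
    lbe_list.append(lbe)
# Encode the Attrition target: 'Yes' -> 1, anything else -> 0
train['Attrition'] = train['Attrition'].map(lambda x: 1 if x == 'Yes' else 0)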
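To make it concrete what LabelEncoder does to a column, here is a minimal sketch on a hand-made series. The category values are illustrative. The encoder sorts the distinct values and numbers them from 0, so the resulting integers follow alphabetical order rather than any business meaning.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative values only; not read from the real files.
train_col = pd.Series(["Travel_Rarely", "Non-Travel", "Travel_Frequently", "Travel_Rarely"])
test_col = pd.Series(["Non-Travel", "Travel_Rarely"])

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_col)  # learn the mapping on the training values
test_codes = encoder.transform(test_col)        # apply the same mapping to the test values

print(list(encoder.classes_))   # sorted distinct categories
print(train_codes.tolist())
print(test_codes.tolist())

# The mapping can be reversed with the stored encoder.
print(encoder.inverse_transform([0, 1, 2]).tolist())
```

Because a linear model reads these integers as ordered magnitudes, one-hot encoding (for example with `pd.get_dummies`) is a common alternative for categories that have no natural order. That would be a change from the approach used in this post, not something it does.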

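Once the conversion is done, I strongly recommend exporting the processed features to a file and looking them over to see whether anything was handled poorly. It feels like a good habit.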

# Export the processed features to inspect the result
train.to_csv('train_label_encoder.csv')

Finally comes the modeling and training stage. Here we use a logistic regression (LR) model.

# Modeling: logistic regression is used here
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train.drop('Attrition',axis=1), train['Attrition'], test_size=0.2, random_state=42)

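This is also the part I am not entirely happy with. At first I trained with the default constructor, but it complained that the solver had not converged, so I set the maximum number of iterations and the stopping tolerance. Raising the iteration limit did get it to converge, but I don't think the predictions are very good. I would appreciate any advice on what I have overlooked. Thanks!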

lr = LogisticRegression(max_iter=10000, 
                           verbose=True, 
                           random_state=33,
                           tol=1e-4)
lr.fit(X_train,y_train)
print('lr train score:{:.3f}'.format(lr.score(X_train,y_train)))
print('lr test score:{:.3f}'.format(lr.score(X_test,y_test)))    



[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
lr train score:0.869
lr test score:0.818
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.7s finished
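One thing that may be worth trying for the convergence problem is standardizing the features before fitting, since columns such as MonthlyRate (in the thousands) and JobInvolvement (1 to 4) sit on very different scales. The sketch below shows the idea on synthetic data generated with `make_classification`; it is not run on the real dataset, and the stretching factors are invented to imitate the scale mismatch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the attrition data (not the real dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Stretch some columns so the features live on very different scales.
X = X * np.array([1, 10, 100, 1000, 10000, 1, 1, 1, 1, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is fitted on the training split only and reapplied at predict time.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

n_iter = int(model.named_steps["logisticregression"].n_iter_[0])
test_acc = model.score(X_te, y_te)
print("iterations used:", n_iter)
print("test accuracy: {:.3f}".format(test_acc))
```

Putting the scaler inside a pipeline keeps the test split out of the scaling statistics. Standardized inputs also put the coefficients on a comparable scale, which matters when ranking them by magnitude. Since `score` reports accuracy, it may also be worth comparing it with the share of the majority class in `y_test`.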

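Finally, for interpretability, we pick the 5 features with the largest absolute coefficients to see which factors influence attrition the most.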

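# Look at the 5 variables with the largest absolute coefficients. In the output below these are OverTime, MaritalStatus, JobInvolvement, Department and Gender.
import numpy as np
# lr is the fitted LogisticRegression from above; coef_ has shape (1, n_features)
df_coef = pd.DataFrame(index=X_train.columns, data=np.transpose(lr.coef_))
df_coef['abs'] = df_coef.iloc[:, 0].abs()
df_coef = df_coef.sort_values(by='abs', ascending=False)
print(df_coef.head())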



                       0       abs
OverTime        1.087580  1.087580
MaritalStatus   0.615751  0.615751
JobInvolvement -0.390932  0.390932
Department      0.338802  0.338802
Gender          0.311672  0.311672