In IT companies, several employee architectures are in use. Some IT companies, or particular departments or levels within them, follow the chief programmer structure, in which a staff is organized around a "chief" position assigned to the engineer who understands the system requirements best.
Chief Programmer Architecture
Others follow a democratic structure in which all engineers sit at the same level and are assigned to different jobs such as front-end design, back-end coding, software testing, and so on. This is not the architecture used by some of the well-known large companies, but on the whole it is a very successful and effective architecture for a stable environment.
Democratic Programmer Architecture
The third kind of architecture is the mixed structure, a combination of the two types above. It is the most widely followed architecture and is very common in large companies.
Mixed Control Architecture
Similarly, IBM presumably follows one of these, most likely the mixed structure. An important responsibility of its Human Resources department is therefore to measure the rate of employee attrition over a given time interval. The factors on which employee attrition depends include:
- Age
- Income
- Overtime
- Monthly expenditure
- Distance from home
- Years at the company

and so on…
IBM has also made its employee data public, along with a problem statement:
"Predict the attrition of an employee, i.e., whether the employee will leave the company, given the details of the employee, and thereby identify the causes of employee attrition."
(If you need the employee dataset, feel free to message me privately.)
The solution is to apply machine learning: develop a predictive model, train it on the data, and then validate it through performance analysis.
Below is the step-by-step process of developing the machine learning model using Python and the Scikit-Learn machine learning toolbox:
1. Model Development:
# importing all the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import pylab as pl
from sklearn.metrics import roc_curve, auc

# loading the dataset using Pandas
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
df.head() # Output shown below
Pandas DataFrame output for the dataset
# checking whether the dataset contains any missing values...
df.shape == df.dropna().shape # Output shown below
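The shape comparison above only reveals whether any row contains a missing value. As a small aside (not part of the original walkthrough), Pandas can also report missing values per column, which is handy when a dataset turns out to be incomplete:

# Per-column count of missing values;
# all zeros is consistent with the shape check above
df.isnull().sum()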
This is a binary classification problem, so the distribution of instances across the two classes is visualized below:
y_bar = np.array([df[df['Attrition'] == 'No'].shape[0],
                  df[df['Attrition'] == 'Yes'].shape[0]])
x_bar = ['No (0)', 'Yes (1)']

# Bar Visualization
plt.bar(x_bar, y_bar)
plt.xlabel('Labels/Classes')
plt.ylabel('Number of Instances')
plt.title('Distribution of Labels/Classes in the Dataset')
# Output shown below
Bar visualization of the label/class distribution
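For a numeric view of the same distribution, an optional one-liner (not in the original code) makes the class imbalance explicit; this imbalance matters later when interpreting accuracy:

# Exact counts and proportions per class
print(df['Attrition'].value_counts())
print(df['Attrition'].value_counts(normalize=True))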
# Label Encoding for Categorical/Non-Numeric Data
X = df.iloc[:, [0] + list(range(2, 35))].values
y = df.iloc[:, 1].values
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
X[:, 3] = labelencoder_X_1.fit_transform(X[:, 3])
X[:, 6] = labelencoder_X_1.fit_transform(X[:, 6])
X[:, 10] = labelencoder_X_1.fit_transform(X[:, 10])
X[:, 14] = labelencoder_X_1.fit_transform(X[:, 14])
X[:, 16] = labelencoder_X_1.fit_transform(X[:, 16])
X[:, 20] = labelencoder_X_1.fit_transform(X[:, 20])
X[:, 21] = labelencoder_X_1.fit_transform(X[:, 21])
y = labelencoder_X_1.fit_transform(y)

# Feature Selection using Random Forest Classifier's Feature
# Importance Scores
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X, y)
# Output shown below
list_importances = list(model.feature_importances_)
indices = sorted(range(len(list_importances)),
                 key=lambda k: list_importances[k])
feature_selected = [None] * 34
k = 0
for i in reversed(indices):
    if k <= 33:
        feature_selected[k] = i
        k = k + 1
X_selected = X[:, feature_selected[:17]]
l_features = feature_selected
i = 0
for x in feature_selected:
    # map each X-column index back to its name in df
    # (X skips df column 1, 'Attrition', so indices >= 1 shift by one)
    l_features[i] = df.columns[x if x == 0 else x + 1]
    i = i + 1
l_features = np.array(l_features)

# Extracting the 17 most important features among the 34 features
l_features[:17] # Output shown below
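As a sketch of a more idiomatic alternative to the index bookkeeping above: since X was built from every column of df except 'Attrition', in their original order, the importance scores can be wrapped in a named Pandas Series and sorted directly:

# Rank features by importance with a named Series;
# X holds every df column except 'Attrition', in order
feature_names = df.drop('Attrition', axis=1).columns
importances = pd.Series(model.feature_importances_, index=feature_names)
importances.sort_values(ascending=False).head(17)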
# Selecting the 17 most important features
df_selected = pd.DataFrame(X_selected,
                           columns=['Age', 'MonthlyIncome', 'OverTime',
                                    'EmployeeNumber', 'MonthlyRate',
                                    'DistanceFromHome', 'YearsAtCompany',
                                    'TotalWorkingYears', 'DailyRate',
                                    'HourlyRate', 'NumCompaniesWorked',
                                    'JobInvolvement', 'PercentSalaryHike',
                                    'StockOptionLevel', 'YearsWithCurrManager',
                                    'EnvironmentSatisfaction', 'EducationField'])
# appending the label as column 17 so it can be indexed below
df_selected['Attrition'] = df['Attrition'].values
df_selected.head() # Output shown below
Label encoding must also be applied to the selected categorical features:
# Label Encoding for selected Non-Numeric Features:
X = df_selected.iloc[:, list(range(0, 17))].values
y = df_selected.iloc[:, 17].values
X[:, 2] = labelencoder_X_1.fit_transform(X[:, 2])
X[:, 16] = labelencoder_X_1.fit_transform(X[:, 16])
y = labelencoder_X_1.fit_transform(y)
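One caveat worth flagging: LabelEncoder imposes an arbitrary numeric ordering on nominal categories such as EducationField. A hedged alternative, not used in this walkthrough, is one-hot encoding of the raw columns via pd.get_dummies, which avoids that implied ordering:

# One-hot encode the nominal columns from the original df;
# drop_first removes one redundant dummy per feature
onehot = pd.get_dummies(df[['OverTime', 'EducationField']], drop_first=True)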
Data preprocessing is now complete. Moving on to model training:
# 80-20 splitting where 80% Data is for Training the Model
# and 20% Data is for Validation and Performance Analysis
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1753)

# Using Logistic Regression Algorithm for Model Training
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(verbose=3)

# Training the Model
clf_trained = clf.fit(X_train, y_train)
# Output shown below
The liblinear library used by Logistic Regression for parameter optimization
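As an optional sanity check before the performance analysis (not part of the original pipeline), k-fold cross-validation would give an accuracy estimate that depends less on one particular 80-20 split:

# 5-fold cross-validated accuracy of the same model family
from sklearn.model_selection import cross_val_score
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean(), scores.std())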
2. Model Performance Analysis:
=> Training Accuracy
clf_trained.score(X_train,y_train) # Output shown below
Training accuracy of the model: 84.44%
=> Validation Accuracy
clf_trained.score(X_test,y_test) # Output shown below
Validation accuracy of the model: 89.12%
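That 89.12% is best read against a baseline: the classes are imbalanced, so a trivial model that always predicts 'No' already scores well. Using the validation-set composition from the confusion matrix shown further below (253 actual 'No' out of 294 samples):

# Majority-class baseline accuracy on the validation set
baseline = (252 + 1) / (252 + 1 + 31 + 10)   # ~0.861, vs. 0.8912 for the model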
=> Precision, Recall and F1-Score
# getting the predictions...
from sklearn.metrics import classification_report
predictions = clf_trained.predict(X_test)
print(classification_report(y_test, predictions))
Classification report of the model
=> Confusion Matrix
# MODULE FOR CONFUSION MATRIX
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import itertools
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]),
                                  range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Generating the Confusion Matrix
# (raw counts on the validation set: [[252, 1], [31, 10]])
plt.figure()
plot_confusion_matrix(confusion_matrix(y_test, predictions),
                      classes=[0, 1], normalize=True,
                      title='Normalized Confusion Matrix')
# Output shown below
Normalized Confusion Matrix
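To connect this matrix to the classification report, the per-class metrics for the minority 'Yes' class can be worked out directly from the raw counts noted in the code above:

# Worked example from the raw confusion matrix [[252, 1], [31, 10]]
# (rows = true labels, columns = predictions; class 1 = 'Yes')
TN, FP, FN, TP = 252, 1, 31, 10
precision_yes = TP / (TP + FP)   # 10 / 11 ~ 0.91
recall_yes = TP / (TP + FN)      # 10 / 41 ~ 0.24
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)  # ~0.38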
=> Receiver Operating Characteristic (ROC) Curve:
# Plotting the ROC Curve
y_roc = np.array(y_test)
fpr, tpr, thresholds = roc_curve(y_roc, clf_trained.decision_function(X_test))
roc_auc = auc(fpr, tpr)
pl.clf()
pl.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
pl.plot([0, 1], [0, 1], 'k--')
pl.xlim([0.0, 1.0])
pl.ylim([0.0, 1.0])
pl.xlabel('False Positive Rate')
pl.ylabel('True Positive Rate')
pl.legend(loc="lower right")
pl.show()
# Output shown below
Receiver Operating Characteristic (ROC) Curve
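A side note on the ROC input: for logistic regression, decision_function scores and predicted probabilities induce the same ranking of samples, so an equivalent curve (and the same AUC) could be drawn from probabilities instead:

# Equivalent ROC input: probability of the positive class
probs = clf_trained.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)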
The performance analysis shows that the machine learning model correctly classifies 89.12% of the unseen (validation set) samples, though some of the per-class metrics, notably recall on the minority 'Yes' class, come out on the low side.
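One plausible next step to lift that weak minority-class recall, sketched here rather than taken from the original experiment, is to reweight the classes during training; scikit-learn supports this directly:

# Penalize mistakes on the rare 'Yes' class more heavily;
# this typically trades some overall accuracy for better recall
clf_balanced = LogisticRegression(class_weight='balanced')
clf_balanced.fit(X_train, y_train)
print(classification_report(y_test, clf_balanced.predict(X_test)))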
In this way, an employee attrition prediction model can be built using data analysis and machine learning.