Customer Churn Data Analysis with Python

Reposted

mob64ca14085c24 2024-07-09 20:07:55


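I. Background

A certain telecom company is an industry giant; reducing its customer churn rate by 5% would raise profits by 25% to 85%. As the market saturates, persistently high customer acquisition costs have put a ceiling on growth and made new customers hard to win. Increasing customer stickiness and extending the customer lifecycle have become urgent problems for the company.
Data source: https://www.kaggle.com/blastchar/telco-customer-churn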

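II. Objectives

1. Analyze the characteristics of churned users and generate labels for churn-prone users;
2. Predict how the user retention rate changes over time and propose reasonable win-back recommendations.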

III. Approach

[Figure: analysis approach diagram]

Tools: Tableau, MySQL, Python, Excel

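IV. Visual analysis of user churn

1. User attribute features

The basic user attributes are gender (gender), age (SeniorCitizen; 1: senior, 0: not senior), partner (Partner), and dependents (Dependents). The churn rate for each attribute is shown in the figure below:

[Figure: churn rate by user attribute (gender, senior status, partner, dependents)]

The figure shows that senior users, users with a partner, and users with dependents churn at a noticeably higher rate, while gender has little effect on churn.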
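The churn rates behind a chart like this can be reproduced with a simple group-by. The following is a minimal sketch on a tiny made-up table (not the Kaggle data); the helper name `churn_rate_by` is hypothetical, and it assumes Churn is stored as 'Yes'/'No' as in the raw file.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT values from the real dataset.
toy = pd.DataFrame({
    "SeniorCitizen": [1, 1, 0, 0, 0, 0],
    "Partner": ["Yes", "No", "Yes", "No", "No", "Yes"],
    "Churn": ["Yes", "Yes", "No", "No", "Yes", "No"],
})

def churn_rate_by(df, column):
    """Share of churned customers within each value of `column` (hypothetical helper)."""
    churned = df["Churn"].eq("Yes")           # boolean Series: True where the user churned
    return churned.groupby(df[column]).mean()  # mean of booleans = churn rate per group

senior_rates = churn_rate_by(toy, "SeniorCitizen")
partner_rates = churn_rate_by(toy, "Partner")
print(senior_rates)
print(partner_rates)
```

Running the same helper over each attribute column of the real data gives the numbers that a tool such as Tableau would plot.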

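2. User service attributes

The service attributes are phone service (PhoneService), multiple lines (MultipleLines), internet service (InternetService), and online security service (OnlineSecurity). The churn rate for each attribute is shown in the figure below:

[Figure: churn rate by service attribute]

The figure shows that churn is highest for customers with Fiber optic internet service and no online security service, followed by customers with DSL internet service and online security; customers with neither internet service nor online security churn the least.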

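3. User transaction attributes

The transaction attributes are contract term (Contract), payment method (PaymentMethod), monthly charges (MonthlyCharges), and total charges (TotalCharges). The churn rate for each attribute is shown in the figures below:

[Figure: churn rate by transaction attribute, part 1]

[Figure: churn rate by transaction attribute, part 2]

[Figure: churn rate by transaction attribute, part 3]

The figures above show that churn is highest among customers with a Month-to-month contract, the Electronic check payment method, monthly charges between 70 and 100, and total charges under 300.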

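4. Summary

Users with the following characteristics are the most likely to churn:
1) Senior users, users with a partner, users with dependents;
2) Fiber optic internet service, no online security service;
3) Month-to-month contract, Electronic check payment method, monthly charges between 70 and 100, total charges under 300.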

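V. Generating churn-risk level labels

1. Quantifying churn risk coefficients

The more strongly an attribute value affects churn, the higher its churn risk coefficient. The breakdown is as follows:

[Figure: table of churn risk coefficients per attribute]

Use MySQL to select the customers who have not churned and compute their risk coefficients: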

-- Score each risk attribute for customers who have not churned
SELECT
    customerID,
    IF(SeniorCitizen = 1, 2, 0) AS senior,
    IF(Partner = 'Yes', 2, 0) AS partner,
    IF(Dependents = 'Yes', 2, 0) AS dependents,
    CASE
        WHEN InternetService = 'Fiber optic' THEN 2
        WHEN InternetService = 'DSL' THEN 1
        ELSE 0
    END AS internetservice,
    CASE
        WHEN OnlineSecurity = 'No' THEN 2
        WHEN OnlineSecurity = 'Yes' THEN 1
        ELSE 0
    END AS onlinesecurity,
    CASE
        WHEN Contract = 'Month-to-month' THEN 2
        WHEN Contract = 'One year' THEN 1
        ELSE 0
    END AS contract,
    CASE
        WHEN PaymentMethod = 'Electronic check' THEN 2
        ELSE 0
    END AS paymentMethod,
    IF(MonthlyCharges >= 70 AND MonthlyCharges <= 100, 1, 0) AS monthlycharges,
    IF(TotalCharges < 300, 1, 0) AS totalcharges
FROM ha.wa_fn
WHERE Churn = 'No';  -- keep only customers who have not churned

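The query result is as follows:

[Figure: query result with per-attribute risk coefficients]

2. Summing the risk coefficients into a final churn risk level

Import the query result into Excel and sum the coefficients to obtain the final churn risk level (churn_level):

[Figure: churn_level computed in Excel]

The distribution of churn risk levels is as follows:

[Figure: distribution of churn risk levels]

The operations team can then use the churn risk level to manage customers in tiers.

3. Adding a high-churn-risk label

For example, customers with a risk level above 9 are defined as high churn risk:

[Figure: customers labeled as high churn risk]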
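The scoring, summing, and labeling steps above can also be sketched in plain Python. This is an illustrative version that mirrors the SQL rules and the "above 9" threshold from the text; the function names and the two sample customers are hypothetical, not part of the original workflow.

```python
HIGH_RISK_THRESHOLD = 9  # "risk level above 9" as stated in the text

def churn_risk_scores(c):
    """Per-attribute risk coefficients for one customer dict, mirroring the SQL rules."""
    return {
        "senior": 2 if c["SeniorCitizen"] == 1 else 0,
        "partner": 2 if c["Partner"] == "Yes" else 0,
        "dependents": 2 if c["Dependents"] == "Yes" else 0,
        "internetservice": {"Fiber optic": 2, "DSL": 1}.get(c["InternetService"], 0),
        "onlinesecurity": {"No": 2, "Yes": 1}.get(c["OnlineSecurity"], 0),
        "contract": {"Month-to-month": 2, "One year": 1}.get(c["Contract"], 0),
        "paymentmethod": 2 if c["PaymentMethod"] == "Electronic check" else 0,
        "monthlycharges": 1 if 70 <= c["MonthlyCharges"] <= 100 else 0,
        "totalcharges": 1 if c["TotalCharges"] < 300 else 0,
    }

def churn_level(c):
    """Sum of all risk coefficients (the Excel step in the text)."""
    return sum(churn_risk_scores(c).values())

def is_high_risk(c):
    return churn_level(c) > HIGH_RISK_THRESHOLD

# Two made-up customers for illustration.
risky = {"SeniorCitizen": 0, "Partner": "No", "Dependents": "No",
         "InternetService": "Fiber optic", "OnlineSecurity": "No",
         "Contract": "Month-to-month", "PaymentMethod": "Electronic check",
         "MonthlyCharges": 70.70, "TotalCharges": 151.65}
stable = {"SeniorCitizen": 0, "Partner": "No", "Dependents": "No",
          "InternetService": "DSL", "OnlineSecurity": "Yes",
          "Contract": "Two year", "PaymentMethod": "Mailed check",
          "MonthlyCharges": 50.0, "TotalCharges": 2000.0}

print(churn_level(risky), is_high_risk(risky))
print(churn_level(stable), is_high_risk(stable))
```

With these rules the maximum possible level is 16, so a threshold of 9 flags customers who hit most of the high-risk attribute values.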

VI. Predicting user churn with survival analysis

1. Importing modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss
from sklearn.calibration import calibration_curve
# Survival analysis: Kaplan-Meier, Nelson-Aalen, and Cox proportional hazards
from lifelines import NelsonAalenFitter, CoxPHFitter, KaplanMeierFitter
from lifelines.statistics import logrank_test

# Initial matplotlib and pandas settings
plt.rcParams['font.sans-serif'] = ['SimHei']  # SimHei font so that Chinese labels render
plt.rcParams['axes.unicode_minus'] = False    # display minus signs correctly
pd.set_option('display.max_columns', 30)
plt.rcParams.update({"font.family": "SimHei", "font.size": 14})
plt.style.use("tableau-colorblind10")

pd.set_option('display.float_format', lambda x: '%.2f' % x)  # disable scientific notation in pandas
%matplotlib inline

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
data_backup = data.copy()
data.head()

	customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	0	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
1	5575-GNVDE	Male	0	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.50	No
2	3668-QPYBK	Male	0	No	No	2	Yes	No	DSL	Yes	Yes	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes
3	7795-CFOCW	Male	0	No	No	45	No	No phone service	DSL	Yes	No	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No
4	9237-HQITU	Female	0	No	No	2	Yes	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes

2. Data preprocessing

# Missing values
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

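# TotalCharges contains blank strings, so info() reports no nulls but a plain numeric
# conversion raises an error; errors='coerce' converts the blanks to NaN instead
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
data['TotalCharges'].dtype

dtype('float64')

# Confirm that missing values are indeed present
data.TotalCharges.isnull().sum()

# Drop the rows with missing TotalCharges
data.dropna(subset=['TotalCharges'], inplace=True)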
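To see why `info()` shows no nulls while the conversion still produces missing values, here is a minimal sketch on a made-up Series. The values are invented for illustration; the single-space string stands in for a blank cell.

```python
import pandas as pd

# Made-up values; " " stands in for a blank TotalCharges cell.
raw = pd.Series(["29.85", " ", "1889.5"])

nulls_before = int(raw.isnull().sum())         # the blank is a string, so it is not counted as null
numeric = pd.to_numeric(raw, errors="coerce")  # unparseable strings become NaN
nulls_after = int(numeric.isnull().sum())

print(nulls_before, nulls_after, numeric.dtype)
```

Dropping the resulting NaN rows is one option; whether to drop or impute them is a modeling choice that the article resolves by dropping.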

# Duplicates
data.duplicated('customerID').sum()

# Outliers
data.describe().T

	count	mean	std	min	25%	50%	75%	max
SeniorCitizen	7032.00	0.16	0.37	0.00	0.00	0.00	0.00	1.00
tenure	7032.00	32.42	24.55	1.00	9.00	29.00	55.00	72.00
MonthlyCharges	7032.00	64.80	30.09	18.25	35.59	70.35	89.86	118.75
TotalCharges	7032.00	2283.30	2266.77	18.80	401.45	1397.47	3794.74	8684.80

data.describe(include='object').T

	count	unique	top	freq
customerID	7032	7032	7590-VHVEG	1
gender	7032	2	Male	3549
Partner	7032	2	No	3639
Dependents	7032	2	No	4933
PhoneService	7032	2	Yes	6352
MultipleLines	7032	3	No	3385
InternetService	7032	3	Fiber optic	3096
OnlineSecurity	7032	3	No	3497
OnlineBackup	7032	3	No	3087
DeviceProtection	7032	3	No	3094
TechSupport	7032	3	No	3472
StreamingTV	7032	3	No	2809
StreamingMovies	7032	3	No	2781
Contract	7032	3	Month-to-month	3875
PaperlessBilling	7032	2	Yes	4168
PaymentMethod	7032	4	Electronic check	2365
Churn	7032	2	No	5163

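3. Encoding categorical data

To feed the data into the model, the categorical columns must be converted to numbers. This is done here with sklearn's LabelEncoder, which assigns each category an integer code (label encoding, not one-hot encoding).

# Convert the categorical columns to integer codes
cat_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
            'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
            'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
            'PaperlessBilling', 'PaymentMethod', 'Churn']
lec = LabelEncoder()
# fit_transform is applied column by column; the name avoids shadowing the built-in list
data[cat_cols] = data[cat_cols].apply(lec.fit_transform)
# Churn: No -> 0, Yes -> 1
data.head()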

	customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	0	0	1	0	1	0	1	0	0	2	0	0	0	0	0	1	2	29.85	29.85	0
1	5575-GNVDE	1	0	0	0	34	1	0	0	2	0	2	0	0	0	1	0	3	56.95	1889.50	0
2	3668-QPYBK	1	0	0	0	2	1	0	0	2	2	0	0	0	0	0	1	3	53.85	108.15	1
3	7795-CFOCW	1	0	0	0	45	0	1	0	2	0	2	2	0	0	1	0	0	42.30	1840.75	0
4	9237-HQITU	0	0	0	0	2	1	0	1	0	0	0	0	0	0	0	1	2	70.70	151.65	1
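Label encoding assigns integer codes in sorted label order, which is why Electronic check appears as 2 and Mailed check as 3 in the table above. The plain-Python sketch below mimics that behavior and contrasts it with one-hot encoding; both helper functions are hypothetical illustrations, not sklearn APIs.

```python
def label_encode(values):
    """Map each distinct label to an integer by sorted order (mimics LabelEncoder's coding)."""
    classes = sorted(set(values))
    index = {label: i for i, label in enumerate(classes)}
    return [index[v] for v in values], classes

def one_hot_encode(values):
    """One 0/1 indicator per distinct label, so no artificial ordering is implied."""
    classes = sorted(set(values))
    return [[int(v == c) for c in classes] for v in values], classes

payment = ["Electronic check", "Mailed check",
           "Bank transfer (automatic)", "Credit card (automatic)"]

codes, classes = label_encode(payment)
indicators, _ = one_hot_encode(payment)
print(classes)
print(codes)
print(indicators)
```

Integer codes imply an ordering that a nominal column such as PaymentMethod does not have, so a model reads its coefficient as an effect "per code step". One-hot indicators avoid that at the cost of more columns.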

4. Correlation analysis

5. Analyzing the retention rate with the Kaplan-Meier (KM) model

plt.figure(dpi=800)
kmf = KaplanMeierFitter()
kmf.fit(data['tenure'], event_observed=data['Churn'])
kmf.plot()
plt.title('Retain probability')

[Figure: Kaplan-Meier retention probability curve]
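The Kaplan-Meier curve is the product-limit estimate: at each time with an observed churn, the running survival probability is multiplied by (1 - churned / at risk). The sketch below computes it by hand on made-up durations to show what `KaplanMeierFitter` estimates; the function is illustrative only and is not the lifelines implementation.

```python
def kaplan_meier(durations, events):
    """Product-limit estimate as [(t, S(t))] at each time with an observed event.

    events[i] is 1 if churn was observed at durations[i], 0 if the user is censored
    (still a customer when last observed).
    """
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        at_risk = sum(1 for d in durations if d >= t)  # users still active just before t
        churned = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        if churned:
            survival *= 1 - churned / at_risk
            curve.append((t, survival))
    return curve

# Made-up tenures (months) and churn flags.
tenure = [1, 2, 2, 3, 4, 5]
churn = [1, 1, 0, 1, 0, 1]

km_curve = kaplan_meier(tenure, churn)
for t, s in km_curve:
    print(t, round(s, 4))
```

Censored users (churn flag 0) never trigger a drop in the curve, but they still count in the at-risk denominator until their last observed month.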

6. Predicting churn trends with the Cox proportional hazards regression model

# Split into training and test sets
train_data, test_data = train_test_split(data, test_size=0.2)
print([column for column in train_data])

['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']

Build the Cox proportional hazards model

formula ='gender+SeniorCitizen+Partner+Dependents+PhoneService+ \
MultipleLines+InternetService+OnlineSecurity+OnlineBackup+ \
DeviceProtection+TechSupport+StreamingTV+StreamingMovies+ \
Contract+PaperlessBilling+PaymentMethod+MonthlyCharges+TotalCharges'     
model = CoxPHFitter(penalizer=0.01, l1_ratio=1)
model = model.fit(train_data.drop("customerID",axis=1), 'tenure', event_col='Churn',formula=formula)
model.print_summary()

model	lifelines.CoxPHFitter
duration col	'tenure'
event col	'Churn'
penalizer	0.01
l1 ratio	1
baseline estimation	breslow
number of observations	5625
number of events observed	1493
partial log-likelihood	-10197.70
time fit was run	2022-10-28 01:18:04 UTC

	coef	exp(coef)	se(coef)	coef lower 95%	coef upper 95%	exp(coef) lower 95%	exp(coef) upper 95%	cmp to	z	p	-log2(p)
Contract	-1.49	0.23	0.08	-1.64	-1.33	0.19	0.26	0.00	-18.76	<0.005	258.51
Dependents	-0.10	0.91	0.08	-0.25	0.05	0.78	1.05	0.00	-1.27	0.21	2.28
DeviceProtection	-0.04	0.96	0.03	-0.10	0.01	0.90	1.01	0.00	-1.47	0.14	2.82
InternetService	-0.08	0.92	0.05	-0.18	0.02	0.83	1.02	0.00	-1.63	0.10	3.28
MonthlyCharges	0.04	1.05	0.00	0.04	0.05	1.04	1.05	0.00	23.38	<0.005	399.04
MultipleLines	-0.00	1.00	0.00	-0.00	0.00	1.00	1.00	0.00	-0.00	1.00	0.00
OnlineBackup	-0.09	0.91	0.03	-0.15	-0.04	0.86	0.97	0.00	-3.14	<0.005	9.20
OnlineSecurity	-0.19	0.83	0.04	-0.26	-0.12	0.77	0.89	0.00	-5.16	<0.005	21.97
PaperlessBilling	0.09	1.09	0.06	-0.04	0.21	0.96	1.23	0.00	1.38	0.17	2.57
Partner	-0.15	0.86	0.06	-0.27	-0.02	0.77	0.98	0.00	-2.36	0.02	5.79
PaymentMethod	0.16	1.18	0.03	0.10	0.22	1.11	1.25	0.00	5.49	<0.005	24.58
PhoneService	-0.00	1.00	0.00	-0.00	0.00	1.00	1.00	0.00	-0.00	1.00	0.00
SeniorCitizen	0.02	1.02	0.06	-0.11	0.14	0.90	1.15	0.00	0.29	0.77	0.37
StreamingMovies	-0.02	0.98	0.03	-0.08	0.04	0.92	1.04	0.00	-0.66	0.51	0.97
StreamingTV	-0.02	0.98	0.03	-0.08	0.04	0.92	1.04	0.00	-0.60	0.55	0.87
TechSupport	-0.13	0.88	0.04	-0.20	-0.06	0.82	0.95	0.00	-3.51	<0.005	11.14
TotalCharges	-0.00	1.00	0.00	-0.00	-0.00	1.00	1.00	0.00	-32.61	<0.005	772.51
gender	-0.00	1.00	0.00	-0.00	0.00	1.00	1.00	0.00	-0.00	1.00	0.00

Concordance	0.93
Partial AIC	20431.41
log-likelihood ratio test	3935.73 on 18 df
-log2(p) of ll-ratio test	inf

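The concordance index (Concordance) is 0.93. print_summary() reports this value on the data the model was fitted on (the training set here), and it indicates that the model ranks customers by churn risk very well.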
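The concordance index measures ranking quality: among comparable pairs of users, it is the fraction where the user with the higher predicted risk churns earlier. The sketch below is a simplified version of Harrell's C on made-up data; it ignores tied durations and is not the lifelines implementation.

```python
def concordance_index(durations, events, risk_scores):
    """Simplified Harrell's C: higher risk should go with shorter observed duration."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i has an observed event strictly before j's time.
            if events[i] == 1 and durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    return concordant / comparable

# Made-up example: tenures, churn flags, and predicted risk scores.
tenure = [2, 5, 8, 10]
churn = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]

c_index = concordance_index(tenure, churn, risk)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the measure says nothing by itself about how well calibrated the predicted probabilities are.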

7. Evaluating predictive performance

Concordance index

plt.figure(figsize = (6,10),dpi=600)
model.plot(hazard_ratios=True)
plt.xlabel('Hazard Ratios (95% CI)')
plt.title('Hazard Ratios')

[Figure: hazard ratios with 95% confidence intervals]
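A hazard ratio is simply exp(coef). As a short worked example, the snippet below recomputes two hazard ratios from coefficients copied from the summary table above and expresses them as a percentage change in hazard per one-unit increase of the encoded variable.

```python
import math

# Coefficients copied from the model summary above (rounded to two decimals there).
coefs = {"Contract": -1.49, "OnlineSecurity": -0.19}

hazard_ratios = {name: math.exp(b) for name, b in coefs.items()}
for name, hr in hazard_ratios.items():
    change = 100 * (hr - 1)  # percentage change in hazard per one-unit increase
    print(f"{name}: hazard ratio {hr:.2f}, hazard changes by {change:+.0f}% per unit")
```

A ratio below 1 means the hazard of churning falls as the encoded value rises. Because the categorical columns were label-encoded, "one unit" here means one step in the integer code, for example from Month-to-month (0) to One year (1) for Contract.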

Brier score

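# Predict the survival curves once, then score each month
surv = model.predict_survival_function(test_data)
loss_dict = {}
for i in range(1, 72):
    # 1 - S(i) is the predicted probability of having churned by month i
    score = brier_score_loss(test_data['Churn'], 1 - np.array(surv.loc[i]), pos_label=1)
    loss_dict[i] = [score]

loss_df = pd.DataFrame(loss_dict).T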

fig, ax = plt.subplots(dpi=600) 
ax.plot(loss_df.index, loss_df) 
ax.set(xlabel='Prediction Time', ylabel='Calibration Loss', title='Cox PH Model Calibration Loss / Time') 
plt.show()

[Figure: Cox PH model calibration loss over prediction time]

The plot shows that the model predicts user churn well within the first 40 months.
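The Brier score is the mean squared difference between the predicted churn probability and the actual 0/1 outcome, so lower is better. A minimal hand-computed sketch on made-up numbers:

```python
def brier_score(outcomes, predicted_probs):
    """Mean squared error between 0/1 outcomes and predicted probabilities."""
    return sum((p - y) ** 2 for y, p in zip(outcomes, predicted_probs)) / len(outcomes)

# Made-up outcomes (1 = churned) and predicted churn probabilities.
churned = [1, 0, 0, 1]
predicted = [0.9, 0.2, 0.4, 0.6]

model_score = brier_score(churned, predicted)
baseline_score = brier_score(churned, [0.5] * len(churned))  # always predicting 0.5
print(model_score, baseline_score)
```

Always predicting 0.5 scores 0.25 regardless of the outcomes, which makes 0.25 a convenient reference line when reading the loss curve.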

Calibration curve

plt.figure(figsize=(10, 10),dpi=600)
 
ax = plt.subplot2grid((3, 1), (0, 0), rowspan=2) 
ax.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")

probs = 1-np.array(model.predict_survival_function(test_data).loc[7])

actual = test_data['Churn'] 
# The normalize argument was deprecated and later removed from scikit-learn, so it is omitted here
fraction_of_positives, mean_predicted_value = calibration_curve(actual, probs, n_bins=10)

ax.plot(mean_predicted_value, fraction_of_positives, "s-", label="CoxPH")
ax.set_ylabel("Fraction of positives") 
ax.set_ylim([-0.05, 1.05]) 
ax.legend(loc="lower right") 
ax.set_title('Calibration plots (reliability curve)')

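[Figure: calibration plot (reliability curve)]

The plot shows that the model underestimates the retention rate, that is, it overestimates the churn rate.

8. Predicting churn for sampled users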

nochurn_data=test_data.loc[test_data['Churn']==0]
churn_clients = pd.DataFrame(model.predict_survival_function(nochurn_data))
churn_clients

	3943	496	2618	6676	1311	5387	3015	2080	5445	4095	2928	6376	5230	870	5422	...	2149	1380	501	4210	2110	6050	2937	5771	5022	190	1188	5236	5974	1668	1312
1.00	0.91	0.99	1.00	1.00	1.00	0.96	1.00	1.00	1.00	0.77	1.00	1.00	1.00	1.00	1.00	...	1.00	1.00	1.00	1.00	0.98	1.00	1.00	0.98	1.00	1.00	1.00	1.00	0.99	1.00	1.00
2.00	0.87	0.99	1.00	1.00	1.00	0.95	1.00	1.00	1.00	0.68	1.00	1.00	1.00	1.00	1.00	...	1.00	1.00	1.00	1.00	0.98	1.00	1.00	0.98	1.00	1.00	1.00	1.00	0.99	1.00	1.00
3.00	0.84	0.98	1.00	1.00	1.00	0.93	1.00	1.00	1.00	0.62	1.00	1.00	1.00	1.00	1.00	...	1.00	1.00	1.00	1.00	0.97	1.00	1.00	0.97	1.00	1.00	1.00	1.00	0.98	1.00	1.00
4.00	0.80	0.98	1.00	1.00	1.00	0.92	1.00	1.00	1.00	0.55	1.00	1.00	1.00	1.00	1.00	...	1.00	1.00	1.00	1.00	0.96	1.00	1.00	0.97	1.00	1.00	1.00	1.00	0.98	1.00	1.00
5.00	0.77	0.97	1.00	0.99	1.00	0.91	1.00	1.00	1.00	0.49	1.00	1.00	1.00	1.00	1.00	...	1.00	1.00	1.00	1.00	0.96	1.00	1.00	0.96	1.00	1.00	0.99	1.00	0.97	1.00	1.00
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
68.00	0.00	0.00	0.06	0.03	0.20	0.00	0.97	0.99	0.96	0.00	1.00	0.80	0.97	1.00	0.98	...	0.32	0.94	0.98	0.60	0.00	0.74	0.90	0.00	1.00	0.41	0.03	0.20	0.00	0.09	0.59
69.00	0.00	0.00	0.04	0.02	0.16	0.00	0.97	0.99	0.95	0.00	1.00	0.78	0.97	1.00	0.98	...	0.28	0.93	0.98	0.56	0.00	0.71	0.89	0.00	1.00	0.36	0.02	0.16	0.00	0.07	0.55
70.00	0.00	0.00	0.01	0.00	0.09	0.00	0.96	0.98	0.94	0.00	0.99	0.71	0.96	0.99	0.97	...	0.18	0.91	0.97	0.46	0.00	0.64	0.86	0.00	1.00	0.26	0.01	0.08	0.00	0.03	0.45
71.00	0.00	0.00	0.01	0.00	0.05	0.00	0.95	0.98	0.93	0.00	0.99	0.67	0.95	0.99	0.96	...	0.13	0.89	0.97	0.40	0.00	0.58	0.83	0.00	1.00	0.20	0.00	0.05	0.00	0.01	0.39
72.00	0.00	0.00	0.00	0.00	0.04	0.00	0.95	0.98	0.92	0.00	0.99	0.63	0.95	0.99	0.96	...	0.10	0.88	0.96	0.35	0.00	0.54	0.81	0.00	1.00	0.16	0.00	0.03	0.00	0.01	0.34

72 rows × 1031 columns

plt.figure(figsize=(10, 10),dpi=600)
churn_clients[churn_clients.columns[0]].plot(color='c')
churn_clients[churn_clients.columns[1]].plot(color='y')
churn_clients[churn_clients.columns[21]].plot(color='m')
churn_clients[1311].plot(color='g')
plt.axhline(0.5, color='k', linestyle='--', label='Threshold=0.5')  # threshold line across the full time axis
plt.ylim(0,1)
plt.xlim(0,72)
plt.xlabel('Timeline')
plt.ylabel('Retain probability')
plt.legend(loc='best')
plt.title('The Churn Trend of Samples')

[Figure: predicted retention curves for sampled users, with the 0.5 threshold]
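To turn such curves into an action list, one option is to find the first month in which each user's predicted retention probability drops below the 0.5 threshold. The sketch below does this in plain Python; the function name and the two short curves are illustrative assumptions, not model output.

```python
def first_time_below(times, survival, threshold=0.5):
    """First time at which the survival probability falls below threshold, or None."""
    for t, s in zip(times, survival):
        if s < threshold:
            return t
    return None

# Illustrative curves over the first five months.
months = [1, 2, 3, 4, 5]
curves = {
    "user_a": [0.77, 0.68, 0.62, 0.55, 0.49],  # declines quickly
    "user_b": [1.00, 1.00, 1.00, 1.00, 1.00],  # stays flat
}

crossing = {user: first_time_below(months, s) for user, s in curves.items()}
print(crossing)
```

Users with an early crossing month are natural candidates for win-back offers first, while a result of None means the curve never crosses the threshold within the observed horizon.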