Competition overview
PPDai's (拍拍贷) "Magic Mirror" risk-control system (魔镜风控系统) evaluates a user's current credit standing from roughly 400 data dimensions and assigns every borrower a credit score. Combined with the information of each newly listed loan, it then predicts the listing's 6-month overdue rate, giving investors a key decision input and promoting healthy, efficient internet finance. PPDai is opening up its rich, real historical data for the first time and invites you to compete against the Magic Mirror system: using machine learning, can you design a default-prediction algorithm with higher predictive accuracy and better computational performance?
Competition rules
Teams build a prediction model on the training set and use it to score the test set (the higher the score, the more likely the loan is to default).
Evaluation metric: the competition is judged by AUC, i.e. the area under the ROC (Receiver Operating Characteristic) curve, with the False Positive Rate on the x-axis and the True Positive Rate on the y-axis.
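For reference, the metric can be computed directly from predicted scores and true labels with scikit-learn; the snippet below is a minimal sketch on toy arrays (y_true and y_score are made-up placeholders, not competition data).
import numpy as np
from sklearn.metrics import roc_auc_score

# toy labels (1 = default) and predicted scores; a higher score means "more likely to default"
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.3, 0.9])
# area under the ROC curve (FPR on the x-axis, TPR on the y-axis); 0.5 is random, 1.0 is a perfect ranking
print(roc_auc_score(y_true, y_score))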
Competition data
The competition releases loan-risk data from the domestic online lending industry, including the credit-default label (dependent variable), the basic and engineered fields used for modelling (independent variables), and raw web-behaviour data for the related users. To protect borrower privacy and PPDai's intellectual property, the fields have been anonymised.
The data are encoded in GBK. The first round provides 30,000 training records and 20,000 test records. The second round adds another 30,000 records for model tuning, plus a new 10,000-record test set. Each training and test set consists of 3 csv files.
Master
Each row is one sample (one successfully funded loan); each sample has 200+ fields of various kinds.
- idx: unique key of each loan, which joins with the idx in the other 2 files.
- UserInfo_*: borrower attributes
- WeblogInfo_*: web-behaviour fields
- Education_Info*: education and student-record fields
- ThirdParty_Info_PeriodN_*: third-party data for time period N
- SocialNetwork_*: social-network fields
- ListingInfo: date the loan was listed and funded
- Target: default label (1 = loan default, 0 = repaid normally). The test set does not contain the target field.
Log_Info
Borrower login records.
- ListingInfo: date the loan was listed and funded
- LogInfo1: operation code
- LogInfo2: operation category
- LogInfo3: login time
- idx: unique key of each loan
Userupdate_Info
Borrower information-update records.
- ListingInfo1: date the loan was listed and funded
- UserupdateInfo1: updated content (which field was changed)
- UserupdateInfo2: update time
- idx: unique key of each loan
# Import packages
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style='whitegrid')
import arrow
# Use the arrow library to parse a date string into timestamp, year, month, day, ISO week,
# weekday and an early/mid/late-month stage; these are one-hot encoded before being fed to the model
def parse_date(date_str, str_format='YYYY/MM/DD'):
    d = arrow.get(date_str, str_format)
    # month stage: 1 = early (day 1-10), 2 = mid (11-20), 3 = late (21-31)
    month_stage = int((d.day - 1) / 10) + 1
    # note: d.timestamp is a property in arrow < 1.0; newer arrow versions expose d.int_timestamp instead
    return (d.timestamp, d.year, d.month, d.day, d.week, d.isoweekday(), month_stage)

# Print the column names of a DataFrame
def show_cols(df):
    for c in df.columns:
        print(c)
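A quick usage example of parse_date, together with the one-hot encoding mentioned in the comment above. This is only an illustrative sketch using the imports from the cell above; pd.get_dummies stands in for the encoding step and is not necessarily the exact encoding applied later in the pipeline.
# parse one date string into (timestamp, year, month, day, week, isoweekday, month_stage)
print(parse_date('2014/03/21'))

# one-hot encode a parsed categorical column, e.g. month_stage (1 = early, 2 = mid, 3 = late month),
# shown here on a small toy frame
demo = DataFrame({'month_stage': [1, 3, 2, 1]})
print(pd.get_dummies(demo, columns=['month_stage'], prefix='month_stage'))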
Read the data
path = '/Training Set'
# path = './PPD-First-Round-Data-Update/Training Set'
train_master = pd.read_csv('PPD_Training_Master_GBK_3_1_Training_Set.csv', encoding='gbk')
train_loginfo = pd.read_csv('PPD_LogInfo_3_1_Training_Set.csv', encoding='gbk')
train_userinfo = pd.read_csv('PPD_Userupdate_Info_3_1_Training_Set.csv', encoding='gbk')
Data cleaning
- Drop columns with a large share of missing values, e.g. more than 20% nan
- Drop rows with many missing values, keeping the number of dropped rows under 1% of the total
- Fill the remaining missing values: use value_counts to judge whether a feature is continuous or discrete, then fill nan with the most frequent value or the mean. This is done by inspection rather than just checking whether the dtype is object, which matches the actual data better
# number of NULL values in each column of train_master
null_sum = train_master.isnull().sum()
# keep only the columns that actually contain NULLs
null_sum = null_sum[null_sum != 0]
null_sum_df = DataFrame(null_sum, columns=['num'])
# missing-value ratio (30,000 training rows)
null_sum_df['ratio'] = null_sum_df['num'] / 30000.0
null_sum_df.sort_values(by='ratio', ascending=False, inplace=True)
print(null_sum_df.head(10))
# drop the columns with severe missingness
train_master.drop(['WeblogInfo_3', 'WeblogInfo_1', 'UserInfo_11', 'UserInfo_13', 'UserInfo_12', 'WeblogInfo_20'],
                  axis=1, inplace=True)
num ratio
WeblogInfo_3 29030 0.967667
WeblogInfo_1 29030 0.967667
UserInfo_11 18909 0.630300
UserInfo_13 18909 0.630300
UserInfo_12 18909 0.630300
WeblogInfo_20 8050 0.268333
WeblogInfo_21 3074 0.102467
WeblogInfo_19 2963 0.098767
WeblogInfo_2 1658 0.055267
WeblogInfo_4 1651 0.055033
# Drop rows with many missing values
record_nan = train_master.isnull().sum(axis=1).sort_values(ascending=False)
print(record_nan.head())
# drop rows with >= 5 missing values
drop_record_index = [i for i in record_nan.loc[(record_nan >= 5)].index]
# shape before the drop: (30000, 222)
print('before train_master shape {}'.format(train_master.shape))
train_master.drop(drop_record_index, inplace=True)
# shape after the drop: (29189, 222)
print('after train_master shape {}'.format(train_master.shape))
# len(drop_record_index)
29341 33
18637 31
17386 31
29130 31
29605 31
dtype: int64
before train_master shape (30000, 222)
after train_master shape (29189, 222)
# total number of nan values
print('before all nan num: {}'.format(train_master.isnull().sum().sum()))
# where UserInfo_2 is null, fill it with the placeholder string '位置地点' ("unknown location")
train_master.loc[train_master['UserInfo_2'].isnull(), 'UserInfo_2'] = '位置地点'
# where UserInfo_4 is null, fill it with the same placeholder
train_master.loc[train_master['UserInfo_4'].isnull(), 'UserInfo_4'] = '位置地点'

def fill_nan(f, method):
    if method == 'most':
        # fill with the most frequent value
        common_value = pd.value_counts(train_master[f], ascending=False).index[0]
    else:
        # fill with the mean
        common_value = train_master[f].mean()
    train_master.loc[train_master[f].isnull(), f] = common_value

# the choice of 'most' vs 'mean' comes from inspecting pd.value_counts(train_master[f])
fill_nan('UserInfo_1', 'most')
fill_nan('UserInfo_3', 'most')
fill_nan('WeblogInfo_2', 'most')
fill_nan('WeblogInfo_4', 'mean')
fill_nan('WeblogInfo_5', 'mean')
fill_nan('WeblogInfo_6', 'mean')
fill_nan('WeblogInfo_19', 'most')
fill_nan('WeblogInfo_21', 'most')
print('after all nan num: {}'.format(train_master.isnull().sum().sum()))
before all nan num: 0
9725
13478
25688
24185
23997
after all nan num: 0
Feature classification
- For every feature, if its most frequent value covers more than a threshold (50%) of the rows, the column is turned into a binary feature. For example [0,1,2,0,0,0,4,0,3] becomes [0,1,1,0,0,0,1,0,1] (see the small sketch right after this list)
- The remaining features are split by dtype into numerical and categorical
- Numerical features with no more than 10 unique values are also treated as categorical
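A minimal sketch of the binarization rule on the toy list from the first bullet, independent of the training data:
from pandas import Series

toy = Series([0, 1, 2, 0, 0, 0, 4, 0, 3])
most_value = toy.value_counts().index[0]       # 0 is the most frequent value
binarized = (toy != most_value).astype(int)    # most frequent value -> 0, everything else -> 1
print(binarized.tolist())                      # [0, 1, 1, 0, 0, 0, 1, 0, 1]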
ratio_threshold = 0.5
binarized_features = []
binarized_features_most_freq_value = []
# Aggregating the means of the third-party features across periods was tried but did not improve results, so it is disabled
# third_party_features = []
# Go through every column except target
for f in train_master.columns:
    if f in ['target']:
        continue
    # number of non-null values
    not_null_sum = (train_master[f].notnull()).sum()
    # count of the most frequent value
    most_count = pd.value_counts(train_master[f], ascending=False).iloc[0]
    # the most frequent value itself
    most_value = pd.value_counts(train_master[f], ascending=False).index[0]
    # share of the most frequent value among the non-null values
    ratio = most_count / not_null_sum
    # if the share exceeds the threshold, mark the feature for binarization
    if ratio > ratio_threshold:
        binarized_features.append(f)
        binarized_features_most_freq_value.append(most_value)
# numerical features (non-object dtype, excluding 'Idx', 'target' and the binarized features)
numerical_features = [f for f in train_master.select_dtypes(exclude=['object']).columns
                      if f not in ['Idx', 'target'] and f not in binarized_features]
# categorical features (object dtype, excluding 'Idx', 'target' and the binarized features)
categorical_features = [f for f in train_master.select_dtypes(include=["object"]).columns
                        if f not in ['Idx', 'target'] and f not in binarized_features]
# For each binarized feature, create a new column prefixed with b_: the most frequent value
# maps to 0, every other value maps to 1; then drop the original column
for i in range(len(binarized_features)):
    f = binarized_features[i]
    most_value = binarized_features_most_freq_value[i]
    train_master['b_' + f] = 1
    train_master.loc[train_master[f] == most_value, 'b_' + f] = 0
    train_master.drop([f], axis=1, inplace=True)
feature_unique_count = []
# For each numerical feature, count how many distinct non-zero values it takes
for f in numerical_features:
    feature_unique_count.append((np.count_nonzero(train_master[f].unique()), f))
# print(sorted(feature_unique_count))
# Features with <= 10 distinct values are moved to the categorical group
for c, f in feature_unique_count:
    if c <= 10:
        print('{} moved from numerical to categorical'.format(f))
        numerical_features.remove(f)
        categorical_features.append(f)
[(60, 'WeblogInfo_4'), (59, 'WeblogInfo_6'), (167, 'WeblogInfo_7'), (64, 'WeblogInfo_16'), (103, 'WeblogInfo_17'), (38, 'UserInfo_18'), (273, 'ThirdParty_Info_Period1_1'), (252, 'ThirdParty_Info_Period1_2'), (959, 'ThirdParty_Info_Period1_3'), (916, 'ThirdParty_Info_Period1_4'), (387, 'ThirdParty_Info_Period1_5'), (329, 'ThirdParty_Info_Period1_6'), (1217, 'ThirdParty_Info_Period1_7'), (563, 'ThirdParty_Info_Period1_8'), (111, 'ThirdParty_Info_Period1_11'), (18784, 'ThirdParty_Info_Period1_13'), (17989, 'ThirdParty_Info_Period1_14'), (5073, 'ThirdParty_Info_Period1_15'), (20047, 'ThirdParty_Info_Period1_16'), (14785, 'ThirdParty_Info_Period1_17'), (336, 'ThirdParty_Info_Period2_1'), (298, 'ThirdParty_Info_Period2_2'), (1192, 'ThirdParty_Info_Period2_3'), (1149, 'ThirdParty_Info_Period2_4'), (450, 'ThirdParty_Info_Period2_5'), (431, 'ThirdParty_Info_Period2_6'), (1524, 'ThirdParty_Info_Period2_7'), (715, 'ThirdParty_Info_Period2_8'), (134, 'ThirdParty_Info_Period2_11'), (21685, 'ThirdParty_Info_Period2_13'), (20719, 'ThirdParty_Info_Period2_14'), (6582, 'ThirdParty_Info_Period2_15'), (22385, 'ThirdParty_Info_Period2_16'), (18554, 'ThirdParty_Info_Period2_17'), (339, 'ThirdParty_Info_Period3_1'), (293, 'ThirdParty_Info_Period3_2'), (1172, 'ThirdParty_Info_Period3_3'), (1168, 'ThirdParty_Info_Period3_4'), (453, 'ThirdParty_Info_Period3_5'), (428, 'ThirdParty_Info_Period3_6'), (1511, 'ThirdParty_Info_Period3_7'), (707, 'ThirdParty_Info_Period3_8'), (129, 'ThirdParty_Info_Period3_11'), (21521, 'ThirdParty_Info_Period3_13'), (20571, 'ThirdParty_Info_Period3_14'), (6569, 'ThirdParty_Info_Period3_15'), (22247, 'ThirdParty_Info_Period3_16'), (18311, 'ThirdParty_Info_Period3_17'), (324, 'ThirdParty_Info_Period4_1'), (295, 'ThirdParty_Info_Period4_2'), (1183, 'ThirdParty_Info_Period4_3'), (1143, 'ThirdParty_Info_Period4_4'), (447, 'ThirdParty_Info_Period4_5'), (422, 'ThirdParty_Info_Period4_6'), (1524, 'ThirdParty_Info_Period4_7'), (706, 'ThirdParty_Info_Period4_8'), (130, 'ThirdParty_Info_Period4_11'), (20894, 'ThirdParty_Info_Period4_13'), (20109, 'ThirdParty_Info_Period4_14'), (6469, 'ThirdParty_Info_Period4_15'), (21644, 'ThirdParty_Info_Period4_16'), (17849, 'ThirdParty_Info_Period4_17'), (322, 'ThirdParty_Info_Period5_1'), (284, 'ThirdParty_Info_Period5_2'), (1144, 'ThirdParty_Info_Period5_3'), (1119, 'ThirdParty_Info_Period5_4'), (436, 'ThirdParty_Info_Period5_5'), (401, 'ThirdParty_Info_Period5_6'), (1470, 'ThirdParty_Info_Period5_7'), (685, 'ThirdParty_Info_Period5_8'), (126, 'ThirdParty_Info_Period5_11'), (20010, 'ThirdParty_Info_Period5_13'), (19145, 'ThirdParty_Info_Period5_14'), (6033, 'ThirdParty_Info_Period5_15'), (20723, 'ThirdParty_Info_Period5_16'), (17149, 'ThirdParty_Info_Period5_17'), (312, 'ThirdParty_Info_Period6_1'), (265, 'ThirdParty_Info_Period6_2'), (1074, 'ThirdParty_Info_Period6_3'), (1046, 'ThirdParty_Info_Period6_4'), (414, 'ThirdParty_Info_Period6_5'), (363, 'ThirdParty_Info_Period6_6'), (1411, 'ThirdParty_Info_Period6_7'), (637, 'ThirdParty_Info_Period6_8'), (71, 'ThirdParty_Info_Period6_9'), (15, 'ThirdParty_Info_Period6_10'), (123, 'ThirdParty_Info_Period6_11'), (95, 'ThirdParty_Info_Period6_12'), (16605, 'ThirdParty_Info_Period6_13'), (16170, 'ThirdParty_Info_Period6_14'), (5188, 'ThirdParty_Info_Period6_15'), (17220, 'ThirdParty_Info_Period6_16'), (14553, 'ThirdParty_Info_Period6_17')]
Feature Engineering
Numerical features
- For every numerical feature, plot its distribution per target value with stripplot (with jitter); similar to a boxplot, but it makes large-value outliers easier to spot
- Plot the density of every numerical feature; they can all be brought closer to a normal distribution by taking logs (a quick skewness check is sketched right after this list)
- After the log transform, a few extremely small outliers can be removed as well
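One way to back the second bullet up with a number is to compare skewness before and after the log transform. This is only a sketch, assuming train_master and numerical_features from the cells above; scipy.stats.skew is used, with a simple guard against negative values.
from scipy.stats import skew

# skewness closer to 0 means a more symmetric, more normal-looking distribution
for f in numerical_features[:5]:   # only the first few features, for brevity
    raw = train_master[f].dropna()
    print(f, round(skew(raw), 2), round(skew(np.log1p(raw[raw >= 0])), 2))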
melt = pd.melt(train_master, id_vars=['target'], value_vars = [f for f in numerical_features])
print(melt.head(50))
print(melt.shape)
g = sns.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
g.map(sns.stripplot, 'target', 'value', jitter=True, palette="muted")
target variable value
0 0 WeblogInfo_4 1.000000
1 0 WeblogInfo_4 1.000000
2 0 WeblogInfo_4 2.000000
3 0 WeblogInfo_4 3.027468
4 0 WeblogInfo_4 1.000000
5 0 WeblogInfo_4 2.000000
6 1 WeblogInfo_4 13.000000
7 0 WeblogInfo_4 12.000000
8 1 WeblogInfo_4 10.000000
9 0 WeblogInfo_4 1.000000
10 0 WeblogInfo_4 3.000000
11 0 WeblogInfo_4 1.000000
12 0 WeblogInfo_4 11.000000
13 1 WeblogInfo_4 1.000000
14 0 WeblogInfo_4 3.000000
15 0 WeblogInfo_4 2.000000
16 0 WeblogInfo_4 4.000000
17 0 WeblogInfo_4 4.000000
18 1 WeblogInfo_4 1.000000
19 0 WeblogInfo_4 2.000000
20 0 WeblogInfo_4 3.000000
21 0 WeblogInfo_4 3.000000
22 0 WeblogInfo_4 8.000000
23 0 WeblogInfo_4 1.000000
24 0 WeblogInfo_4 1.000000
25 0 WeblogInfo_4 2.000000
26 0 WeblogInfo_4 9.000000
27 0 WeblogInfo_4 2.000000
28 0 WeblogInfo_4 2.000000
29 0 WeblogInfo_4 2.000000
30 0 WeblogInfo_4 3.000000
31 0 WeblogInfo_4 6.000000
32 0 WeblogInfo_4 1.000000
33 0 WeblogInfo_4 3.000000
34 0 WeblogInfo_4 3.027468
35 0 WeblogInfo_4 6.000000
36 0 WeblogInfo_4 9.000000
37 0 WeblogInfo_4 2.000000
38 1 WeblogInfo_4 5.000000
39 0 WeblogInfo_4 2.000000
40 0 WeblogInfo_4 2.000000
41 0 WeblogInfo_4 3.000000
42 0 WeblogInfo_4 3.027468
43 0 WeblogInfo_4 15.000000
44 0 WeblogInfo_4 2.000000
45 0 WeblogInfo_4 3.000000
46 0 WeblogInfo_4 3.000000
47 0 WeblogInfo_4 2.000000
48 0 WeblogInfo_4 3.000000
49 0 WeblogInfo_4 2.000000
(2714577, 3)
E:\Anaconda3\envs\sklearn\lib\site-packages\seaborn\axisgrid.py:715: UserWarning: Using the stripplot function without specifying `order` is likely to produce an incorrect plot.
warnings.warn(warning)
<seaborn.axisgrid.FacetGrid at 0x4491c80860>
# Based on the seaborn plots above, inspect how feature values distribute in the positive/negative classes and drop outlier rows
print('{} lines before drop'.format(train_master.shape[0]))
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_1 > 250) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period6_2 > 400].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_2 > 250) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period6_3 > 2000].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_3 > 1250) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period6_4 > 1500].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_4 > 1250) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_5 > 400)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_7 > 2000)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_6 > 1500)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_5 > 1000) & (train_master.target == 0)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_8 > 1500)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_8 > 1000) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_16 > 2000000) & (train_master.target == 0)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_14 > 1000000) & (train_master.target == 0)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_12 > 60)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_11 > 120) & (train_master.target == 0)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_11 > 20) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_13 > 200000)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_13 > 150000) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_15 > 40000) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_17 > 130000) & (train_master.target == 0)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period5_1 > 500].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period5_2 > 500].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_3 > 3000) & (train_master.target == 0)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_3 > 2000)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period5_5 > 500].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_4 > 2000) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period5_6 > 700].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_6 > 300) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_7 > 4000)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_8 > 800)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_11 > 200)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period5_13 > 200000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period5_14 > 150000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period5_15 > 75000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period5_16 > 180000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period5_17 > 150000].index, inplace=True)
# go above
train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_1 > 400)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_2 > 350)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_3 > 1500)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_4 > 1600].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_4 > 1250) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_5 > 500].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_6 > 800].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_6 > 400) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_8 > 1000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_13 > 250000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_14 > 200000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_15 > 70000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_16 > 210000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_17 > 160000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_1 > 400].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_2 > 380].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_3 > 1750].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_4 > 1750].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period3_4 > 1250) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_5 > 600].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_6 > 800].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period3_6 > 400) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period3_7 > 1600) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_8 > 1000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_13 > 300000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_14 > 200000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_15 > 80000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_16 > 300000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_17 > 150000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_1 > 400].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_1 > 300) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_2 > 400].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_2 > 300) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_3 > 1800].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_3 > 1500) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_4 > 1500].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_5 > 580].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_6 > 800].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_6 > 400) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_7 > 2100].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_8 > 700) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_11 > 120].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_13 > 300000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_14 > 170000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_15 > 80000].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_15 > 50000) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_16 > 300000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_17 > 150000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_1 > 350].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_1 > 200) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_2 > 300].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_2 > 190) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_3 > 1500].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_4 > 1250].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_5 > 400].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_6 > 500].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_6 > 250) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_7 > 1800].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_8 > 720].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_8 > 600) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_11 > 100].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_13 > 200000].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_13 > 140000) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_14 > 150000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_15 > 70000].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_15 > 30000) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_16 > 200000].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_16 > 100000) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_17 > 100000].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_17 > 80000) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.WeblogInfo_4 > 40].index, inplace=True)
train_master.drop(train_master[train_master.WeblogInfo_6 > 40].index, inplace=True)
train_master.drop(train_master[train_master.WeblogInfo_7 > 150].index, inplace=True)
train_master.drop(train_master[train_master.WeblogInfo_16 > 50].index, inplace=True)
train_master.drop(train_master[(train_master.WeblogInfo_16 > 25) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.WeblogInfo_17 > 100].index, inplace=True)
train_master.drop(train_master[(train_master.WeblogInfo_17 > 80) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.UserInfo_18 < 10].index, inplace=True)
print('{} lines after drop'.format(train_master.shape[0]))
29189 lines before drop
28074 lines after drop
# melt = pd.melt(train_master, id_vars=['target'], value_vars = [f for f in numerical_features if f != 'Idx'])
g = sns.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
g.map(sns.distplot, "value")
E:\Anaconda3\envs\sklearn\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
<seaborn.axisgrid.FacetGrid at 0x44984f02e8>
# train_master_log = train_master.copy()
numerical_features_log = [f for f in numerical_features if f not in ['Idx']]
# take the log of every numerical feature
# (np.log1p returns -inf for values equal to -1, which triggers the RuntimeWarning below;
#  those -inf entries are handled in the next cells)
for f in numerical_features_log:
    train_master[f + '_log'] = np.log1p(train_master[f])
    train_master.drop([f], axis=1, inplace=True)
E:\Anaconda3\envs\sklearn\lib\site-packages\ipykernel_launcher.py:6: RuntimeWarning: divide by zero encountered in log1p
from math import inf
(train_master == -inf).sum().sum()
206845
train_master.replace(-inf, -1, inplace=True)
# density plots after the log transform; the distributions should now be closer to normal
melt = pd.melt(train_master, id_vars=['target'], value_vars = [f+'_log' for f in numerical_features])
g = sns.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
g.map(sns.distplot, "value")
<seaborn.axisgrid.FacetGrid at 0x44f45c2470>
# stripplots after the log transform, to check whether any outliers remain on the log scale
g = sns.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
g.map(sns.stripplot, 'target', 'value', jitter=True, palette="muted")
E:\Anaconda3\envs\sklearn\lib\site-packages\seaborn\axisgrid.py:715: UserWarning: Using the stripplot function without specifying `order` is likely to produce an incorrect plot.
warnings.warn(warning)
<seaborn.axisgrid.FacetGrid at 0x44e4270908>
Categorical features
melt = pd.melt(train_master, id_vars=['target'], value_vars=[f for f in categorical_features])
g = sns.FacetGrid(melt, col='variable', col_wrap=4, sharex=False, sharey=False)
g.map(sns.countplot, 'value', palette="muted")
E:\Anaconda3\envs\sklearn\lib\site-packages\seaborn\axisgrid.py:715: UserWarning: Using the countplot function without specifying `order` is likely to produce an incorrect plot.
warnings.warn(warning)
<seaborn.axisgrid.FacetGrid at 0x44e4c3eac8>
Correlation check
target_corr = np.abs(train_master.corr()['target']).sort_values(ascending=False)
target_corr
target 1.000000
ThirdParty_Info_Period6_5_log 0.139606
ThirdParty_Info_Period6_11_log 0.139083
ThirdParty_Info_Period6_4_log 0.137962
ThirdParty_Info_Period6_7_log 0.135729
ThirdParty_Info_Period6_3_log 0.132310
ThirdParty_Info_Period6_14_log 0.131138
ThirdParty_Info_Period6_8_log 0.130577
ThirdParty_Info_Period6_16_log 0.128451
ThirdParty_Info_Period6_13_log 0.128013
ThirdParty_Info_Period5_5_log 0.126701
ThirdParty_Info_Period6_17_log 0.126456
ThirdParty_Info_Period5_4_log 0.121786
ThirdParty_Info_Period6_10_log 0.121729
ThirdParty_Info_Period6_1_log 0.121112
ThirdParty_Info_Period5_11_log 0.117162
ThirdParty_Info_Period5_7_log 0.114794
ThirdParty_Info_Period6_2_log 0.112041
ThirdParty_Info_Period6_9_log 0.112039
ThirdParty_Info_Period5_14_log 0.111374
ThirdParty_Info_Period5_3_log 0.108039
ThirdParty_Info_Period5_16_log 0.104786
ThirdParty_Info_Period6_12_log 0.104733
ThirdParty_Info_Period5_13_log 0.104688
ThirdParty_Info_Period5_1_log 0.104191
ThirdParty_Info_Period5_8_log 0.102859
ThirdParty_Info_Period4_5_log 0.101329
ThirdParty_Info_Period5_17_log 0.100960
ThirdParty_Info_Period4_4_log 0.094715
ThirdParty_Info_Period5_2_log 0.090261
...
ThirdParty_Info_Period4_15_log 0.004560
b_ThirdParty_Info_Period4_12 0.004331
b_WeblogInfo_13 0.004090
b_SocialNetwork_4 0.003752
b_SocialNetwork_3 0.003752
b_SocialNetwork_2 0.003752
b_SocialNetwork_16 0.003711
b_SocialNetwork_6 0.003701
b_SocialNetwork_5 0.003701
b_WeblogInfo_44 0.003542
WeblogInfo_7_log 0.003414
b_WeblogInfo_32 0.002961
WeblogInfo_16_log 0.002954
b_ThirdParty_Info_Period2_12 0.002925
b_WeblogInfo_29 0.002550
b_WeblogInfo_41 0.002522
ThirdParty_Info_Period4_6_log 0.002362
b_WeblogInfo_11 0.002257
b_WeblogInfo_12 0.002209
b_WeblogInfo_8 0.001922
b_WeblogInfo_40 0.001759
b_WeblogInfo_36 0.001554
b_WeblogInfo_26 0.001357
ThirdParty_Info_Period1_3_log 0.000937
b_WeblogInfo_31 0.000896
b_WeblogInfo_23 0.000276
ThirdParty_Info_Period1_8_log 0.000194
b_WeblogInfo_38 0.000077
b_WeblogInfo_10 NaN
b_WeblogInfo_49 NaN
Name: target, Length: 215, dtype: float64
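The NaN entries at the bottom (b_WeblogInfo_10, b_WeblogInfo_49) most likely come from columns that are constant after the preprocessing above, which makes their correlation with target undefined. Below is a hedged sketch of how such zero-variance columns could be found and, optionally, dropped; this step is a suggestion, not something the original pipeline necessarily does.
# columns with a single unique value carry no information and produce NaN correlations
constant_cols = [c for c in train_master.columns if train_master[c].nunique() <= 1]
print(constant_cols)
# train_master.drop(columns=constant_cols, inplace=True)   # optional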
# at_home: UserInfo_2 and UserInfo_8 presumably encode the borrower's current city and home-town/registered city, so equality suggests the borrower lives in their home region
train_master['at_home'] = np.where(train_master['UserInfo_2']==train_master['UserInfo_8'], 1, 0)
train_master['at_home']
0 1
1 1
2 1
3 1
4 1
5 0
6 0
7 0
9 0
10 1
11 1
12 1
13 1
14 0
15 1
16 0
17 1
18 1
19 1
20 1
21 0
22 0
23 0
24 0
25 1
26 1
27 1
28 1
29 0
30 1
..
29970 0
29971 1
29972 1
29973 0
29974 0
29975 1
29976 0
29977 1
29978 1
29979 1
29980 0
29981 0
29982 1
29983 0
29984 0
29985 0
29986 0
29987 1
29988 1
29989 1
29990 0
29991 1
29992 1
29993 1
29994 0
29995 1
29996 1
29997 0
29998 0
29999 1
Name: at_home, Length: 28074, dtype: int32
train_master_ = train_master.copy()
def parse_ListingInfo(date):
    d = parse_date(date, 'YYYY/M/D')
    return Series(d,
                  index=['ListingInfo_timestamp', 'ListingInfo_year', 'ListingInfo_month',
                         'ListingInfo_day', 'ListingInfo_week', 'ListingInfo_isoweekday', 'ListingInfo_month_stage'],
                  dtype=np.int32)

ListingInfo_parsed = train_master_['ListingInfo'].apply(parse_ListingInfo)
print('before train_master_ shape {}'.format(train_master_.shape))
train_master_ = train_master_.merge(ListingInfo_parsed, how='left', left_index=True, right_index=True)
print('after train_master_ shape {}'.format(train_master_.shape))
before train_master_ shape (28074, 223)
after train_master_ shape (28074, 230)
train_loginfo: borrower login records
- Group by Idx and extract the number of records, the number of distinct LogInfo1 values, the number of active days, and the date span
def loginfo_aggr(group):
    # number of rows in the group
    loginfo_num = group.shape[0]
    # number of distinct operation codes
    loginfo_LogInfo1_unique_num = group['LogInfo1'].unique().shape[0]
    # number of distinct login dates
    loginfo_active_day_num = group['LogInfo3'].unique().shape[0]
    # earliest login date
    min_day = parse_date(np.min(group['LogInfo3']), str_format='YYYY-MM-DD')
    # latest login date
    max_day = parse_date(np.max(group['LogInfo3']), str_format='YYYY-MM-DD')
    # span in days between the latest and the earliest login
    gap_day = round((max_day[0] - min_day[0]) / 86400)
    indexes = {
        'loginfo_num': loginfo_num,
        'loginfo_LogInfo1_unique_num': loginfo_LogInfo1_unique_num,
        'loginfo_active_day_num': loginfo_active_day_num,
        'loginfo_gap_day': gap_day,
        'loginfo_last_day_timestamp': max_day[0]
    }
    # TODO every individual LogInfo1,LogInfo2 count
    def sub_aggr_loginfo(sub_group):
        return sub_group.shape[0]
    # number of distinct (LogInfo1, LogInfo2) combinations
    sub_group = group.groupby(by=['LogInfo1', 'LogInfo2']).apply(sub_aggr_loginfo)
    indexes['loginfo_LogInfo12_unique_num'] = sub_group.shape[0]
    return Series(data=[indexes[c] for c in indexes], index=[c for c in indexes])

train_loginfo_grouped = train_loginfo.groupby(by=['Idx']).apply(loginfo_aggr)
train_loginfo_grouped.head()
Idx | loginfo_num | loginfo_LogInfo1_unique_num | loginfo_active_day_num | loginfo_gap_day | loginfo_last_day_timestamp | loginfo_LogInfo12_unique_num
3 | 26 | 4 | 8 | 63 | 1383264000 | 9
5 | 11 | 6 | 4 | 13 | 1383696000 | 8
8 | 125 | 7 | 13 | 12 | 1383696000 | 11
12 | 199 | 8 | 11 | 328 | 1383264000 | 14
16 | 15 | 4 | 7 | 8 | 1383523200 | 6
train_loginfo_grouped.to_csv('train_loginfo_grouped.csv', header=True, index=True)
train_loginfo_grouped = pd.read_csv('train_loginfo_grouped.csv')
train_loginfo_grouped.head()
 | Idx | loginfo_num | loginfo_LogInfo1_unique_num | loginfo_active_day_num | loginfo_gap_day | loginfo_last_day_timestamp | loginfo_LogInfo12_unique_num
0 | 3 | 26 | 4 | 8 | 63 | 1383264000 | 9
1 | 5 | 11 | 6 | 4 | 13 | 1383696000 | 8
2 | 8 | 125 | 7 | 13 | 12 | 1383696000 | 11
3 | 12 | 199 | 8 | 11 | 328 | 1383264000 | 14
4 | 16 | 15 | 4 | 7 | 8 | 1383523200 | 6
train_userinfo: borrower information-update records
- Group by Idx and extract the number of records, the number of distinct UserupdateInfo1 values, the number of distinct UserupdateInfo1/UserupdateInfo2 values, and the date span, plus the count of each individual UserupdateInfo1 type
def userinfo_aggr(group):
    op_columns = ['_EducationId', '_HasBuyCar', '_LastUpdateDate',
                  '_MarriageStatusId', '_MobilePhone', '_QQ', '_ResidenceAddress',
                  '_ResidencePhone', '_ResidenceTypeId', '_ResidenceYears', '_age',
                  '_educationId', '_gender', '_hasBuyCar', '_idNumber',
                  '_lastUpdateDate', '_marriageStatusId', '_mobilePhone', '_qQ',
                  '_realName', '_regStepId', '_residenceAddress', '_residencePhone',
                  '_residenceTypeId', '_residenceYears', '_IsCash', '_CompanyPhone',
                  '_IdNumber', '_Phone', '_RealName', '_CompanyName', '_Age',
                  '_Gender', '_OtherWebShopType', '_turnover', '_WebShopTypeId',
                  '_RelationshipId', '_CompanyAddress', '_Department',
                  '_flag_UCtoBcp', '_flag_UCtoPVR', '_WorkYears', '_ByUserId',
                  '_DormitoryPhone', '_IncomeFrom', '_CompanyTypeId',
                  '_CompanySizeId', '_companyTypeId', '_department',
                  '_companyAddress', '_workYears', '_contactId', '_creationDate',
                  '_flag_UCtoBCP', '_orderId', '_phone', '_relationshipId', '_userId',
                  '_companyName', '_companyPhone', '_isCash', '_BussinessAddress',
                  '_webShopUrl', '_WebShopUrl', '_SchoolName', '_HasBusinessLicense',
                  '_dormitoryPhone', '_incomeFrom', '_schoolName', '_NickName',
                  '_CreationDate', '_CityId', '_DistrictId', '_ProvinceId',
                  '_GraduateDate', '_GraduateSchool', '_IdAddress', '_companySizeId',
                  '_HasPPDaiAccount', '_PhoneType', '_PPDaiAccount', '_SecondEmail',
                  '_SecondMobile', '_nickName', '_HasSbOrGjj', '_Position']
    # number of rows in the group
    userinfo_num = group.shape[0]
    # number of distinct updated fields
    userinfo_unique_num = group['UserupdateInfo1'].unique().shape[0]
    # number of distinct update dates
    userinfo_active_day_num = group['UserupdateInfo2'].unique().shape[0]
    # earliest update date
    min_day = parse_date(np.min(group['UserupdateInfo2']))
    # latest update date
    max_day = parse_date(np.max(group['UserupdateInfo2']))
    # span in days between the latest and the earliest update
    gap_day = round((max_day[0] - min_day[0]) / (86400))
    indexes = {
        'userinfo_num': userinfo_num,
        'userinfo_unique_num': userinfo_unique_num,
        'userinfo_active_day_num': userinfo_active_day_num,
        'userinfo_gap_day': gap_day,
        'userinfo_last_day_timestamp': max_day[0]
    }
    # initialise a counter for every known UserupdateInfo1 type
    for c in op_columns:
        indexes['userinfo' + c + '_num'] = 0
    def sub_aggr(sub_group):
        return sub_group.shape[0]
    # count how many times each UserupdateInfo1 type appears in this group
    sub_group = group.groupby(by=['UserupdateInfo1']).apply(sub_aggr)
    for c in sub_group.index:
        indexes['userinfo' + c + '_num'] = sub_group.loc[c]
    return Series(data=[indexes[c] for c in indexes], index=[c for c in indexes])

train_userinfo_grouped = train_userinfo.groupby(by=['Idx']).apply(userinfo_aggr)
train_userinfo_grouped.head()
| Idx | userinfo_num | userinfo_unique_num | userinfo_active_day_num | userinfo_gap_day | userinfo_last_day_timestamp | userinfo_EducationId_num | userinfo_HasBuyCar_num | userinfo_LastUpdateDate_num | userinfo_MarriageStatusId_num | userinfo_MobilePhone_num | ... | userinfo_IdAddress_num | userinfo_companySizeId_num | userinfo_HasPPDaiAccount_num | userinfo_PhoneType_num | userinfo_PPDaiAccount_num | userinfo_SecondEmail_num | userinfo_SecondMobile_num | userinfo_nickName_num | userinfo_HasSbOrGjj_num | userinfo_Position_num |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 13 | 11 | 1 | 0 | 1377820800 | 1 | 1 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 13 | 11 | 1 | 0 | 1382572800 | 1 | 1 | 2 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 14 | 12 | 2 | 10 | 1383523200 | 1 | 1 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12 | 14 | 14 | 2 | 298 | 1380672000 | 1 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 16 | 13 | 12 | 2 | 9 | 1383609600 | 1 | 1 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 91 columns
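groupby().apply() with a Python-level aggregation function is flexible but slow on tens of thousands of groups. The counting features above can also be built in a vectorized way; the following is only a sketch of that idea (it assumes a pandas version with named aggregation, reuses the column naming above, and leaves out the date-gap features), not a drop-in replacement for userinfo_aggr:

# per-borrower count of each updated field, without a Python-level apply
update_counts = pd.crosstab(train_userinfo['Idx'], train_userinfo['UserupdateInfo1'])
update_counts.columns = ['userinfo' + c + '_num' for c in update_counts.columns]
# the scalar aggregates in one vectorized pass (date-gap features omitted for brevity)
basic = train_userinfo.groupby('Idx').agg(
    userinfo_num=('UserupdateInfo1', 'size'),
    userinfo_unique_num=('UserupdateInfo1', 'nunique'),
    userinfo_active_day_num=('UserupdateInfo2', 'nunique'),
)
fast_userinfo_grouped = basic.join(update_counts)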
train_userinfo_grouped.to_csv('train_userinfo_grouped.csv', header=True, index=True)
train_userinfo_grouped = pd.read_csv('train_userinfo_grouped.csv')
train_userinfo_grouped.head()
| | Idx | userinfo_num | userinfo_unique_num | userinfo_active_day_num | userinfo_gap_day | userinfo_last_day_timestamp | userinfo_EducationId_num | userinfo_HasBuyCar_num | userinfo_LastUpdateDate_num | userinfo_MarriageStatusId_num | ... | userinfo_IdAddress_num | userinfo_companySizeId_num | userinfo_HasPPDaiAccount_num | userinfo_PhoneType_num | userinfo_PPDaiAccount_num | userinfo_SecondEmail_num | userinfo_SecondMobile_num | userinfo_nickName_num | userinfo_HasSbOrGjj_num | userinfo_Position_num |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | 13 | 11 | 1 | 0 | 1377820800 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 5 | 13 | 11 | 1 | 0 | 1382572800 | 1 | 1 | 2 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 8 | 14 | 12 | 2 | 10 | 1383523200 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 12 | 14 | 14 | 2 | 298 | 1380672000 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 16 | 13 | 12 | 2 | 9 | 1383609600 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 92 columns
print('before merge, train_master shape:{}'.format(train_master_.shape))
# the grouped features were re-read from csv, so Idx is a regular column and the
# merge is done on that column rather than on the index:
# train_master_ = train_master_.merge(train_loginfo_grouped, how='left', left_on='Idx', right_index=True)
# train_master_ = train_master_.merge(train_userinfo_grouped, how='left', left_on='Idx', right_index=True)
train_master_ = train_master_.merge(train_loginfo_grouped, how='left', on='Idx')
train_master_ = train_master_.merge(train_userinfo_grouped, how='left', on='Idx')
# borrowers with no login / update records get 0 for all aggregated features
train_master_.fillna(0, inplace=True)
print('after merge, train_master shape:{}'.format(train_master_.shape))
before merge, train_master shape:(28074, 230)
after merge, train_master shape:(28074, 327)
One-hot encoding features
Do not let get_dummies pick the columns to encode automatically: pandas would only select object-typed columns, while some non-object features are categorical in meaning and also need one-hot encoding, so the column list is built explicitly (a quick cardinality check is sketched below).
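As a sanity check before running the encoding cell below, one can list low-cardinality numeric columns. This is only a heuristic sketch, not part of the original pipeline: a small number of distinct values hints at, but does not prove, a categorical meaning.

# numeric (non-object) columns with few distinct values are candidates for the dummy list
numeric_cols = train_master_.select_dtypes(exclude=['object']).columns
low_cardinality = {c: train_master_[c].nunique()
                   for c in numeric_cols
                   if c != 'target' and train_master_[c].nunique() <= 10}
print(sorted(low_cardinality.items(), key=lambda kv: kv[1])[:20])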
# identifier, raw date/timestamp and high-cardinality location columns that are
# not fed to the model directly
drop_columns = ['Idx', 'ListingInfo', 'UserInfo_20', 'UserInfo_19', 'UserInfo_8', 'UserInfo_7',
                'UserInfo_4', 'UserInfo_2',
                'ListingInfo_timestamp', 'loginfo_last_day_timestamp', 'userinfo_last_day_timestamp']
train_master_ = train_master_.drop(drop_columns, axis=1)
# explicit list of columns to one-hot encode: the categorical features found
# earlier plus the parsed ListingInfo date parts
dummy_columns = categorical_features.copy()
dummy_columns.extend(['ListingInfo_year', 'ListingInfo_month', 'ListingInfo_day', 'ListingInfo_week',
                      'ListingInfo_isoweekday', 'ListingInfo_month_stage'])
final_dummy_columns = [c for c in dummy_columns if c not in drop_columns]
print('before get_dummies train_master_ shape {}'.format(train_master_.shape))
train_master_ = pd.get_dummies(train_master_, columns=final_dummy_columns)
print('after get_dummies train_master_ shape {}'.format(train_master_.shape))
before get_dummies train_master_ shape (28074, 316)
after get_dummies train_master_ shape (28074, 444)
Standardization
from sklearn.preprocessing import StandardScaler
# separate features and label, then scale every column to zero mean / unit variance
X_train = train_master_.drop(['target'], axis=1)
X_train = StandardScaler().fit_transform(X_train)
y_train = train_master_['target']
print(X_train.shape, y_train.shape)
(28074, 443) (28074,)
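Note that the scaler above is fit on the whole training matrix before cross-validation, so each fold's scaling statistics are partly computed from its own validation data. A leak-free variant (a minimal sketch, not the pipeline actually used here; X_raw and y are just local names for the unscaled features and the label) wraps the scaler and the estimator in a Pipeline so scaling is re-fit inside every fold:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# unscaled feature matrix; the scaler is fit per CV fold inside the pipeline
X_raw = train_master_.drop(['target'], axis=1)
y = train_master_['target']

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(solver='lbfgs', max_iter=1000)),
])
print(cross_val_score(pipe, X_raw, y, scoring='roc_auc', cv=3).mean())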
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
# from scikitplot import plotters as skplt
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC, LinearSVC
# StratifiedKFold keeps the class ratio of the highly imbalanced target consistent
# across folds, with the samples shuffled before splitting
cv = StratifiedKFold(n_splits=3, shuffle=True)
# report cross-validated auc / accuracy / recall for an estimator
def estimate(estimator, name='estimator'):
    auc = cross_val_score(estimator, X_train, y_train, scoring='roc_auc', cv=cv).mean()
    accuracy = cross_val_score(estimator, X_train, y_train, scoring='accuracy', cv=cv).mean()
    recall = cross_val_score(estimator, X_train, y_train, scoring='recall', cv=cv).mean()
    print("{}: auc:{:f}, recall:{:f}, accuracy:{:f}".format(name, auc, recall, accuracy))
    # skplt.plot_learning_curve(estimator, X_train, y_train)
    # plt.show()
    # estimator.fit(X_train, y_train)
    # y_probas = estimator.predict_proba(X_train)
    # skplt.plot_roc_curve(y_true=y_train, y_probas=y_probas)
    # plt.show()
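As written, estimate() runs three separate cross-validations, one per metric. An equivalent single-pass variant (a small sketch with a hypothetical helper name, not the original function) uses cross_validate with a list of scorers:

from sklearn.model_selection import cross_validate

# hypothetical drop-in alternative to estimate(): one CV run, three metrics
def estimate_once(estimator, name='estimator'):
    scores = cross_validate(estimator, X_train, y_train, cv=cv,
                            scoring=['roc_auc', 'recall', 'accuracy'])
    print('{}: auc:{:f}, recall:{:f}, accuracy:{:f}'.format(
        name,
        scores['test_roc_auc'].mean(),
        scores['test_recall'].mean(),
        scores['test_accuracy'].mean()))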
estimate(XGBClassifier(learning_rate=0.1, n_estimators=20, objective='binary:logistic'), 'XGBClassifier')
estimate(RidgeClassifier(), 'RidgeClassifier')
estimate(LogisticRegression(), 'LogisticRegression')
# estimate(RandomForestClassifier(), 'RandomForestClassifier')
estimate(AdaBoostClassifier(), 'AdaBoostClassifier')
# estimate(SVC(), 'SVC')# too long to wait
# estimate(LinearSVC(), 'LinearSVC')
# XGBClassifier: auc:0.747668, recall:0.000000, accuracy:0.944575
# RidgeClassifier: auc:0.754218, recall:0.000000, accuracy:0.944433
# LogisticRegression: auc:0.758454, recall:0.015424, accuracy:0.942010
# AdaBoostClassifier: auc:0.784086, recall:0.013495, accuracy:0.943791
XGBClassifier: auc:0.755890, recall:0.000000, accuracy:0.944575
RidgeClassifier: auc:0.753939, recall:0.000000, accuracy:0.944575
LogisticRegression: auc:0.759646, recall:0.022494, accuracy:0.942438
AdaBoostClassifier: auc:0.792333, recall:0.017988, accuracy:0.943827
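Near-zero recall next to ~94.4% accuracy means these classifiers almost never predict the positive class at the default 0.5 threshold, which is expected when only about 5–6% of the training samples are defaults; AUC is threshold-free, so it is the number that matters for the competition. If recall itself were of interest, class weighting is one standard adjustment (a sketch, not part of the original notebook):

# check the class imbalance that explains the recall numbers
print(y_train.value_counts(normalize=True))

# re-weight the minority class instead of relying on the default 0.5 threshold
estimate(LogisticRegression(solver='lbfgs', max_iter=1000, class_weight='balanced'),
         'LogisticRegression(balanced)')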
VotingClassifier
from sklearn.ensemble import VotingClassifier
estimators = []
# RidgeClassifier has no predict_proba, so it cannot take part in soft voting
# estimators.append(('RidgeClassifier', RidgeClassifier()))
estimators.append(('LogisticRegression', LogisticRegression()))
estimators.append(('XGBClassifier', XGBClassifier(learning_rate=0.1, n_estimators=20, objective='binary:logistic')))
estimators.append(('AdaBoostClassifier', AdaBoostClassifier()))
# estimators.append(('RandomForestClassifier', RandomForestClassifier()))
# previous run: voting: auc:0.794587, recall:0.000642, accuracy:0.944433
# soft voting averages the members' predict_proba outputs
voting = VotingClassifier(estimators=estimators, voting='soft')
estimate(voting, 'voting')
voting: auc:0.790281, recall:0.000642, accuracy:0.944361
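What remains is scoring the test set. A minimal sketch, assuming the test files have been put through exactly the same cleaning, aggregation, merging, dummy-encoding and scaling steps as the training data, and that the resulting matrix is called X_test_ with test_idx holding the corresponding Idx values (both names are hypothetical):

# fit the ensemble on the full training data
voting.fit(X_train, y_train)

# the probability of default is the submission score (higher = more likely to default)
test_scores = voting.predict_proba(X_test_)[:, 1]  # X_test_ is assumed to be prepared like X_train

submission = DataFrame({'Idx': test_idx, 'score': test_scores})  # test_idx: Idx column kept aside before dropping it
submission.to_csv('submission.csv', index=False)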