Data Preprocessing
Task 1: explore and analyze the data. Time allotted: 2 days
- Analyze the data types
- Remove irrelevant features
- Convert data types
- Handle missing values
- Plus any other analysis and processing you can think of or borrow
Requirement: split the data 70/30, with 70% as the training set and 30% as the test set, and set the random seed to 2018
# Import the required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.model_selection import train_test_split
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Read in the data and inspect it
data = pd.read_csv('data.csv', encoding='gbk')
print(data.iloc[0])
print('===================')
print(data.shape[1])
Unnamed: 0 5
custid 2791858
trade_no 20180507115231274000000023057383
bank_card_no 卡号1
low_volume_percent 0.01
middle_volume_percent 0.99
take_amount_in_later_12_month_highest 0
trans_amount_increase_rate_lately 0.9
trans_activity_month 0.55
trans_activity_day 0.313
transd_mcc 17
trans_days_interval_filter 27
trans_days_interval 26
regional_mobility 3
student_feature NaN
repayment_capability 19890
is_high_user 0
number_of_trans_from_2011 30
first_transaction_time 2.01308e+07
historical_trans_amount 149050
historical_trans_day 151
rank_trad_1_month 0.4
trans_amount_3_month 34030
avg_consume_less_12_valid_month 7
abs 3920
top_trans_count_last_1_month 0.15
avg_price_last_12_month 1020
avg_price_top_last_12_valid_month 0.55
reg_preference_for_trad 一线城市
trans_top_time_last_1_month 4
...
loans_credibility_behavior 73
loans_count 37
loans_settle_count 34
loans_overdue_count 2
loans_org_count_behavior 10
consfin_org_count_behavior 1
loans_cash_count 9
latest_one_month_loan 1
latest_three_month_loan 1
latest_six_month_loan 13
history_suc_fee 37
history_fail_fee 7
latest_one_month_suc 1
latest_one_month_fail 0
loans_long_time 341
loans_latest_time 2018-04-19
loans_credit_limit 2200
loans_credibility_limit 72
loans_org_count_current 9
loans_product_count 10
loans_max_limit 2900
loans_avg_limit 1688
consfin_credit_limit 1200
consfin_credibility 75
consfin_org_count_current 1
consfin_product_count 2
consfin_max_limit 1200
consfin_avg_limit 1200
latest_query_day 12
loans_latest_day 18
Name: 0, dtype: object
===================
90
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 90 columns):
Unnamed: 0 4754 non-null int64
custid 4754 non-null int64
trade_no 4754 non-null object
bank_card_no 4754 non-null object
low_volume_percent 4752 non-null float64
middle_volume_percent 4752 non-null float64
take_amount_in_later_12_month_highest 4754 non-null int64
trans_amount_increase_rate_lately 4751 non-null float64
trans_activity_month 4752 non-null float64
trans_activity_day 4752 non-null float64
transd_mcc 4752 non-null float64
trans_days_interval_filter 4746 non-null float64
trans_days_interval 4752 non-null float64
regional_mobility 4752 non-null float64
student_feature 1756 non-null float64
repayment_capability 4754 non-null int64
is_high_user 4754 non-null int64
number_of_trans_from_2011 4752 non-null float64
first_transaction_time 4752 non-null float64
historical_trans_amount 4754 non-null int64
historical_trans_day 4752 non-null float64
rank_trad_1_month 4752 non-null float64
trans_amount_3_month 4754 non-null int64
avg_consume_less_12_valid_month 4752 non-null float64
abs 4754 non-null int64
top_trans_count_last_1_month 4752 non-null float64
avg_price_last_12_month 4754 non-null int64
avg_price_top_last_12_valid_month 4650 non-null float64
reg_preference_for_trad 4752 non-null object
trans_top_time_last_1_month 4746 non-null float64
trans_top_time_last_6_month 4746 non-null float64
consume_top_time_last_1_month 4746 non-null float64
consume_top_time_last_6_month 4746 non-null float64
cross_consume_count_last_1_month 4328 non-null float64
trans_fail_top_count_enum_last_1_month 4738 non-null float64
trans_fail_top_count_enum_last_6_month 4738 non-null float64
trans_fail_top_count_enum_last_12_month 4738 non-null float64
consume_mini_time_last_1_month 4728 non-null float64
max_cumulative_consume_later_1_month 4754 non-null int64
max_consume_count_later_6_month 4746 non-null float64
railway_consume_count_last_12_month 4742 non-null float64
pawns_auctions_trusts_consume_last_1_month 4754 non-null int64
pawns_auctions_trusts_consume_last_6_month 4754 non-null int64
jewelry_consume_count_last_6_month 4742 non-null float64
status 4754 non-null int64
source 4754 non-null object
first_transaction_day 4752 non-null float64
trans_day_last_12_month 4752 non-null float64
id_name 4478 non-null object
apply_score 4450 non-null float64
apply_credibility 4450 non-null float64
query_org_count 4450 non-null float64
query_finance_count 4450 non-null float64
query_cash_count 4450 non-null float64
query_sum_count 4450 non-null float64
latest_query_time 4450 non-null object
latest_one_month_apply 4450 non-null float64
latest_three_month_apply 4450 non-null float64
latest_six_month_apply 4450 non-null float64
loans_score 4457 non-null float64
loans_credibility_behavior 4457 non-null float64
loans_count 4457 non-null float64
loans_settle_count 4457 non-null float64
loans_overdue_count 4457 non-null float64
loans_org_count_behavior 4457 non-null float64
consfin_org_count_behavior 4457 non-null float64
loans_cash_count 4457 non-null float64
latest_one_month_loan 4457 non-null float64
latest_three_month_loan 4457 non-null float64
latest_six_month_loan 4457 non-null float64
history_suc_fee 4457 non-null float64
history_fail_fee 4457 non-null float64
latest_one_month_suc 4457 non-null float64
latest_one_month_fail 4457 non-null float64
loans_long_time 4457 non-null float64
loans_latest_time 4457 non-null object
loans_credit_limit 4457 non-null float64
loans_credibility_limit 4457 non-null float64
loans_org_count_current 4457 non-null float64
loans_product_count 4457 non-null float64
loans_max_limit 4457 non-null float64
loans_avg_limit 4457 non-null float64
consfin_credit_limit 4457 non-null float64
consfin_credibility 4457 non-null float64
consfin_org_count_current 4457 non-null float64
consfin_product_count 4457 non-null float64
consfin_max_limit 4457 non-null float64
consfin_avg_limit 4457 non-null float64
latest_query_day 4450 non-null float64
loans_latest_day 4457 non-null float64
dtypes: float64(70), int64(13), object(7)
memory usage: 3.3+ MB
The output shows a large variety of columns, so the irrelevant ones should be removed.
Column-by-column notes:
Unnamed: 0: non-consecutive row index (some rows were likely deleted), drop
custid: customer ID, drop
trade_no: the leading digits encode a timestamp, but the raw string is not usable as a feature, drop
bank_card_no: constant value '卡号1', drop
student_feature: very high missing rate (only 1756 of 4754 non-null), drop
source: constant value 'xs', drop
id_name, latest_query_time, loans_latest_time: name and raw date columns, also dropped here
useless_columns = ['Unnamed: 0','custid','trade_no','bank_card_no','id_name',
'student_feature', 'source','latest_query_time','loans_latest_time']
data = data.drop(useless_columns,axis=1)
print(data.iloc[0])
print(data.shape[1])
low_volume_percent 0.01
middle_volume_percent 0.99
take_amount_in_later_12_month_highest 0
trans_amount_increase_rate_lately 0.9
trans_activity_month 0.55
trans_activity_day 0.313
transd_mcc 17
trans_days_interval_filter 27
trans_days_interval 26
regional_mobility 3
repayment_capability 19890
is_high_user 0
number_of_trans_from_2011 30
first_transaction_time 2.01308e+07
historical_trans_amount 149050
historical_trans_day 151
rank_trad_1_month 0.4
trans_amount_3_month 34030
avg_consume_less_12_valid_month 7
abs 3920
top_trans_count_last_1_month 0.15
avg_price_last_12_month 1020
avg_price_top_last_12_valid_month 0.55
reg_preference_for_trad 一线城市
trans_top_time_last_1_month 4
trans_top_time_last_6_month 19
consume_top_time_last_1_month 4
consume_top_time_last_6_month 19
cross_consume_count_last_1_month 1
trans_fail_top_count_enum_last_1_month 1
...
loans_score 552
loans_credibility_behavior 73
loans_count 37
loans_settle_count 34
loans_overdue_count 2
loans_org_count_behavior 10
consfin_org_count_behavior 1
loans_cash_count 9
latest_one_month_loan 1
latest_three_month_loan 1
latest_six_month_loan 13
history_suc_fee 37
history_fail_fee 7
latest_one_month_suc 1
latest_one_month_fail 0
loans_long_time 341
loans_credit_limit 2200
loans_credibility_limit 72
loans_org_count_current 9
loans_product_count 10
loans_max_limit 2900
loans_avg_limit 1688
consfin_credit_limit 1200
consfin_credibility 75
consfin_org_count_current 1
consfin_product_count 2
consfin_max_limit 1200
consfin_avg_limit 1200
latest_query_day 12
loans_latest_day 18
Name: 0, dtype: object
81
data.head(10).T
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
low_volume_percent | 0.01 | 0.02 | 0.04 | 0 | 0.01 | 0.02 | 0.02 | 0.02 | 0.03 | 0.01 |
middle_volume_percent | 0.99 | 0.94 | 0.96 | 0.96 | 0.99 | 0.98 | 0.98 | 0.98 | 0.65 | 0.99 |
take_amount_in_later_12_month_highest | 0 | 2000 | 0 | 2000 | 0 | 2000 | 0 | 0 | 0 | 500 |
trans_amount_increase_rate_lately | 0.9 | 1.28 | 1 | 0.13 | 0.46 | 7.59 | 23.67 | 0.25 | 0.31 | 0.8 |
trans_activity_month | 0.55 | 1 | 1 | 0.57 | 1 | 1 | 0.94 | 0.88 | 0.76 | 1 |
trans_activity_day | 0.313 | 0.458 | 0.114 | 0.777 | 0.175 | 0.733 | 0.087 | 0.302 | 0.472 | 0.088 |
transd_mcc | 17 | 19 | 13 | 22 | 13 | 27 | 10 | 19 | 15 | 15 |
trans_days_interval_filter | 27 | 30 | 68 | 14 | 66 | 8 | 54 | 20 | 21 | 36 |
trans_days_interval | 26 | 14 | 22 | 6 | 42 | 11 | 53 | 20 | 14 | 35 |
regional_mobility | 3 | 4 | 1 | 3 | 1 | 3 | 2 | 2 | 2 | 2 |
repayment_capability | 19890 | 16970 | 9710 | 6210 | 11150 | 8420 | 11220 | 8220 | 69800 | 12510 |
is_high_user | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
number_of_trans_from_2011 | 30 | 23 | 9 | 33 | 12 | 18 | 15 | 30 | 28 | 14 |
first_transaction_time | 2.01308e+07 | 2.01604e+07 | 2.01706e+07 | 2.01305e+07 | 2.01703e+07 | 2.01609e+07 | 2.01611e+07 | 2.01505e+07 | 2.01502e+07 | 2.01701e+07 |
historical_trans_amount | 149050 | 302910 | 11520 | 491130 | 61470 | 416670 | 34140 | 144600 | 172290 | 37250 |
historical_trans_day | 151 | 224 | 31 | 360 | 63 | 317 | 50 | 198 | 253 | 44 |
rank_trad_1_month | 0.4 | 0.35 | 1 | 0.15 | 0.65 | 0.2 | 1 | 0.4 | 0.35 | 0.75 |
trans_amount_3_month | 34030 | 10590 | 5710 | 91690 | 9770 | 78890 | 3080 | 13930 | 21760 | 6640 |
avg_consume_less_12_valid_month | 7 | 5 | 5 | 7 | 6 | 6 | 6 | 8 | 8 | 7 |
abs | 3920 | 6950 | 840 | 46850 | 760 | 22770 | 110 | 880 | 99950 | 110 |
top_trans_count_last_1_month | 0.15 | 0.05 | 0.65 | 0.05 | 1 | 0.05 | 1 | 0.1 | 0.05 | 1 |
avg_price_last_12_month | 1020 | 1210 | 570 | 1290 | 1110 | 1160 | 1250 | 920 | 860 | 1110 |
avg_price_top_last_12_valid_month | 0.55 | 0.5 | 0.65 | 0.45 | 0.5 | 0.5 | NaN | 0.55 | 0.6 | 0.5 |
reg_preference_for_trad | 一线城市 | 一线城市 | 一线城市 | 三线城市 | 一线城市 | 三线城市 | 一线城市 | 一线城市 | 三线城市 | 一线城市 |
trans_top_time_last_1_month | 4 | 13 | 0 | 6 | 0 | 4 | 0 | 6 | 10 | 0 |
trans_top_time_last_6_month | 19 | 30 | 68 | 8 | 66 | 7 | 54 | 20 | 10 | 36 |
consume_top_time_last_1_month | 4 | 13 | 0 | 6 | 0 | 4 | 0 | 6 | 10 | 0 |
consume_top_time_last_6_month | 19 | 30 | 68 | 8 | 66 | 12 | 54 | 20 | 10 | 36 |
cross_consume_count_last_1_month | 1 | 0 | 0 | 0 | 0 | 1 | NaN | 0 | 2 | 0 |
trans_fail_top_count_enum_last_1_month | 1 | 0 | 3 | 1 | 3 | 0 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
loans_score | 552 | 635 | 633 | 542 | 479 | 676 | 612 | NaN | 451 | 589 |
loans_credibility_behavior | 73 | 76 | 83 | 75 | 73 | 75 | 75 | NaN | 78 | 74 |
loans_count | 37 | 37 | 4 | 85 | 37 | 60 | 7 | NaN | 24 | 13 |
loans_settle_count | 34 | 36 | 2 | 81 | 32 | 55 | 4 | NaN | 11 | 12 |
loans_overdue_count | 2 | 0 | 0 | 4 | 6 | 0 | 0 | NaN | 11 | 0 |
loans_org_count_behavior | 10 | 17 | 3 | 22 | 12 | 22 | 3 | NaN | 10 | 5 |
consfin_org_count_behavior | 1 | 5 | 1 | 5 | 2 | 10 | 2 | NaN | 8 | 0 |
loans_cash_count | 9 | 12 | 2 | 17 | 10 | 12 | 1 | NaN | 2 | 5 |
latest_one_month_loan | 1 | 2 | 2 | 2 | 0 | 1 | 2 | NaN | 3 | 0 |
latest_three_month_loan | 1 | 2 | 2 | 4 | 0 | 1 | 3 | NaN | 10 | 0 |
latest_six_month_loan | 13 | 8 | 4 | 34 | 10 | 25 | 3 | NaN | 16 | 6 |
history_suc_fee | 37 | 49 | 2 | 91 | 36 | 78 | 9 | NaN | 22 | 13 |
history_fail_fee | 7 | 4 | 2 | 26 | 25 | 12 | 6 | NaN | 53 | 3 |
latest_one_month_suc | 1 | 2 | 1 | 2 | 0 | 4 | 1 | NaN | 0 | 0 |
latest_one_month_fail | 0 | 1 | 1 | 0 | 0 | 0 | 2 | NaN | 21 | 0 |
loans_long_time | 341 | 353 | 157 | 355 | 360 | 360 | 312 | NaN | 316 | 312 |
loans_credit_limit | 2200 | 2000 | 1500 | 1800 | 1800 | 2600 | 2200 | NaN | 4700 | 1900 |
loans_credibility_limit | 72 | 74 | 77 | 74 | 72 | 72 | 76 | NaN | 77 | 74 |
loans_org_count_current | 9 | 12 | 2 | 17 | 10 | 12 | 1 | NaN | 2 | 5 |
loans_product_count | 10 | 12 | 2 | 18 | 10 | 13 | 1 | NaN | 2 | 5 |
loans_max_limit | 2900 | 3500 | 1600 | 3200 | 2300 | 5300 | 2200 | NaN | 5300 | 2800 |
loans_avg_limit | 1688 | 1758 | 1250 | 1541 | 1630 | 1941 | 2200 | NaN | 4750 | 1520 |
consfin_credit_limit | 1200 | 15100 | 4200 | 16300 | 8300 | 11200 | 7600 | NaN | 5500 | 0 |
consfin_credibility | 75 | 80 | 87 | 80 | 79 | 80 | 73 | NaN | 79 | 0 |
consfin_org_count_current | 1 | 5 | 1 | 5 | 2 | 10 | 2 | NaN | 8 | 0 |
consfin_product_count | 2 | 6 | 1 | 5 | 2 | 12 | 2 | NaN | 11 | 0 |
consfin_max_limit | 1200 | 22800 | 4200 | 30000 | 8400 | 20400 | 16800 | NaN | 19200 | 0 |
consfin_avg_limit | 1200 | 9360 | 4200 | 12180 | 8250 | 8130 | 8900 | NaN | 7987 | 0 |
latest_query_day | 12 | 4 | 2 | 2 | 22 | 3 | 1 | NaN | 24 | 18 |
loans_latest_day | 18 | 2 | 6 | 4 | 120 | 4 | 3 | NaN | 7 | 142 |
81 rows × 10 columns
data['status'].value_counts()
0 3561
1 1193
Name: status, dtype: int64
# Drop columns whose non-null values are all identical
orig_columns = data.columns
drop_columns = []
for col in orig_columns:
col_series = data[col].dropna().unique()
if len(col_series) == 1:
drop_columns.append(col)
data = data.drop(drop_columns, axis=1)
print(drop_columns)
[]
# Count missing values per column
null_counts = data.isnull().sum()
null_counts
low_volume_percent 2
middle_volume_percent 2
take_amount_in_later_12_month_highest 0
trans_amount_increase_rate_lately 3
trans_activity_month 2
trans_activity_day 2
transd_mcc 2
trans_days_interval_filter 8
trans_days_interval 2
regional_mobility 2
repayment_capability 0
is_high_user 0
number_of_trans_from_2011 2
first_transaction_time 2
historical_trans_amount 0
historical_trans_day 2
rank_trad_1_month 2
trans_amount_3_month 0
avg_consume_less_12_valid_month 2
abs 0
top_trans_count_last_1_month 2
avg_price_last_12_month 0
avg_price_top_last_12_valid_month 104
reg_preference_for_trad 2
trans_top_time_last_1_month 8
trans_top_time_last_6_month 8
consume_top_time_last_1_month 8
consume_top_time_last_6_month 8
cross_consume_count_last_1_month 426
trans_fail_top_count_enum_last_1_month 16
...
loans_score 297
loans_credibility_behavior 297
loans_count 297
loans_settle_count 297
loans_overdue_count 297
loans_org_count_behavior 297
consfin_org_count_behavior 297
loans_cash_count 297
latest_one_month_loan 297
latest_three_month_loan 297
latest_six_month_loan 297
history_suc_fee 297
history_fail_fee 297
latest_one_month_suc 297
latest_one_month_fail 297
loans_long_time 297
loans_credit_limit 297
loans_credibility_limit 297
loans_org_count_current 297
loans_product_count 297
loans_max_limit 297
loans_avg_limit 297
consfin_credit_limit 297
consfin_credibility 297
consfin_org_count_current 297
consfin_product_count 297
consfin_max_limit 297
consfin_avg_limit 297
latest_query_day 304
loans_latest_day 297
dtype: int64
# Count columns by dtype
data.dtypes.value_counts()
float64 69
int64 11
object 1
dtype: int64
# Handle non-numeric columns
object_columns_df = data.select_dtypes(include=['object'])
object_columns_df.iloc[0]
reg_preference_for_trad 一线城市
Name: 0, dtype: object
data['reg_preference_for_trad'].unique()
array(['一线城市', '三线城市', '境外', '二线城市', '其他城市', nan], dtype=object)
# Map the city categories to integers
mapping_dict = {
'reg_preference_for_trad': {
'一线城市': 1,
'二线城市': 2,
'三线城市': 3,
'其他城市': 4,
'境外': 0
}
}
data = data.replace(mapping_dict)
data['reg_preference_for_trad'].unique()
array([ 1., 3., 0., 2., 4., nan])
data.head(10)
low_volume_percent | middle_volume_percent | take_amount_in_later_12_month_highest | trans_amount_increase_rate_lately | trans_activity_month | trans_activity_day | transd_mcc | trans_days_interval_filter | trans_days_interval | regional_mobility | ... | loans_max_limit | loans_avg_limit | consfin_credit_limit | consfin_credibility | consfin_org_count_current | consfin_product_count | consfin_max_limit | consfin_avg_limit | latest_query_day | loans_latest_day | |
0 | 0.01 | 0.99 | 0 | 0.90 | 0.55 | 0.313 | 17.0 | 27.0 | 26.0 | 3.0 | ... | 2900.0 | 1688.0 | 1200.0 | 75.0 | 1.0 | 2.0 | 1200.0 | 1200.0 | 12.0 | 18.0 |
1 | 0.02 | 0.94 | 2000 | 1.28 | 1.00 | 0.458 | 19.0 | 30.0 | 14.0 | 4.0 | ... | 3500.0 | 1758.0 | 15100.0 | 80.0 | 5.0 | 6.0 | 22800.0 | 9360.0 | 4.0 | 2.0 |
2 | 0.04 | 0.96 | 0 | 1.00 | 1.00 | 0.114 | 13.0 | 68.0 | 22.0 | 1.0 | ... | 1600.0 | 1250.0 | 4200.0 | 87.0 | 1.0 | 1.0 | 4200.0 | 4200.0 | 2.0 | 6.0 |
3 | 0.00 | 0.96 | 2000 | 0.13 | 0.57 | 0.777 | 22.0 | 14.0 | 6.0 | 3.0 | ... | 3200.0 | 1541.0 | 16300.0 | 80.0 | 5.0 | 5.0 | 30000.0 | 12180.0 | 2.0 | 4.0 |
4 | 0.01 | 0.99 | 0 | 0.46 | 1.00 | 0.175 | 13.0 | 66.0 | 42.0 | 1.0 | ... | 2300.0 | 1630.0 | 8300.0 | 79.0 | 2.0 | 2.0 | 8400.0 | 8250.0 | 22.0 | 120.0 |
5 | 0.02 | 0.98 | 2000 | 7.59 | 1.00 | 0.733 | 27.0 | 8.0 | 11.0 | 3.0 | ... | 5300.0 | 1941.0 | 11200.0 | 80.0 | 10.0 | 12.0 | 20400.0 | 8130.0 | 3.0 | 4.0 |
6 | 0.02 | 0.98 | 0 | 23.67 | 0.94 | 0.087 | 10.0 | 54.0 | 53.0 | 2.0 | ... | 2200.0 | 2200.0 | 7600.0 | 73.0 | 2.0 | 2.0 | 16800.0 | 8900.0 | 1.0 | 3.0 |
7 | 0.02 | 0.98 | 0 | 0.25 | 0.88 | 0.302 | 19.0 | 20.0 | 20.0 | 2.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | 0.03 | 0.65 | 0 | 0.31 | 0.76 | 0.472 | 15.0 | 21.0 | 14.0 | 2.0 | ... | 5300.0 | 4750.0 | 5500.0 | 79.0 | 8.0 | 11.0 | 19200.0 | 7987.0 | 24.0 | 7.0 |
9 | 0.01 | 0.99 | 500 | 0.80 | 1.00 | 0.088 | 15.0 | 36.0 | 35.0 | 2.0 | ... | 2800.0 | 1520.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 18.0 | 142.0 |
10 rows × 81 columns
# Split into training and test sets
y = data['status']
x = data.drop(['status'], axis=1)
X_train,X_test,y_train,y_test=train_test_split(x, y, test_size=0.3, random_state=2018)
print("Training Size:{}".format(X_train.shape))
print('Testing Size:{}'.format(X_test.shape))
Training Size:(3327, 80)
Testing Size:(1427, 80)
Discussion
1. Missing data
- Why do missing values need to be handled?
- Above what missing rate should a feature be discarded?
- What methods are there for filling missing values?
- What are the pros and cons of mean filling?
- By what criteria should a method be chosen?
1 Why do missing values need to be handled?
How missing values are handled affects feature extraction, modeling, and training to some degree. A feature that is mostly missing can simply be dropped; kept and filled badly, it may introduce noise. When only a few values are missing, dropping the feature loses information, so an appropriate filling strategy should be used instead.
2 Above what missing rate should a feature be discarded?
A common rule of thumb is 70%, but the feature's importance to the prediction target should also be weighed.
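The 70% rule of thumb can be applied mechanically; a minimal sketch on a toy frame (standing in for the real data.csv):

```python
import numpy as np
import pandas as pd

# Toy frame: column 'b' is 75% missing, column 'a' is complete.
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 4.0],
})

threshold = 0.7                      # rule-of-thumb missing-rate cutoff
missing_rate = df.isnull().mean()    # per-column fraction of NaN
to_drop = missing_rate[missing_rate > threshold].index.tolist()
df = df.drop(columns=to_drop)
```

In practice one would still inspect `to_drop` before dropping, since a mostly-missing feature can occasionally be highly predictive.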
3 What methods are there for filling missing values?
(1) Value filling
- A special value, or the mean, median, mode, etc.
(2) Imputation methods
- Random imputation: draw a random sample from the population to stand in for the missing one
- Multiple imputation: predict the missing data from relationships between variables, generate several complete data sets via Monte Carlo simulation, analyze each, and pool the results
- Hot-deck imputation: find a similar complete record (a matching donor) and use its observed value to fill the gap
- Lagrange or Newton interpolation
4 What are the pros and cons of mean filling?
Drawback: it substantially reduces the variance, i.e. the randomness, of the data.
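The variance shrinkage is easy to verify numerically; a small sketch on toy values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan])
filled = s.fillna(s.mean())   # mean of the observed values is 3.0

var_before = s.var()          # variance over the observed entries only
var_after = filled.var()      # smaller: every fill sits exactly at the mean
```

The filled values contribute zero squared deviation while still inflating the denominator, so the sample variance always drops.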
5 By what criteria should a method be chosen?
(1) Deletion
- If a row or column is missing beyond a certain proportion, consider dropping the whole row or column
(2) Imputation
- Column-wise: fill continuous features with the mean and discrete features with the mode
- Row-wise: bring in a predictive model, e.g. an auxiliary regression that predicts missing entries from the relationships among variables
(3) Leave as-is
- When the missing values barely affect the model, mine the data with the gaps in place; many models tolerate missing values or handle them flexibly, e.g. KNN, decision trees and random forests, neural networks, naive Bayes, and DBSCAN
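A minimal pandas sketch of the column-wise rule (mean for continuous features, mode for discrete ones), on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'amount': [100.0, np.nan, 300.0],        # continuous -> fill with mean
    'city':   ['一线城市', None, '一线城市'],  # discrete   -> fill with mode
})

df['amount'] = df['amount'].fillna(df['amount'].mean())
df['city'] = df['city'].fillna(df['city'].mode()[0])
```

`mode()` returns a Series (there can be ties), hence the `[0]` to pick one value.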
2. Data exploration
With few fields, plotting is the usual way to inspect distributions directly and decide how to treat each field; with many fields, plotting them one by one is too slow, so in what order and by what logic should the fields be processed?
This situation is common in banking, and no doubt elsewhere. In my experience, production tables can exceed 300 fields, and deciding which to keep is genuinely tricky. My approach: first find the fields with highly repetitive values, which a SQL GROUP BY reveals directly; then run each feature through SPSS, which offers both plots and simple analyses. These two steps cover perhaps 70-80% of the work, which is essentially enough for the exploration stage.
3. Time series
How should a time series be handled? Besides extracting the number of days, what else can be done?
It depends on the situation: look at how much the time field, alone or in combination with other fields, contributes to the analysis target, and extract features accordingly. For example, combine the time field with other attributes and examine feature frequencies per day, week, month, or on particular weekdays. Ultimately it comes down to the goal of the analysis.
Beyond treating the time series as a standalone feature, it is usually analyzed jointly with other time-related features, to see how the combined feature affects the quantity of interest.
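As a sketch of what could be extracted from a date column such as loans_latest_time (dropped above); the reference date here is an arbitrary assumption, not something from the data:

```python
import pandas as pd

# Two sample dates in the same format as loans_latest_time
t = pd.to_datetime(pd.Series(['2018-04-19', '2018-03-02']))

features = pd.DataFrame({
    'year':     t.dt.year,
    'month':    t.dt.month,
    'weekday':  t.dt.dayofweek,                       # Monday = 0
    'days_ago': (pd.Timestamp('2018-05-01') - t).dt.days,  # assumed cutoff
})
```

Recency features like `days_ago` often matter more to credit models than the raw date itself.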
4. Outliers
How do we identify outliers, and should they be deleted or replaced (e.g. outliers caused by manual data-entry errors)?
- Quantile-based rules
- Deviation from the mean beyond a few standard deviations
- The LOF algorithm
- A box plot alongside describe()
Most parametric statistics (mean, standard deviation, correlation coefficients, etc.) and the analyses built on them are highly sensitive to outliers, so their presence can heavily distort the results.
Reference: https://www.jianshu.com/p/0c967a1526ef
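A sketch of the quantile (IQR) rule on toy values, with clipping shown as one alternative to deletion:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 300])   # 300: likely an entry error

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey's fences
outliers = s[(s < lower) | (s > upper)]

# Instead of deleting, clip (winsorize) values to the fences:
clipped = s.clip(lower, upper)
```

Clipping keeps the row (and its other features) while bounding the distortion the extreme value causes.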
5. Encoding categorical data
The city categories here were clearly not one-hot encoded; and when text features are re-encoded with LabelBinarizer from sklearn.preprocessing, how is that encoding applied at prediction time?
one-hot encoding: for categories with no relationship to each other
ordinal (label) encoding: for features with an inherent order
Bear in mind the encoded values are just numbers, and a model effectively computes distances between the categories, so judge from the actual situation whether a distance relationship between the categories truly exists.
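A sketch contrasting the two choices on the city feature; the ordinal map mirrors the mapping_dict used earlier, and whether that order is meaningful is a modeling assumption:

```python
import pandas as pd

s = pd.Series(['一线城市', '三线城市', '二线城市', '一线城市'])

# One-hot: one indicator column per category, no distances implied
onehot = pd.get_dummies(s, prefix='city')

# Ordinal map: encodes an assumed order (and hence distances)
ordinal = s.map({'一线城市': 1, '二线城市': 2, '三线城市': 3})
```

At prediction time the same mapping (or the fitted encoder object) must be reused so new data lands in the same columns.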
6. Class imbalance
Real-world examples:
1. Credit card fraud: perhaps one fraud case per several hundred or thousand customers
2. Click-through rates on web ads: an even more extreme ratio
3. Misdiagnosis in medicine
Handling imbalanced data:
- Change the data volumes so the classes become balanced
- Keep the data unchanged and constrain the model via a cost matrix or cost-sensitive loss
The questions above were raised and discussed by the group, with much gained; I hope to keep joining such discussion and learning.
Materials compiled by the organizer:
Missing value handling methods; data encoding: https://mp.weixin.qq.com/s/U93vvFwZ8vSJuswk24yc6w
Imbalanced data
what?
It usually arises in classification problems where the class sizes differ greatly,
e.g. 100,000 positive examples against 100 negative ones.
why?
Why it occurs:
- Too little data collected, so coverage is incomplete
- It is inherent to the data set, as with credit card fraud, customer complaints, or machine failures: perhaps only 1 customer in 1,000 commits fraud.
Why it must be handled:
- Class imbalance easily skews model predictions. On credit card fraud, a decision tree trained directly may reach 99.99% accuracy just by predicting "no fraud" for everyone, a result with no practical value.
- So with imbalanced classes, we need strategies to adjust the data.
how?
Goals:
- Change the data volumes so the class sizes are roughly equal (positives ≈ negatives)
- Keep the data unchanged and offset the imbalance through different models or evaluation methods
Strategies:
- 1. Enlarge the data set, in hopes of capturing more of the distribution; more data also makes the later resampling easier.
- 2. Reframe the problem: treat the minority-class samples as outliers, turning the task into anomaly detection or change detection.
Anomaly detection identifies rare events, such as machine failures from component vibrations, or malware from system-call sequences; these events are rare relative to normal operation.
Change detection is similar, but identifies unusual shifts in a trend, such as abnormal changes in user behavior seen in usage patterns or bank transactions.
Sampling methods
Increase the minority class (oversampling):
① Simple random duplication.
② SMOTE: synthesize new samples within the class boundary.
Decrease the majority class (undersampling):
① Simple random sampling.
② Sampling with boundary cleaning, preserving as much of the distribution as possible.
Ensemble approach:
Partition the majority class, combine each partition with the minority class into several small training sets, train a learner on each, and finally ensemble them.
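The two simple random variants above can be sketched with plain pandas (SMOTE and cleaned undersampling would typically come from imblearn; only the random versions are shown here):

```python
import pandas as pd

df = pd.DataFrame({'x': range(12),
                   'y': [0] * 10 + [1] * 2})   # 10 majority, 2 minority

majority = df[df['y'] == 0]
minority = df[df['y'] == 1]

# Oversampling: duplicate minority rows (with replacement) up to majority size
over = pd.concat([majority,
                  minority.sample(len(majority), replace=True,
                                  random_state=2018)])

# Undersampling: subsample the majority down to the minority size
under = pd.concat([majority.sample(len(minority), random_state=2018),
                   minority])
```

Resampling should be done on the training split only, after train_test_split, so the test set keeps the original class ratio.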