数据预处理

任务1:对数据进行探索和分析。时间:2天
数据类型的分析
无关特征删除
数据类型转换
缺失值处理
以及你能想到和借鉴的数据分析处理
要求:数据切分方式 - 三七分,其中测试集30%,训练集70%,随机种子设置为2018

# 导入需要的包
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.model_selection import train_test_split
# 忽略警告
import warnings
warnings.filterwarnings('ignore')
# 读入数据并查看数据信息
data = pd.read_csv('data.csv', encoding='gbk')
print(data.iloc[0])
print('===================')
print(data.shape[1])
Unnamed: 0                                                              5
custid                                                            2791858
trade_no                                 20180507115231274000000023057383
bank_card_no                                                          卡号1
low_volume_percent                                                   0.01
middle_volume_percent                                                0.99
take_amount_in_later_12_month_highest                                   0
trans_amount_increase_rate_lately                                     0.9
trans_activity_month                                                 0.55
trans_activity_day                                                  0.313
transd_mcc                                                             17
trans_days_interval_filter                                             27
trans_days_interval                                                    26
regional_mobility                                                       3
student_feature                                                       NaN
repayment_capability                                                19890
is_high_user                                                            0
number_of_trans_from_2011                                              30
first_transaction_time                                        2.01308e+07
historical_trans_amount                                            149050
historical_trans_day                                                  151
rank_trad_1_month                                                     0.4
trans_amount_3_month                                                34030
avg_consume_less_12_valid_month                                         7
abs                                                                  3920
top_trans_count_last_1_month                                         0.15
avg_price_last_12_month                                              1020
avg_price_top_last_12_valid_month                                    0.55
reg_preference_for_trad                                              一线城市
trans_top_time_last_1_month                                             4
                                                       ...               
loans_credibility_behavior                                             73
loans_count                                                            37
loans_settle_count                                                     34
loans_overdue_count                                                     2
loans_org_count_behavior                                               10
consfin_org_count_behavior                                              1
loans_cash_count                                                        9
latest_one_month_loan                                                   1
latest_three_month_loan                                                 1
latest_six_month_loan                                                  13
history_suc_fee                                                        37
history_fail_fee                                                        7
latest_one_month_suc                                                    1
latest_one_month_fail                                                   0
loans_long_time                                                       341
loans_latest_time                                              2018-04-19
loans_credit_limit                                                   2200
loans_credibility_limit                                                72
loans_org_count_current                                                 9
loans_product_count                                                    10
loans_max_limit                                                      2900
loans_avg_limit                                                      1688
consfin_credit_limit                                                 1200
consfin_credibility                                                    75
consfin_org_count_current                                               1
consfin_product_count                                                   2
consfin_max_limit                                                    1200
consfin_avg_limit                                                    1200
latest_query_day                                                       12
loans_latest_day                                                       18
Name: 0, dtype: object
===================
90
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 90 columns):
Unnamed: 0                                    4754 non-null int64
custid                                        4754 non-null int64
trade_no                                      4754 non-null object
bank_card_no                                  4754 non-null object
low_volume_percent                            4752 non-null float64
middle_volume_percent                         4752 non-null float64
take_amount_in_later_12_month_highest         4754 non-null int64
trans_amount_increase_rate_lately             4751 non-null float64
trans_activity_month                          4752 non-null float64
trans_activity_day                            4752 non-null float64
transd_mcc                                    4752 non-null float64
trans_days_interval_filter                    4746 non-null float64
trans_days_interval                           4752 non-null float64
regional_mobility                             4752 non-null float64
student_feature                               1756 non-null float64
repayment_capability                          4754 non-null int64
is_high_user                                  4754 non-null int64
number_of_trans_from_2011                     4752 non-null float64
first_transaction_time                        4752 non-null float64
historical_trans_amount                       4754 non-null int64
historical_trans_day                          4752 non-null float64
rank_trad_1_month                             4752 non-null float64
trans_amount_3_month                          4754 non-null int64
avg_consume_less_12_valid_month               4752 non-null float64
abs                                           4754 non-null int64
top_trans_count_last_1_month                  4752 non-null float64
avg_price_last_12_month                       4754 non-null int64
avg_price_top_last_12_valid_month             4650 non-null float64
reg_preference_for_trad                       4752 non-null object
trans_top_time_last_1_month                   4746 non-null float64
trans_top_time_last_6_month                   4746 non-null float64
consume_top_time_last_1_month                 4746 non-null float64
consume_top_time_last_6_month                 4746 non-null float64
cross_consume_count_last_1_month              4328 non-null float64
trans_fail_top_count_enum_last_1_month        4738 non-null float64
trans_fail_top_count_enum_last_6_month        4738 non-null float64
trans_fail_top_count_enum_last_12_month       4738 non-null float64
consume_mini_time_last_1_month                4728 non-null float64
max_cumulative_consume_later_1_month          4754 non-null int64
max_consume_count_later_6_month               4746 non-null float64
railway_consume_count_last_12_month           4742 non-null float64
pawns_auctions_trusts_consume_last_1_month    4754 non-null int64
pawns_auctions_trusts_consume_last_6_month    4754 non-null int64
jewelry_consume_count_last_6_month            4742 non-null float64
status                                        4754 non-null int64
source                                        4754 non-null object
first_transaction_day                         4752 non-null float64
trans_day_last_12_month                       4752 non-null float64
id_name                                       4478 non-null object
apply_score                                   4450 non-null float64
apply_credibility                             4450 non-null float64
query_org_count                               4450 non-null float64
query_finance_count                           4450 non-null float64
query_cash_count                              4450 non-null float64
query_sum_count                               4450 non-null float64
latest_query_time                             4450 non-null object
latest_one_month_apply                        4450 non-null float64
latest_three_month_apply                      4450 non-null float64
latest_six_month_apply                        4450 non-null float64
loans_score                                   4457 non-null float64
loans_credibility_behavior                    4457 non-null float64
loans_count                                   4457 non-null float64
loans_settle_count                            4457 non-null float64
loans_overdue_count                           4457 non-null float64
loans_org_count_behavior                      4457 non-null float64
consfin_org_count_behavior                    4457 non-null float64
loans_cash_count                              4457 non-null float64
latest_one_month_loan                         4457 non-null float64
latest_three_month_loan                       4457 non-null float64
latest_six_month_loan                         4457 non-null float64
history_suc_fee                               4457 non-null float64
history_fail_fee                              4457 non-null float64
latest_one_month_suc                          4457 non-null float64
latest_one_month_fail                         4457 non-null float64
loans_long_time                               4457 non-null float64
loans_latest_time                             4457 non-null object
loans_credit_limit                            4457 non-null float64
loans_credibility_limit                       4457 non-null float64
loans_org_count_current                       4457 non-null float64
loans_product_count                           4457 non-null float64
loans_max_limit                               4457 non-null float64
loans_avg_limit                               4457 non-null float64
consfin_credit_limit                          4457 non-null float64
consfin_credibility                           4457 non-null float64
consfin_org_count_current                     4457 non-null float64
consfin_product_count                         4457 non-null float64
consfin_max_limit                             4457 non-null float64
consfin_avg_limit                             4457 non-null float64
latest_query_day                              4450 non-null float64
loans_latest_day                              4457 non-null float64
dtypes: float64(70), int64(13), object(7)
memory usage: 3.3+ MB

通过上述结果可得:数据的种类多,需要去除一些无关数据
分析数据列信息:
Unnamed: 不连续序号,可能代表部分数据删除,先保存
custid: id号,删除
trad_no: 前部分表示的是时间,先保存
bank_card: 卡号1,都相同,删除
student_feature: 缺失率高,删除
source: 都为xs,删除
id_name
latest_query_time
loans_latest_time

useless_columns = ['Unnamed: 0','custid','trade_no','bank_card_no','id_name',
                   'student_feature', 'source','latest_query_time','loans_latest_time']
data = data.drop(useless_columns,axis=1)
print(data.iloc[0])
print(data.shape[1])
low_volume_percent                               0.01
middle_volume_percent                            0.99
take_amount_in_later_12_month_highest               0
trans_amount_increase_rate_lately                 0.9
trans_activity_month                             0.55
trans_activity_day                              0.313
transd_mcc                                         17
trans_days_interval_filter                         27
trans_days_interval                                26
regional_mobility                                   3
repayment_capability                            19890
is_high_user                                        0
number_of_trans_from_2011                          30
first_transaction_time                    2.01308e+07
historical_trans_amount                        149050
historical_trans_day                              151
rank_trad_1_month                                 0.4
trans_amount_3_month                            34030
avg_consume_less_12_valid_month                     7
abs                                              3920
top_trans_count_last_1_month                     0.15
avg_price_last_12_month                          1020
avg_price_top_last_12_valid_month                0.55
reg_preference_for_trad                          一线城市
trans_top_time_last_1_month                         4
trans_top_time_last_6_month                        19
consume_top_time_last_1_month                       4
consume_top_time_last_6_month                      19
cross_consume_count_last_1_month                    1
trans_fail_top_count_enum_last_1_month              1
                                             ...     
loans_score                                       552
loans_credibility_behavior                         73
loans_count                                        37
loans_settle_count                                 34
loans_overdue_count                                 2
loans_org_count_behavior                           10
consfin_org_count_behavior                          1
loans_cash_count                                    9
latest_one_month_loan                               1
latest_three_month_loan                             1
latest_six_month_loan                              13
history_suc_fee                                    37
history_fail_fee                                    7
latest_one_month_suc                                1
latest_one_month_fail                               0
loans_long_time                                   341
loans_credit_limit                               2200
loans_credibility_limit                            72
loans_org_count_current                             9
loans_product_count                                10
loans_max_limit                                  2900
loans_avg_limit                                  1688
consfin_credit_limit                             1200
consfin_credibility                                75
consfin_org_count_current                           1
consfin_product_count                               2
consfin_max_limit                                1200
consfin_avg_limit                                1200
latest_query_day                                   12
loans_latest_day                                   18
Name: 0, dtype: object
81
data.head(10).T


0

1

2

3

4

5

6

7

8

9

low_volume_percent

0.01

0.02

0.04

0

0.01

0.02

0.02

0.02

0.03

0.01

middle_volume_percent

0.99

0.94

0.96

0.96

0.99

0.98

0.98

0.98

0.65

0.99

take_amount_in_later_12_month_highest

0

2000

0

2000

0

2000

0

0

0

500

trans_amount_increase_rate_lately

0.9

1.28

1

0.13

0.46

7.59

23.67

0.25

0.31

0.8

trans_activity_month

0.55

1

1

0.57

1

1

0.94

0.88

0.76

1

trans_activity_day

0.313

0.458

0.114

0.777

0.175

0.733

0.087

0.302

0.472

0.088

transd_mcc

17

19

13

22

13

27

10

19

15

15

trans_days_interval_filter

27

30

68

14

66

8

54

20

21

36

trans_days_interval

26

14

22

6

42

11

53

20

14

35

regional_mobility

3

4

1

3

1

3

2

2

2

2

repayment_capability

19890

16970

9710

6210

11150

8420

11220

8220

69800

12510

is_high_user

0

0

0

0

0

0

0

0

0

0

number_of_trans_from_2011

30

23

9

33

12

18

15

30

28

14

first_transaction_time

2.01308e+07

2.01604e+07

2.01706e+07

2.01305e+07

2.01703e+07

2.01609e+07

2.01611e+07

2.01505e+07

2.01502e+07

2.01701e+07

historical_trans_amount

149050

302910

11520

491130

61470

416670

34140

144600

172290

37250

historical_trans_day

151

224

31

360

63

317

50

198

253

44

rank_trad_1_month

0.4

0.35

1

0.15

0.65

0.2

1

0.4

0.35

0.75

trans_amount_3_month

34030

10590

5710

91690

9770

78890

3080

13930

21760

6640

avg_consume_less_12_valid_month

7

5

5

7

6

6

6

8

8

7

abs

3920

6950

840

46850

760

22770

110

880

99950

110

top_trans_count_last_1_month

0.15

0.05

0.65

0.05

1

0.05

1

0.1

0.05

1

avg_price_last_12_month

1020

1210

570

1290

1110

1160

1250

920

860

1110

avg_price_top_last_12_valid_month

0.55

0.5

0.65

0.45

0.5

0.5

NaN

0.55

0.6

0.5

reg_preference_for_trad

一线城市

一线城市

一线城市

三线城市

一线城市

三线城市

一线城市

一线城市

三线城市

一线城市

trans_top_time_last_1_month

4

13

0

6

0

4

0

6

10

0

trans_top_time_last_6_month

19

30

68

8

66

7

54

20

10

36

consume_top_time_last_1_month

4

13

0

6

0

4

0

6

10

0

consume_top_time_last_6_month

19

30

68

8

66

12

54

20

10

36

cross_consume_count_last_1_month

1

0

0

0

0

1

NaN

0

2

0

trans_fail_top_count_enum_last_1_month

1

0

3

1

3

0

0

1

0

0

...

...

...

...

...

...

...

...

...

...

...

loans_score

552

635

633

542

479

676

612

NaN

451

589

loans_credibility_behavior

73

76

83

75

73

75

75

NaN

78

74

loans_count

37

37

4

85

37

60

7

NaN

24

13

loans_settle_count

34

36

2

81

32

55

4

NaN

11

12

loans_overdue_count

2

0

0

4

6

0

0

NaN

11

0

loans_org_count_behavior

10

17

3

22

12

22

3

NaN

10

5

consfin_org_count_behavior

1

5

1

5

2

10

2

NaN

8

0

loans_cash_count

9

12

2

17

10

12

1

NaN

2

5

latest_one_month_loan

1

2

2

2

0

1

2

NaN

3

0

latest_three_month_loan

1

2

2

4

0

1

3

NaN

10

0

latest_six_month_loan

13

8

4

34

10

25

3

NaN

16

6

history_suc_fee

37

49

2

91

36

78

9

NaN

22

13

history_fail_fee

7

4

2

26

25

12

6

NaN

53

3

latest_one_month_suc

1

2

1

2

0

4

1

NaN

0

0

latest_one_month_fail

0

1

1

0

0

0

2

NaN

21

0

loans_long_time

341

353

157

355

360

360

312

NaN

316

312

loans_credit_limit

2200

2000

1500

1800

1800

2600

2200

NaN

4700

1900

loans_credibility_limit

72

74

77

74

72

72

76

NaN

77

74

loans_org_count_current

9

12

2

17

10

12

1

NaN

2

5

loans_product_count

10

12

2

18

10

13

1

NaN

2

5

loans_max_limit

2900

3500

1600

3200

2300

5300

2200

NaN

5300

2800

loans_avg_limit

1688

1758

1250

1541

1630

1941

2200

NaN

4750

1520

consfin_credit_limit

1200

15100

4200

16300

8300

11200

7600

NaN

5500

0

consfin_credibility

75

80

87

80

79

80

73

NaN

79

0

consfin_org_count_current

1

5

1

5

2

10

2

NaN

8

0

consfin_product_count

2

6

1

5

2

12

2

NaN

11

0

consfin_max_limit

1200

22800

4200

30000

8400

20400

16800

NaN

19200

0

consfin_avg_limit

1200

9360

4200

12180

8250

8130

8900

NaN

7987

0

latest_query_day

12

4

2

2

22

3

1

NaN

24

18

loans_latest_day

18

2

6

4

120

4

3

NaN

7

142

81 rows × 10 columns


data['status'].value_counts()
0    3561
1    1193
Name: status, dtype: int64
# 删除数据相同的列
orig_columns = data.columns
drop_columns = []
for col in orig_columns:
    col_series = data[col].dropna().unique()
    if len(col_series) == 1:
        drop_columns.append(col)
data = data.drop(drop_columns, axis=1)
print(drop_columns)
[]
# 查看每列缺失值情况
null_counts = data.isnull().sum()
null_counts
low_volume_percent                          2
middle_volume_percent                       2
take_amount_in_later_12_month_highest       0
trans_amount_increase_rate_lately           3
trans_activity_month                        2
trans_activity_day                          2
transd_mcc                                  2
trans_days_interval_filter                  8
trans_days_interval                         2
regional_mobility                           2
repayment_capability                        0
is_high_user                                0
number_of_trans_from_2011                   2
first_transaction_time                      2
historical_trans_amount                     0
historical_trans_day                        2
rank_trad_1_month                           2
trans_amount_3_month                        0
avg_consume_less_12_valid_month             2
abs                                         0
top_trans_count_last_1_month                2
avg_price_last_12_month                     0
avg_price_top_last_12_valid_month         104
reg_preference_for_trad                     2
trans_top_time_last_1_month                 8
trans_top_time_last_6_month                 8
consume_top_time_last_1_month               8
consume_top_time_last_6_month               8
cross_consume_count_last_1_month          426
trans_fail_top_count_enum_last_1_month     16
                                         ... 
loans_score                               297
loans_credibility_behavior                297
loans_count                               297
loans_settle_count                        297
loans_overdue_count                       297
loans_org_count_behavior                  297
consfin_org_count_behavior                297
loans_cash_count                          297
latest_one_month_loan                     297
latest_three_month_loan                   297
latest_six_month_loan                     297
history_suc_fee                           297
history_fail_fee                          297
latest_one_month_suc                      297
latest_one_month_fail                     297
loans_long_time                           297
loans_credit_limit                        297
loans_credibility_limit                   297
loans_org_count_current                   297
loans_product_count                       297
loans_max_limit                           297
loans_avg_limit                           297
consfin_credit_limit                      297
consfin_credibility                       297
consfin_org_count_current                 297
consfin_product_count                     297
consfin_max_limit                         297
consfin_avg_limit                         297
latest_query_day                          304
loans_latest_day                          297
dtype: int64
# 查看数据类型
data.dtypes.value_counts()
float64    69
int64      11
object      1
dtype: int64
# 处理非数值型数据
object_columns_df = data.select_dtypes(include=['object'])
object_columns_df.iloc[0]
reg_preference_for_trad    一线城市
Name: 0, dtype: object
data['reg_preference_for_trad'].unique()
array(['一线城市', '三线城市', '境外', '二线城市', '其他城市', nan], dtype=object)
# 数据映射
mapping_dict = {
    'reg_preference_for_trad': {
        '一线城市': 1,
        '二线城市': 2,
        '三线城市': 3,
        '其他城市': 4,
        '境外': 0
    }
}
data = data.replace(mapping_dict)
data['reg_preference_for_trad'].unique()
array([  1.,   3.,   0.,   2.,   4.,  nan])
data.head(10)


low_volume_percent

middle_volume_percent

take_amount_in_later_12_month_highest

trans_amount_increase_rate_lately

trans_activity_month

trans_activity_day

transd_mcc

trans_days_interval_filter

trans_days_interval

regional_mobility

...

loans_max_limit

loans_avg_limit

consfin_credit_limit

consfin_credibility

consfin_org_count_current

consfin_product_count

consfin_max_limit

consfin_avg_limit

latest_query_day

loans_latest_day

0

0.01

0.99

0

0.90

0.55

0.313

17.0

27.0

26.0

3.0

...

2900.0

1688.0

1200.0

75.0

1.0

2.0

1200.0

1200.0

12.0

18.0

1

0.02

0.94

2000

1.28

1.00

0.458

19.0

30.0

14.0

4.0

...

3500.0

1758.0

15100.0

80.0

5.0

6.0

22800.0

9360.0

4.0

2.0

2

0.04

0.96

0

1.00

1.00

0.114

13.0

68.0

22.0

1.0

...

1600.0

1250.0

4200.0

87.0

1.0

1.0

4200.0

4200.0

2.0

6.0

3

0.00

0.96

2000

0.13

0.57

0.777

22.0

14.0

6.0

3.0

...

3200.0

1541.0

16300.0

80.0

5.0

5.0

30000.0

12180.0

2.0

4.0

4

0.01

0.99

0

0.46

1.00

0.175

13.0

66.0

42.0

1.0

...

2300.0

1630.0

8300.0

79.0

2.0

2.0

8400.0

8250.0

22.0

120.0

5

0.02

0.98

2000

7.59

1.00

0.733

27.0

8.0

11.0

3.0

...

5300.0

1941.0

11200.0

80.0

10.0

12.0

20400.0

8130.0

3.0

4.0

6

0.02

0.98

0

23.67

0.94

0.087

10.0

54.0

53.0

2.0

...

2200.0

2200.0

7600.0

73.0

2.0

2.0

16800.0

8900.0

1.0

3.0

7

0.02

0.98

0

0.25

0.88

0.302

19.0

20.0

20.0

2.0

...

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

8

0.03

0.65

0

0.31

0.76

0.472

15.0

21.0

14.0

2.0

...

5300.0

4750.0

5500.0

79.0

8.0

11.0

19200.0

7987.0

24.0

7.0

9

0.01

0.99

500

0.80

1.00

0.088

15.0

36.0

35.0

2.0

...

2800.0

1520.0

0.0

0.0

0.0

0.0

0.0

0.0

18.0

142.0

10 rows × 81 columns


# 划分数据集和测试集
y = data['status']
x = data.drop(['status'], axis=1)
X_train,X_test,y_train,y_test=train_test_split(x, y, test_size=0.3, random_state=2018)
print("Training Size:{}".format(X_train.shape))
print('Testing Size:{}'.format(X_test.shape))
Training Size:(3327, 80)
Testing Size:(1427, 80)

问题讨论

1. 缺失数据
  • 为什么需要处理缺失值?
  • 缺失率大于多少时应当抛弃该特征?
  • 缺失值填充有哪些方法?
  • 采用均值填充的影响或者优缺点?
  • 需要依据什么样的准则去选择合适的方法?

1 为什么需要处理缺失值?
不同缺失值的处理方式在一定程度上影响了特征提取、建模和模型训练,缺失值太多,可以尝试着直接删除,如果不删除,处理不好,可能会引来噪声缺失值较少,少于某一缺失率时,直接删除又会带来信息的损失,此时可以采取适当的填充方式

2 缺失率大于多少时应当抛弃该特征?
一般为70%,但是还要分析该特征与训练目标的重要程度

3 缺失值填充有哪些方法?
(1) 插值填充
- 特殊值,均值、中位数、众数等
(2) 插补法
- 随机插补法----从总体中随机抽取某个样本代替缺失样本
- 多重插补法----通过变量之间的关系对缺失数据进行预测,利用蒙特卡洛方法生成多个完整的数据集,在对这些数据集进行分析,最后对分析结果进行汇总处理
- 热平台插补----指在非缺失数据集中找到一个与缺失值所在样本相似的样本(匹配样本),利用其中的观测值对缺失值进行插补
- 拉格朗日差值法和牛顿插值法

4 采用均值填充的影响或者优缺点?
缺点:大大降低数据的方差,即随机性

5 需要依据什么样的准则去选择合适的方法?
(1) 删除
- 如果行和列的缺失达到一定的比例,建议放弃整行或整列数据
(2) 插补
- 列的维度上,如果是连续性,就使用平均值插补,如果是离散性,就使用众数来插补
- 行的维度上,引入预测模型,可考虑辅助回归,通过变量间的关系来预测缺失数据
(3) 不处理
- 当缺失值对模型的影响不大时,直接在包含空值的数据上进行数据挖掘,很多模型对于缺失值有容忍度或灵活的处理方法,常见的能够自动处理缺失值的模型包括:KNN、决策树和随机森林、神经网络和朴素贝叶斯、DBSCAN

2. 数据探索

对于字段较少的情况下经常使用绘图来更直观的观察数据的分布,进而对数据进行针对性的处理;但是再字段量较多的情况下一个一个字段去绘图会比较费时间,那应该用怎么的顺序逻辑对字段进行处理?

这个情况在银行业普遍存在,当然,其他领域估计也会有。以我个人经历,在实际生产中会有一张表超过300个字段的情况,哪些字段该要哪些不该要确实比较麻烦。我采取的方式是首先去判断哪些字段值重复率较高,这个通过sql语句group by可以直接看出来。其次把数据通过spss对每一个特征进行分析,是绘图还是简单的分析,软件里面都有提供,基本上通过上面两步保证百分之七八十吧,如果仅仅是是在数据探索阶段的话,基本上就完成了

3.时间序列

时间序列应该怎么处理?除了提取天数还能做什么处理?

依情况处理,主要看单独时间字段或时间字段与某些字段的组合属性对目标分析的作用程度,再采取相应方式来进行特征提取比如:可以将时间字段与其他字段属性进行组合,分析每天、每周、每月或特点星期几等情况下特征数据频率信息,总的来说还是得看分析得目标

了考虑时间序列这个单独的特征外,往往是将时间序列和具有时间属性的特征联合起来分析,查看组合特征的对所需要分析的内容的影响

4.异常值和离群值

怎样判断离群值以及是否需要删除离群值或怎样替代离群值?(比如一些手动录入过程中出错产生的离群值等)

离群点可以用分位数
看与平均值的偏差超过几倍标准差
LOF算法
describe的时候加一个 箱型图

大多数的参数统计数值,如均值、标准差、相关系数等,以及基于这些参数的统计分析,均对离群值高度敏感。因此,离群值的存在会对数据分析造成极大影响

可以参考博文: https://www.jianshu.com/p/0c967a1526ef

5.分类数据的编码

这里城市的分类显然不适合用独热码编码了,那么如果在其它时候使用sklearn.preprocessing中的LabelBinarizer后重新编码的文本特征又怎样应用到预测中?

one-hot:彼此之间没关系
dummy: 有顺序性的特征
试想编码出来只是数字,建模的时候本质上计算的是类别间的距离
所以我们就需要根据实际情况去判断,我们的类别间到底存不存在距离关系

6.类别不平衡问题

举实际场景例子:
1.信用卡欺诈 可能几百几千个客户里面才会有一个欺诈情况出现
2.网页广告点击率,这个比例更夸张
3.医生病情误判情况

处理数据不平衡问题

  1. 改变数据量大小,使类别间变得平衡
  2. 不改变数据量,设置成本矩阵或代价函数来限定

以上问题是大家提出问题并一起讨论的解决办法,收获颇多,希望之后能够多多参与一起讨论、学习
下面是负责人整理的资料

缺失值处理方法: 数据编码:https://mp.weixin.qq.com/s/U93vvFwZ8vSJuswk24yc6w
数据不平衡问题

数据不平衡问题
what?
通常出现在类别问题中,类别间数据量差异过大
例如:10W正例,100反例

why?
为什么会出现
- 收集的数据量不多,不全面
- 数据集本身特点,如信用卡欺诈、用户投诉、机器故障数据,可能1K个顾客里面,只有1个会欺诈。

为什么需要处理
- 类别间的不平衡,会很容易导致模型预测偏移。比如信用卡欺诈,直接跑决策树,可能预测准确率会高达99.99%不会欺诈。这样的结果并没有意义。
- 所以像这种类别不平衡的,我们需要采取一些策略去调整数据。

how?
目标:
- 改变数据量,使类别间数据量尽量相等。 正例数据量 = 反例数据量
- 数据量不变,通过采用不同模型或者评估方法来消除不平衡的影响

策略:
- 1. 扩大数据集。以期得到更多的分布信息;同时扩大数据量后也方便后面的重新采样。
- 2. 尝试新角度理解问题。我们可以把那些小类的样本作为异常点(outliers),因此该问题便转化为异常点检测(anomaly detection)与变化趋势检测问题(change detection)。

异常点检测即是对那些罕见事件进行识别。如通过机器的部件的振动识别机器故障,又如通过系统调用序列识别恶意程序。这些事件相对于正常情况是很少见的。

变化趋势检测类似于异常点检测,不同在于其通过检测不寻常的变化趋势来识别。如通过观察用户模式或银行交易来检测用户行为的不寻常改变。

Sampling methods(采样方法)
增加偏少的类别 - OverSampling过采样
① 简单随机重复。
② SMOTE。在相同边界内,人工造数据。
减少偏多的类别 - UnderSampling欠采样
① 简单随机抽样。
② 带边界清理。尽量保留分布信息。

采用集成学习思想:
把多数类进行划分,然后和少数类组合成多个小的训练集,然后生成学习器,最后再集成。