Feature engineering is an essential skill in data science and machine learning: it strongly affects both model performance and the quality of exploratory data analysis (EDA). This article introduces several feature engineering techniques.
Table of contents
- What is feature engineering
- Datasets
- Handling missing values
- Missing values in categorical features
- Missing values in numerical features
- Filling missing values with a model
- Handling categorical features
- Types of categorical features
- One-hot encoding
- Hash encoding
- Handling numerical/continuous features
- Creating features from domain knowledge
- Polynomial (cross) features
- Feature standardization
- Handling date features
- Handling geolocation features
What is feature engineering
- Using domain knowledge to create features that make machine learning algorithms work
- Extracting features from raw data and transforming them into the format an algorithm expects
- Requires domain knowledge plus mathematical and programming skills
Datasets
This article uses a loan-default prediction dataset and a delivery logistics dataset; see the accompanying notebook for the full code.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore') # suppress warnings
#load loan datasets
loan_demographics = pd.read_csv('traindemographics.csv')
loan_prev = pd.read_csv('trainprevloans.csv')
loan_perf = pd.read_csv('trainperf.csv')
#load logistics dataset
sendy_data = pd.read_csv('sendy_logistics.csv')
# Loan demographics
# customerid (customer id)
# birthdate (customer's date of birth)
# bank_account_type (bank account type)
# longitude_gps (longitude)
# latitude_gps (latitude)
# bank_name_clients (bank name)
# bank_branch_clients (bank branch; optional, hence many missing values)
# employment_status_clients (employment status)
# level_of_education_clients (education level)
loan_demographics.head().append(loan_demographics.tail())
customerid | birthdate | bank_account_type | longitude_gps | latitude_gps | bank_name_clients | bank_branch_clients | employment_status_clients | level_of_education_clients | |
0 | 8a858e135cb22031015cbafc76964ebd | 1973-10-10 00:00:00.000000 | Savings | 3.319219 | 6.528604 | GT Bank | NaN | NaN | NaN |
1 | 8a858e275c7ea5ec015c82482d7c3996 | 1986-01-21 00:00:00.000000 | Savings | 3.325598 | 7.119403 | Sterling Bank | NaN | Permanent | NaN |
2 | 8a858e5b5bd99460015bdc95cd485634 | 1987-04-01 00:00:00.000000 | Savings | 5.746100 | 5.563174 | Fidelity Bank | NaN | NaN | NaN |
3 | 8a858efd5ca70688015cabd1f1e94b55 | 1991-07-19 00:00:00.000000 | Savings | 3.362850 | 6.642485 | GT Bank | NaN | Permanent | NaN |
4 | 8a858e785acd3412015acd48f4920d04 | 1982-11-22 00:00:00.000000 | Savings | 8.455332 | 11.971410 | GT Bank | NaN | Permanent | NaN |
4341 | 8a858f155554552501555588ca2b3b40 | 1985-12-13 00:00:00.000000 | Other | 3.236753 | 7.030168 | Stanbic IBTC | NaN | Permanent | Graduate |
4342 | 8a858fc65cf978f4015cf97cee3a02ce | 1982-07-01 00:00:00.000000 | Savings | 7.013749 | 4.875662 | GT Bank | NaN | NaN | NaN |
4343 | 8a858f4f5b66de3a015b66fc83c61902 | 1989-09-26 00:00:00.000000 | Savings | 6.295530 | 7.092508 | GT Bank | NaN | Permanent | NaN |
4344 | 8aaae7a74400b28201441c8b62514150 | 1985-09-06 00:00:00.000000 | Savings | 3.354206 | 6.539070 | GT Bank | HEAD OFFICE | Permanent | Primary |
4345 | 8a85896653e2e18b0153e69c1b90265c | 1975-06-05 00:00:00.000000 | Savings | 6.661014 | 7.472700 | UBA | NaN | Permanent | NaN |
# trainperf.csv: each customer's current loan record; the task is to predict default from the customer's previous loans and demographic information
# customerid (customer id)
# systemloanid (loan id; one per loan per customer)
# loannumber (sequence number of the loan to be predicted)
# approveddate (date the loan was approved)
# creationdate (date the loan was applied for)
# loanamount (loan amount)
# totaldue (total amount due: loan amount + interest + other fees)
# termdays (loan term in days)
# referredby (referrer id; empty if no referrer)
# good_bad_flag (good = loan settled on time; bad = loan not settled on time) - the target to predict
loan_perf.head().append(loan_perf.tail())
customerid | systemloanid | loannumber | approveddate | creationdate | loanamount | totaldue | termdays | referredby | good_bad_flag | |
0 | 8a2a81a74ce8c05d014cfb32a0da1049 | 301994762 | 12 | 2017-07-25 08:22:56.000000 | 2017-07-25 07:22:47.000000 | 30000.0 | 34500.0 | 30 | NaN | Good |
1 | 8a85886e54beabf90154c0a29ae757c0 | 301965204 | 2 | 2017-07-05 17:04:41.000000 | 2017-07-05 16:04:18.000000 | 15000.0 | 17250.0 | 30 | NaN | Good |
2 | 8a8588f35438fe12015444567666018e | 301966580 | 7 | 2017-07-06 14:52:57.000000 | 2017-07-06 13:52:51.000000 | 20000.0 | 22250.0 | 15 | NaN | Good |
3 | 8a85890754145ace015429211b513e16 | 301999343 | 3 | 2017-07-27 19:00:41.000000 | 2017-07-27 18:00:35.000000 | 10000.0 | 11500.0 | 15 | NaN | Good |
4 | 8a858970548359cc0154883481981866 | 301962360 | 9 | 2017-07-03 23:42:45.000000 | 2017-07-03 22:42:39.000000 | 40000.0 | 44000.0 | 30 | NaN | Good |
4363 | 8a858e6d58b0cc520158beeb14b22a5a | 302003163 | 2 | 2017-07-30 09:19:42.000000 | 2017-07-30 08:18:30.000000 | 10000.0 | 13000.0 | 30 | NaN | Bad |
4364 | 8a858ee85cf400f5015cf44ab1c42d5c | 301998967 | 2 | 2017-07-27 15:35:47.000000 | 2017-07-27 14:35:40.000000 | 10000.0 | 13000.0 | 30 | NaN | Bad |
4365 | 8a858f365b2547f3015b284597147c94 | 301995576 | 3 | 2017-07-25 16:25:57.000000 | 2017-07-25 15:24:47.000000 | 10000.0 | 11500.0 | 15 | NaN | Bad |
4366 | 8a858f935ca09667015ca0ee3bc63f51 | 301977679 | 2 | 2017-07-14 13:50:27.000000 | 2017-07-14 12:50:21.000000 | 10000.0 | 13000.0 | 30 | 8a858eda5c8863ff015c9dead65807bb | Bad |
4367 | 8a858fd458639fcc015868eb14b542ad | 301967124 | 8 | 2017-07-06 21:01:06.000000 | 2017-07-06 20:01:01.000000 | 30000.0 | 34500.0 | 30 | NaN | Bad |
# trainprevloans.csv: all loans each customer took before the current one.
# customerid (customer id)
# systemloanid (loan id; one per loan per customer)
# loannumber (sequence number of the loan)
# approveddate (date the loan was approved)
# creationdate (date the loan was applied for)
# loanamount (loan amount)
# totaldue (total amount due: loan amount + interest + other fees)
# termdays (loan term in days)
# closeddate (date the loan was settled)
# referredby (referrer id; empty if no referrer)
# firstduedate (date of the first payment due, for loans with terms over 30 days)
# firstrepaiddate (date of the actual first payment)
loan_prev.head().append(loan_prev.tail())
customerid | systemloanid | loannumber | approveddate | creationdate | loanamount | totaldue | termdays | closeddate | referredby | firstduedate | firstrepaiddate | |
0 | 8a2a81a74ce8c05d014cfb32a0da1049 | 301682320 | 2 | 2016-08-15 18:22:40.000000 | 2016-08-15 17:22:32.000000 | 10000.0 | 13000.0 | 30 | 2016-09-01 16:06:48.000000 | NaN | 2016-09-14 00:00:00.000000 | 2016-09-01 15:51:43.000000 |
1 | 8a2a81a74ce8c05d014cfb32a0da1049 | 301883808 | 9 | 2017-04-28 18:39:07.000000 | 2017-04-28 17:38:53.000000 | 10000.0 | 13000.0 | 30 | 2017-05-28 14:44:49.000000 | NaN | 2017-05-30 00:00:00.000000 | 2017-05-26 00:00:00.000000 |
2 | 8a2a81a74ce8c05d014cfb32a0da1049 | 301831714 | 8 | 2017-03-05 10:56:25.000000 | 2017-03-05 09:56:19.000000 | 20000.0 | 23800.0 | 30 | 2017-04-26 22:18:56.000000 | NaN | 2017-04-04 00:00:00.000000 | 2017-04-26 22:03:47.000000 |
3 | 8a8588f35438fe12015444567666018e | 301861541 | 5 | 2017-04-09 18:25:55.000000 | 2017-04-09 17:25:42.000000 | 10000.0 | 11500.0 | 15 | 2017-04-24 01:35:52.000000 | NaN | 2017-04-24 00:00:00.000000 | 2017-04-24 00:48:43.000000 |
4 | 8a85890754145ace015429211b513e16 | 301941754 | 2 | 2017-06-17 09:29:57.000000 | 2017-06-17 08:29:50.000000 | 10000.0 | 11500.0 | 15 | 2017-07-14 21:18:43.000000 | NaN | 2017-07-03 00:00:00.000000 | 2017-07-14 21:08:35.000000 |
18178 | 8a858899538ddb8e0153a2b555421fc5 | 301611754 | 2 | 2016-04-16 13:36:34.000000 | 2016-04-16 12:36:28.000000 | 10000.0 | 13000.0 | 30 | 2016-05-14 00:04:52.000000 | NaN | 2016-05-16 00:00:00.000000 | 2016-05-13 18:05:07.000000 |
18179 | 8a858899538ddb8e0153a2b555421fc5 | 301761267 | 9 | 2016-11-18 14:26:07.000000 | 2016-11-18 13:25:51.000000 | 30000.0 | 34400.0 | 30 | 2016-12-13 16:08:57.000000 | NaN | 2016-12-19 00:00:00.000000 | 2016-12-13 15:53:48.000000 |
18180 | 8a858899538ddb8e0153a2b555421fc5 | 301631653 | 4 | 2016-06-12 15:30:56.000000 | 2016-06-12 14:30:50.000000 | 10000.0 | 13000.0 | 30 | 2016-07-09 15:39:00.000000 | NaN | 2016-07-12 00:00:00.000000 | 2016-07-09 15:23:56.000000 |
18181 | 8a858f0656b7820c0156c92ca3ba436f | 301697691 | 1 | 2016-08-27 20:03:45.000000 | 2016-08-27 19:03:34.000000 | 10000.0 | 13000.0 | 30 | 2016-10-15 10:17:54.000000 | NaN | 2016-09-26 00:00:00.000000 | 2016-10-15 10:02:45.000000 |
18182 | 8a858faf5679a838015688de3028143d | 301715255 | 2 | 2016-09-14 23:42:14.000000 | 2016-09-14 22:42:05.000000 | 10000.0 | 13000.0 | 30 | 2016-09-29 19:51:04.000000 | NaN | 2016-10-14 00:00:00.000000 | 2016-09-29 19:35:55.000000 |
# Delivery (logistics) data
sendy_data.head().append(sendy_data.tail())
Order No | User Id | Vehicle Type | Platform Type | Personal or Business | Placement - Day of Month | Placement - Weekday (Mo = 1) | Placement - Time | Confirmation - Day of Month | Confirmation - Weekday (Mo = 1) | ... | Destination Lat | Destination Long | Rider Id | Time from Pickup to Arrival | speed | manhattan_dist | haversine_dist | bearing | center_latitude | center_longitude | |
0 | Order_No_4211 | User_Id_633 | Bike | 3 | Business | 9 | 5 | 9:35:46 AM | 9 | 5 | ... | -1.300406 | 36.829741 | Rider_Id_432 | 745.0 | 0.002160 | 0.017978 | 1.930333 | -2.076903 | -1.309080 | 36.830056 |
1 | Order_No_25375 | User_Id_2285 | Bike | 3 | Personal | 12 | 5 | 11:16:16 AM | 12 | 5 | ... | -1.295004 | 36.814358 | Rider_Id_856 | 1993.0 | 0.001422 | 0.141406 | 11.339849 | -56.392163 | -1.323229 | 36.856837 |
2 | Order_No_1899 | User_Id_265 | Bike | 3 | Business | 30 | 2 | 12:39:25 PM | 30 | 2 | ... | -1.300921 | 36.828195 | Rider_Id_155 | 455.0 | 0.003047 | 0.022588 | 1.880079 | -64.183866 | -1.304603 | 36.835807 |
3 | Order_No_9336 | User_Id_1402 | Bike | 3 | Business | 15 | 5 | 9:25:34 AM | 15 | 5 | ... | -1.257147 | 36.795063 | Rider_Id_855 | 1341.0 | 0.001717 | 0.061487 | 4.943458 | -57.091553 | -1.269224 | 36.813730 |
4 | Order_No_27883 | User_Id_1737 | Bike | 1 | Personal | 13 | 1 | 9:55:18 AM | 13 | 1 | ... | -1.295041 | 36.809817 | Rider_Id_770 | 1214.0 | 0.001897 | 0.046143 | 3.724829 | 148.114398 | -1.280819 | 36.800968 |
21196 | Order_No_8834 | User_Id_2001 | Bike | 3 | Personal | 20 | 3 | 3:54:38 PM | 20 | 3 | ... | -1.275285 | 36.802702 | Rider_Id_953 | 9.0 | 0.154033 | 0.018968 | 1.890335 | -172.912798 | -1.266849 | 36.803751 |
21197 | Order_No_22892 | User_Id_1796 | Bike | 3 | Business | 13 | 6 | 10:13:34 AM | 13 | 6 | ... | -1.331619 | 36.847976 | Rider_Id_155 | 770.0 | 0.002701 | 0.047443 | 3.731709 | 136.829614 | -1.319381 | 36.836493 |
21198 | Order_No_2831 | User_Id_2956 | Bike | 3 | Business | 7 | 4 | 5:06:16 PM | 7 | 4 | ... | -1.258414 | 36.804800 | Rider_Id_697 | 2953.0 | 0.001031 | 0.120338 | 10.756212 | -73.420487 | -1.272216 | 36.851167 |
21199 | Order_No_6174 | User_Id_2524 | Bike | 1 | Personal | 4 | 3 | 9:31:39 AM | 4 | 3 | ... | -1.279209 | 36.794872 | Rider_Id_347 | 1380.0 | 0.001912 | 0.108474 | 9.393189 | -110.208129 | -1.264620 | 36.834520 |
21200 | Order_No_9836 | User_Id_718 | Bike | 3 | Business | 26 | 2 | 2:19:47 PM | 26 | 2 | ... | -1.320157 | 36.830887 | Rider_Id_177 | 2128.0 | 0.001205 | 0.113651 | 9.026448 | 143.161318 | -1.287673 | 36.806545 |
10 rows × 35 columns
There are three main kinds of features here: numerical, categorical, and date features. The techniques below are in no particular order and do not apply to every situation; choose according to your data.
Handling missing values
Values can be missing because a customer did not provide them, because a field was left blank by mistake, or because the value was hard to measure. Missing values can seriously affect a model; several ways of handling them are listed here.
The appropriate method depends on the feature type.
Missing values in categorical features
Options include: filling with the mode, sequential fill (using the preceding/following value), and treating the missing value as its own encoded category.
In the loan demographics table loan_demographics, three categorical features contain missing values: bank_branch_clients, employment_status_clients, and level_of_education_clients.
## Inspect missing values
loan_demographics.isna().sum()
customerid 0
birthdate 0
bank_account_type 0
longitude_gps 0
latitude_gps 0
bank_name_clients 0
bank_branch_clients 4295
employment_status_clients 648
level_of_education_clients 3759
dtype: int64
## Inspect the class distribution
loan_demographics['employment_status_clients'].value_counts()
Permanent 3146
Self-Employed 348
Student 142
Unemployed 57
Retired 4
Contract 1
Name: employment_status_clients, dtype: int64
Permanent is by far the most frequent employment status, so it can be used to fill the missing values.
## Fill with the mode (Permanent)
loan_demographics['employment_status_clients'] = loan_demographics['employment_status_clients'].fillna(value='Permanent')
loan_demographics.isna().sum()
customerid 0
birthdate 0
bank_account_type 0
longitude_gps 0
latitude_gps 0
bank_name_clients 0
bank_branch_clients 4295
employment_status_clients 0
level_of_education_clients 3759
dtype: int64
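Besides the mode, the other options listed above (sequential fill and a dedicated "missing" category) map directly onto pandas methods. A minimal sketch on a toy Series (the values are illustrative, not taken from loan_demographics):

```python
import pandas as pd

s = pd.Series(['Permanent', None, 'Student', None, 'Self-Employed'])

# Forward fill: propagate the previous observed category
ffilled = s.ffill()

# Backward fill: propagate the next observed category
bfilled = s.bfill()

# Or treat missingness as its own category
flagged = s.fillna('Unknown')

print(ffilled.tolist())
```

Forward/backward fill only makes sense when row order carries meaning (e.g. time-ordered records); otherwise the flag category is the safer default.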
Missing values in numerical features
1. Fill with the mean, mode, or median
2. Time-series fill (forward/backward)
3. Fill with a machine learning model: train a model to predict the missing values
The Sendy logistics dataset has two numerical features with missing values, Temperature and Precipitation in millimeters, which serve as the example:
sendy_data.head().append(sendy_data.tail())
Order No | User Id | Vehicle Type | Platform Type | Personal or Business | Placement - Day of Month | Placement - Weekday (Mo = 1) | Placement - Time | Confirmation - Day of Month | Confirmation - Weekday (Mo = 1) | ... | Arrival at Destination - Time | Distance (KM) | Temperature | Precipitation in millimeters | Pickup Lat | Pickup Long | Destination Lat | Destination Long | Rider Id | Time from Pickup to Arrival | |
0 | Order_No_4211 | User_Id_633 | Bike | 3 | Business | 9 | 5 | 9:35:46 AM | 9 | 5 | ... | 10:39:55 AM | 4 | 20.4 | NaN | -1.317755 | 36.830370 | -1.300406 | 36.829741 | Rider_Id_432 | 745 |
1 | Order_No_25375 | User_Id_2285 | Bike | 3 | Personal | 12 | 5 | 11:16:16 AM | 12 | 5 | ... | 12:17:22 PM | 16 | 26.4 | NaN | -1.351453 | 36.899315 | -1.295004 | 36.814358 | Rider_Id_856 | 1993 |
2 | Order_No_1899 | User_Id_265 | Bike | 3 | Business | 30 | 2 | 12:39:25 PM | 30 | 2 | ... | 1:00:38 PM | 3 | NaN | NaN | -1.308284 | 36.843419 | -1.300921 | 36.828195 | Rider_Id_155 | 455 |
3 | Order_No_9336 | User_Id_1402 | Bike | 3 | Business | 15 | 5 | 9:25:34 AM | 15 | 5 | ... | 10:05:27 AM | 9 | 19.2 | NaN | -1.281301 | 36.832396 | -1.257147 | 36.795063 | Rider_Id_855 | 1341 |
4 | Order_No_27883 | User_Id_1737 | Bike | 1 | Personal | 13 | 1 | 9:55:18 AM | 13 | 1 | ... | 10:25:37 AM | 9 | 15.4 | NaN | -1.266597 | 36.792118 | -1.295041 | 36.809817 | Rider_Id_770 | 1214 |
21196 | Order_No_8834 | User_Id_2001 | Bike | 3 | Personal | 20 | 3 | 3:54:38 PM | 20 | 3 | ... | 4:20:17 PM | 3 | 28.6 | NaN | -1.258414 | 36.804800 | -1.275285 | 36.802702 | Rider_Id_953 | 9 |
21197 | Order_No_22892 | User_Id_1796 | Bike | 3 | Business | 13 | 6 | 10:13:34 AM | 13 | 6 | ... | 10:46:17 AM | 7 | 26.0 | NaN | -1.307143 | 36.825009 | -1.331619 | 36.847976 | Rider_Id_155 | 770 |
21198 | Order_No_2831 | User_Id_2956 | Bike | 3 | Business | 7 | 4 | 5:06:16 PM | 7 | 4 | ... | 6:40:05 PM | 20 | 29.2 | NaN | -1.286018 | 36.897534 | -1.258414 | 36.804800 | Rider_Id_697 | 2953 |
21199 | Order_No_6174 | User_Id_2524 | Bike | 1 | Personal | 4 | 3 | 9:31:39 AM | 4 | 3 | ... | 10:08:15 AM | 13 | 15.0 | NaN | -1.250030 | 36.874167 | -1.279209 | 36.794872 | Rider_Id_347 | 1380 |
21200 | Order_No_9836 | User_Id_718 | Bike | 3 | Business | 26 | 2 | 2:19:47 PM | 26 | 2 | ... | 3:17:23 PM | 12 | 30.9 | NaN | -1.255189 | 36.782203 | -1.320157 | 36.830887 | Rider_Id_177 | 2128 |
10 rows × 29 columns
# Note: use only one fill strategy at a time; mode() can return several values, hence the [0] index
mean_df = round(sendy_data['Temperature'].mean())
mode_df = round(sendy_data['Temperature'].mode()[0])
median_df = round(sendy_data['Temperature'].median())
#Fill with mean
print("Filling with mean value of {}".format(mean_df))
sendy_data['Temperature'] = sendy_data['Temperature'].fillna(mean_df)
#Fill with mode
print("Filling with modal value of {}".format(mode_df))
sendy_data['Temperature'] = sendy_data['Temperature'].fillna(mode_df)
#Fill with median
print("Filling with median value of {}".format(median_df))
sendy_data['Temperature'] = sendy_data['Temperature'].fillna(median_df)
Filling with mean value of 23.0
Filling with modal value of 23.0
Filling with median value of 23.0
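The time-series options from the list above are not shown on the dataset; a minimal sketch on a toy numeric Series (values are illustrative):

```python
import pandas as pd

temps = pd.Series([20.4, None, 23.7, None, None, 15.0])

# Linear interpolation between neighbouring observed values
interp = temps.interpolate()

# Forward fill: carry the last observation forward
ff = temps.ffill()

print(interp.round(2).tolist())
```

Interpolation assumes the rows are ordered (e.g. by timestamp); on shuffled data it produces meaningless in-between values.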
Filling missing values with a model
Here the Precipitation in millimeters feature is filled with a model. First, find features correlated with the target feature to use as model inputs.
A Seaborn heatmap helps identify correlated features.
plt.figure(figsize = (15,10))
sns.heatmap(sendy_data.corr())
<matplotlib.axes._subplots.AxesSubplot at 0x15c46e85cc0>
The heatmap shows that most features are uncorrelated with it; here the last three features are used: Destination Lat, Destination Long, and Time from Pickup to Arrival.
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
to_train = ['Precipitation in millimeters', 'Destination Lat', 'Destination Long', 'Time from Pickup to Arrival']
temp_df = sendy_data[to_train]
#Split dataset with missing values and no missing values as test and train set respectively.
x_train = temp_df[temp_df['Precipitation in millimeters'].notnull()].drop(columns='Precipitation in millimeters')
y_train = temp_df[temp_df['Precipitation in millimeters'].notnull()]['Precipitation in millimeters']
x_test = temp_df[temp_df['Precipitation in millimeters'].isnull()].drop(columns='Precipitation in millimeters')
#Fit a simple linear model to the dataset
lr.fit(x_train, y_train)
pred = lr.predict(x_test)
#print fill values
print(np.round(pred, 5))
#Perform filling
sendy_data.loc[sendy_data['Precipitation in millimeters'].isnull(), 'Precipitation in millimeters'] = pred  # .loc avoids chained assignment
[8.03056 8.18609 7.86142 ... 8.46164 7.45028 8.60717]
Alternatively, IterativeImputer (in sklearn.impute, enabled via the sklearn.experimental module) can fill missing values automatically.
missing_indx = list(sendy_data['Temperature'][sendy_data['Temperature'].isna()].index)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
# IterativeImputer with a random forest estimator
imp = IterativeImputer(RandomForestRegressor(n_estimators=5), max_iter=5, random_state=1)
to_train = ['Temperature', 'Destination Lat', 'Destination Long','Time from Pickup to Arrival']
#Perform the imputation
sendy_data[to_train] = pd.DataFrame(imp.fit_transform(sendy_data[to_train]), columns=to_train)
sendy_data['Temperature'][missing_indx].head(10)
2 23.70
8 26.06
14 21.66
15 20.42
16 21.46
32 23.84
42 22.32
46 27.18
49 22.34
53 21.36
Name: Temperature, dtype: float64
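KNNImputer from sklearn.impute is another drop-in option: it fills each gap with the mean over the nearest rows, measured on the non-missing features. A minimal sketch on a toy array (not the Sendy data):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])

# Each NaN is replaced by the mean of that column over the
# n_neighbors rows closest in the remaining features
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Here the missing value's two nearest rows (by the first column) hold 2.0 and 6.0, so it is filled with their mean, 4.0.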
Handling categorical features
Categorical features take only a limited set of values, such as the popularity of a place/item or an app's rating level (1, 2, 3, 4, 5).
In this article, the education level level_of_education_clients in loan_demographics is categorical, with values Secondary, Graduate, Post-Graduate, and Primary.
Machine learning algorithms cannot operate on categorical features in their raw form, so they must be converted to numerical form; this process is called encoding.
There are many ways to encode, depending on the type of categorical feature.
Types of categorical features
1. Ordinal categorical features: have a natural order, e.g. rating levels (1, 2, 3, 4, 5).
2. Nominal categorical features: have no inherent order, e.g. foods (rice, pasta, macaroni).
Encoding ordinal features: when there are few categories, as with level_of_education_clients, they can be mapped by hand.
loan_demographics['level_of_education_clients'].unique()
array([nan, 'Secondary', 'Graduate', 'Post-Graduate', 'Primary'],
dtype=object)
#use a simple map function
map_education = {"Primary" : 1, "Secondary": 2, "Graduate": 3, "Post-Graduate": 4}
loan_demographics['level_of_education_clients'] = loan_demographics['level_of_education_clients'].map(map_education)
loan_demographics['level_of_education_clients'].value_counts()
3.0 420
2.0 89
4.0 68
1.0 10
Name: level_of_education_clients, dtype: int64
For automatic encoding, the category_encoders library offers a large collection of encoders.
!pip install category_encoders
Collecting category_encoders
Downloading category_encoders-2.1.0-py2.py3-none-any.whl (100 kB)
Requirement already satisfied: numpy>=1.11.3 in c:\anaconda3\envs\mytf\lib\site-packages (from category_encoders) (1.18.1)
Collecting patsy>=0.4.1
Downloading patsy-0.5.1-py2.py3-none-any.whl (231 kB)
Requirement already satisfied: scipy>=0.19.0 in c:\anaconda3\envs\mytf\lib\site-packages (from category_encoders) (1.4.1)
Requirement already satisfied: scikit-learn>=0.20.0 in c:\anaconda3\envs\mytf\lib\site-packages (from category_encoders) (0.22.2.post1)
Requirement already satisfied: pandas>=0.21.1 in c:\anaconda3\envs\mytf\lib\site-packages (from category_encoders) (1.0.2)
Collecting statsmodels>=0.6.1
Downloading statsmodels-0.11.1-cp37-none-win_amd64.whl (8.2 MB)
Requirement already satisfied: six in c:\anaconda3\envs\mytf\lib\site-packages (from patsy>=0.4.1->category_encoders) (1.14.0)
Requirement already satisfied: joblib>=0.11 in c:\anaconda3\envs\mytf\lib\site-packages (from scikit-learn>=0.20.0->category_encoders) (0.14.1)
Requirement already satisfied: python-dateutil>=2.6.1 in c:\anaconda3\envs\mytf\lib\site-packages (from pandas>=0.21.1->category_encoders) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in c:\anaconda3\envs\mytf\lib\site-packages (from pandas>=0.21.1->category_encoders) (2019.3)
Installing collected packages: patsy, statsmodels, category-encoders
Successfully installed category-encoders-2.1.0 patsy-0.5.1 statsmodels-0.11.1
When an ordinal feature has many categories, label encoding can be used: each category is assigned a unique integer.
The features bank_name_clients and bank_branch_clients, which both have many categories, serve as the example.
cat_cols = loan_demographics.select_dtypes(include='object').columns # object dtype columns are the categorical features
for col in cat_cols:
print("Number of classes in {}".format(col))
print(loan_demographics[col].nunique())
print('--------------------------')
Number of classes in customerid
4334
--------------------------
Number of classes in birthdate
3297
--------------------------
Number of classes in bank_account_type
3
--------------------------
Number of classes in bank_name_clients
18
--------------------------
Number of classes in bank_branch_clients
45
--------------------------
Number of classes in employment_status_clients
6
--------------------------
loan_demographics.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4346 entries, 0 to 4345
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerid 4346 non-null object
1 birthdate 4346 non-null object
2 bank_account_type 4346 non-null object
3 longitude_gps 4346 non-null float64
4 latitude_gps 4346 non-null float64
5 bank_name_clients 4346 non-null object
6 bank_branch_clients 51 non-null object
7 employment_status_clients 4346 non-null object
8 level_of_education_clients 587 non-null float64
dtypes: float64(3), object(6)
memory usage: 305.7+ KB
The OrdinalEncoder class from the category_encoders library can perform this encoding.
import category_encoders as ce
#Label encoding
cat_cols = ['bank_name_clients', 'bank_branch_clients']
encoder = ce.OrdinalEncoder(cols=cat_cols)
loan_demographics = encoder.fit_transform(loan_demographics)
loan_demographics.bank_name_clients.value_counts()
1 1598
11 590
4 432
8 353
9 301
10 282
5 150
13 145
6 121
3 103
7 102
2 55
18 28
12 23
15 22
16 19
14 15
17 7
Name: bank_name_clients, dtype: int64
One-hot encoding
One-hot encoding represents each category with a binary value, creating one column per category. It becomes inefficient as the number of possible values grows.
cats = ['bank_account_type','level_of_education_clients']
one_hot_enc = ce.OneHotEncoder(cols=cats)
loan_demographics = one_hot_enc.fit_transform(loan_demographics)
loan_demographics.head().T
0 | 1 | 2 | 3 | 4 | |
customerid | 8a858e135cb22031015cbafc76964ebd | 8a858e275c7ea5ec015c82482d7c3996 | 8a858e5b5bd99460015bdc95cd485634 | 8a858efd5ca70688015cabd1f1e94b55 | 8a858e785acd3412015acd48f4920d04 |
birthdate | 1973-10-10 00:00:00.000000 | 1986-01-21 00:00:00.000000 | 1987-04-01 00:00:00.000000 | 1991-07-19 00:00:00.000000 | 1982-11-22 00:00:00.000000 |
bank_account_type_1 | 1 | 1 | 1 | 1 | 1 |
bank_account_type_2 | 0 | 0 | 0 | 0 | 0 |
bank_account_type_3 | 0 | 0 | 0 | 0 | 0 |
longitude_gps | 3.31922 | 3.3256 | 5.7461 | 3.36285 | 8.45533 |
latitude_gps | 6.5286 | 7.1194 | 5.56317 | 6.64249 | 11.9714 |
bank_name_clients | 1 | 2 | 3 | 1 | 1 |
bank_branch_clients | 1 | 1 | 1 | 1 | 1 |
employment_status_clients | Permanent | Permanent | Permanent | Permanent | Permanent |
level_of_education_clients_1 | 1 | 1 | 1 | 1 | 1 |
level_of_education_clients_2 | 0 | 0 | 0 | 0 | 0 |
level_of_education_clients_3 | 0 | 0 | 0 | 0 | 0 |
level_of_education_clients_4 | 0 | 0 | 0 | 0 | 0 |
level_of_education_clients_5 | 0 | 0 | 0 | 0 | 0 |
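The same result can be obtained without category_encoders, using pandas' built-in get_dummies. A minimal sketch on a toy frame (values mirror bank_account_type):

```python
import pandas as pd

df = pd.DataFrame({'bank_account_type': ['Savings', 'Current', 'Other', 'Savings']})

# One binary column per category; the prefix keeps column names readable
dummies = pd.get_dummies(df['bank_account_type'], prefix='bank_account_type')
print(dummies.columns.tolist())
```

Note that get_dummies only knows the categories seen in the frame it is given, so train and test sets must be encoded together (or the columns realigned afterwards).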
Hash encoding
Hash encoding is a fast, space-efficient way to encode features, and works well when a categorical feature takes many values. A hash function is applied to each feature value.
cat_cols = ['bank_name_clients', 'bank_branch_clients']
hash_enc = ce.HashingEncoder(cols=cat_cols, n_components=10)
loan_demographics = hash_enc.fit_transform(loan_demographics)
loan_demographics.head()
customerid | birthdate | bank_account_type | longitude_gps | latitude_gps | bank_name_clients | bank_branch_clients | employment_status_clients | level_of_education_clients | |
0 | 8a858e135cb22031015cbafc76964ebd | 1973-10-10 00:00:00.000000 | Savings | 3.319219 | 6.528604 | 1 | 1 | Permanent | NaN |
1 | 8a858e275c7ea5ec015c82482d7c3996 | 1986-01-21 00:00:00.000000 | Savings | 3.325598 | 7.119403 | 2 | 1 | Permanent | NaN |
2 | 8a858e5b5bd99460015bdc95cd485634 | 1987-04-01 00:00:00.000000 | Savings | 5.746100 | 5.563174 | 3 | 1 | Permanent | NaN |
3 | 8a858efd5ca70688015cabd1f1e94b55 | 1991-07-19 00:00:00.000000 | Savings | 3.362850 | 6.642485 | 1 | 1 | Permanent | NaN |
4 | 8a858e785acd3412015acd48f4920d04 | 1982-11-22 00:00:00.000000 | Savings | 8.455332 | 11.971410 | 1 | 1 | Permanent | NaN |
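The idea behind the hashing trick can be sketched in a few lines. This is a simplified illustration, not category_encoders' exact implementation: each category string is hashed and reduced modulo the number of output columns, so the width stays fixed no matter how many categories appear (at the cost of possible collisions).

```python
import hashlib

def hash_bucket(value, n_components=10):
    # Stable string hash, reduced modulo the number of output columns;
    # distinct categories may collide in the same bucket
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_components

for bank in ['GT Bank', 'UBA', 'Sterling Bank']:
    print(bank, '->', hash_bucket(bank))
```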
More encoders are documented at: https://contrib.scikit-learn.org/categorical-encoding/
Handling numerical/continuous features
Numerical/continuous features are the most common kind of feature in a dataset. Engineering them often relies on domain knowledge; some common transformations follow.
Log transform: pulls the distribution toward normal, which helps most algorithms.
Note: when the target feature is log-transformed, predictions must be exponentiated back when interpreting the model.
The log transform is typically applied to skewed features, whether left- or right-skewed; skewness can be judged from a plot.
The Distance (KM) feature of the Sendy dataset, analysed here, is right-skewed.
sns.distplot(sendy_data['Distance (KM)'])
plt.title("Histogram of Distance (KM)")
plt.show()
sendy_data['Distance (KM)'] = np.log1p(sendy_data['Distance (KM)']) # add 1 before taking the log, to avoid log(0)
sns.distplot(sendy_data['Distance (KM)'])
plt.title("Log-Transformed plot of Distance (KM)")
plt.show()
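The inverse transform mentioned above is np.expm1, which exactly undoes np.log1p. A minimal sketch on illustrative values:

```python
import numpy as np

amounts = np.array([0.0, 9.0, 99.0])

# log1p computes log(1 + x), so a zero value stays well defined
logged = np.log1p(amounts)

# expm1 computes exp(x) - 1, recovering the original scale
recovered = np.expm1(logged)
print(recovered)
```

When the target was log1p-transformed before fitting, apply np.expm1 to the model's predictions before reporting them.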
Creating features from domain knowledge
With some understanding of the data's domain, features that help the model can be constructed. Examples:
1. Interest (interest_elapsed): the difference between the total amount due (totaldue) and the loan amount (loanamount)
loan_prev['interest_elapsed'] = loan_prev['totaldue'] - loan_prev['loanamount']
loan_prev['interest_elapsed']
0 3000.0
1 3000.0
2 3800.0
3 1500.0
4 1500.0
...
18178 3000.0
18179 4400.0
18180 3000.0
18181 3000.0
18182 3000.0
Name: interest_elapsed, Length: 18183, dtype: float64
loan_prev
customerid | systemloanid | loannumber | approveddate | creationdate | loanamount | totaldue | termdays | closeddate | referredby | firstduedate | firstrepaiddate | interest_elapsed | |
0 | 8a2a81a74ce8c05d014cfb32a0da1049 | 301682320 | 2 | 2016-08-15 18:22:40.000000 | 2016-08-15 17:22:32.000000 | 10000.0 | 13000.0 | 30 | 2016-09-01 16:06:48.000000 | NaN | 2016-09-14 00:00:00.000000 | 2016-09-01 15:51:43.000000 | 3000.0 |
1 | 8a2a81a74ce8c05d014cfb32a0da1049 | 301883808 | 9 | 2017-04-28 18:39:07.000000 | 2017-04-28 17:38:53.000000 | 10000.0 | 13000.0 | 30 | 2017-05-28 14:44:49.000000 | NaN | 2017-05-30 00:00:00.000000 | 2017-05-26 00:00:00.000000 | 3000.0 |
2 | 8a2a81a74ce8c05d014cfb32a0da1049 | 301831714 | 8 | 2017-03-05 10:56:25.000000 | 2017-03-05 09:56:19.000000 | 20000.0 | 23800.0 | 30 | 2017-04-26 22:18:56.000000 | NaN | 2017-04-04 00:00:00.000000 | 2017-04-26 22:03:47.000000 | 3800.0 |
3 | 8a8588f35438fe12015444567666018e | 301861541 | 5 | 2017-04-09 18:25:55.000000 | 2017-04-09 17:25:42.000000 | 10000.0 | 11500.0 | 15 | 2017-04-24 01:35:52.000000 | NaN | 2017-04-24 00:00:00.000000 | 2017-04-24 00:48:43.000000 | 1500.0 |
4 | 8a85890754145ace015429211b513e16 | 301941754 | 2 | 2017-06-17 09:29:57.000000 | 2017-06-17 08:29:50.000000 | 10000.0 | 11500.0 | 15 | 2017-07-14 21:18:43.000000 | NaN | 2017-07-03 00:00:00.000000 | 2017-07-14 21:08:35.000000 | 1500.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
18178 | 8a858899538ddb8e0153a2b555421fc5 | 301611754 | 2 | 2016-04-16 13:36:34.000000 | 2016-04-16 12:36:28.000000 | 10000.0 | 13000.0 | 30 | 2016-05-14 00:04:52.000000 | NaN | 2016-05-16 00:00:00.000000 | 2016-05-13 18:05:07.000000 | 3000.0 |
18179 | 8a858899538ddb8e0153a2b555421fc5 | 301761267 | 9 | 2016-11-18 14:26:07.000000 | 2016-11-18 13:25:51.000000 | 30000.0 | 34400.0 | 30 | 2016-12-13 16:08:57.000000 | NaN | 2016-12-19 00:00:00.000000 | 2016-12-13 15:53:48.000000 | 4400.0 |
18180 | 8a858899538ddb8e0153a2b555421fc5 | 301631653 | 4 | 2016-06-12 15:30:56.000000 | 2016-06-12 14:30:50.000000 | 10000.0 | 13000.0 | 30 | 2016-07-09 15:39:00.000000 | NaN | 2016-07-12 00:00:00.000000 | 2016-07-09 15:23:56.000000 | 3000.0 |
18181 | 8a858f0656b7820c0156c92ca3ba436f | 301697691 | 1 | 2016-08-27 20:03:45.000000 | 2016-08-27 19:03:34.000000 | 10000.0 | 13000.0 | 30 | 2016-10-15 10:17:54.000000 | NaN | 2016-09-26 00:00:00.000000 | 2016-10-15 10:02:45.000000 | 3000.0 |
18182 | 8a858faf5679a838015688de3028143d | 301715255 | 2 | 2016-09-14 23:42:14.000000 | 2016-09-14 22:42:05.000000 | 10000.0 | 13000.0 | 30 | 2016-09-29 19:51:04.000000 | NaN | 2016-10-14 00:00:00.000000 | 2016-09-29 19:35:55.000000 | 3000.0 |
18183 rows × 13 columns
2. Loan count: group by customer id and count loannumber
loan_prev.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18183 entries, 0 to 18182
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerid 18183 non-null object
1 systemloanid 18183 non-null int64
2 loannumber 18183 non-null int64
3 approveddate 18183 non-null object
4 creationdate 18183 non-null object
5 loanamount 18183 non-null float64
6 totaldue 18183 non-null float64
7 termdays 18183 non-null int64
8 closeddate 18183 non-null object
9 referredby 1026 non-null object
10 firstduedate 18183 non-null object
11 firstrepaiddate 18183 non-null object
12 interest_elapsed 18183 non-null float64
dtypes: float64(3), int64(3), object(7)
memory usage: 1.8+ MB
#Groupby customer id and calculate their total loans taken
loannumber_count = loan_prev.groupby(by='customerid').agg(['count'])[['loannumber']].reset_index()
#merge back to dataset on customer_id
loan_prev = loan_prev.merge(right=loannumber_count, how='left', on='customerid')
loan_prev.head()
customerid | systemloanid | loannumber | approveddate | creationdate | loanamount | totaldue | termdays | closeddate | referredby | firstduedate | firstrepaiddate | (loannumber, count) | |
0 | 8a2a81a74ce8c05d014cfb32a0da1049 | 301682320 | 2 | 2016-08-15 18:22:40.000000 | 2016-08-15 17:22:32.000000 | 10000.0 | 13000.0 | 30 | 2016-09-01 16:06:48.000000 | NaN | 2016-09-14 00:00:00.000000 | 2016-09-01 15:51:43.000000 | 11 |
1 | 8a2a81a74ce8c05d014cfb32a0da1049 | 301883808 | 9 | 2017-04-28 18:39:07.000000 | 2017-04-28 17:38:53.000000 | 10000.0 | 13000.0 | 30 | 2017-05-28 14:44:49.000000 | NaN | 2017-05-30 00:00:00.000000 | 2017-05-26 00:00:00.000000 | 11 |
2 | 8a2a81a74ce8c05d014cfb32a0da1049 | 301831714 | 8 | 2017-03-05 10:56:25.000000 | 2017-03-05 09:56:19.000000 | 20000.0 | 23800.0 | 30 | 2017-04-26 22:18:56.000000 | NaN | 2017-04-04 00:00:00.000000 | 2017-04-26 22:03:47.000000 | 11 |
3 | 8a8588f35438fe12015444567666018e | 301861541 | 5 | 2017-04-09 18:25:55.000000 | 2017-04-09 17:25:42.000000 | 10000.0 | 11500.0 | 15 | 2017-04-24 01:35:52.000000 | NaN | 2017-04-24 00:00:00.000000 | 2017-04-24 00:48:43.000000 | 6 |
4 | 8a85890754145ace015429211b513e16 | 301941754 | 2 | 2017-06-17 09:29:57.000000 | 2017-06-17 08:29:50.000000 | 10000.0 | 11500.0 | 15 | 2017-07-14 21:18:43.000000 | NaN | 2017-07-03 00:00:00.000000 | 2017-07-14 21:08:35.000000 | 2 |
loan_prev.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18183 entries, 0 to 18182
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerid 18183 non-null object
1 systemloanid 18183 non-null int64
2 loannumber 18183 non-null int64
3 approveddate 18183 non-null object
4 creationdate 18183 non-null object
5 loanamount 18183 non-null float64
6 totaldue 18183 non-null float64
7 termdays 18183 non-null int64
8 closeddate 18183 non-null object
9 referredby 1026 non-null object
10 firstduedate 18183 non-null object
11 firstrepaiddate 18183 non-null object
12 (loannumber, count) 18183 non-null int64
dtypes: float64(2), int64(4), object(7)
memory usage: 1.9+ MB
3. Speed: the ratio of the delivery distance Distance (KM) to the time Time from Pickup to Arrival
#create feature speed in sendy dataset
sendy_data['speed'] = sendy_data['Distance (KM)'] / sendy_data['Time from Pickup to Arrival']
sendy_data.head().T
0 | 1 | 2 | 3 | 4 | |
Order No | Order_No_4211 | Order_No_25375 | Order_No_1899 | Order_No_9336 | Order_No_27883 |
User Id | User_Id_633 | User_Id_2285 | User_Id_265 | User_Id_1402 | User_Id_1737 |
Vehicle Type | Bike | Bike | Bike | Bike | Bike |
Platform Type | 3 | 3 | 3 | 3 | 1 |
Personal or Business | Business | Personal | Business | Business | Personal |
Placement - Day of Month | 9 | 12 | 30 | 15 | 13 |
Placement - Weekday (Mo = 1) | 5 | 5 | 2 | 5 | 1 |
Placement - Time | 9:35:46 AM | 11:16:16 AM | 12:39:25 PM | 9:25:34 AM | 9:55:18 AM |
Confirmation - Day of Month | 9 | 12 | 30 | 15 | 13 |
Confirmation - Weekday (Mo = 1) | 5 | 5 | 2 | 5 | 1 |
Confirmation - Time | 9:40:10 AM | 11:23:21 AM | 12:42:44 PM | 9:26:05 AM | 9:56:18 AM |
Arrival at Pickup - Day of Month | 9 | 12 | 30 | 15 | 13 |
Arrival at Pickup - Weekday (Mo = 1) | 5 | 5 | 2 | 5 | 1 |
Arrival at Pickup - Time | 10:04:47 AM | 11:40:22 AM | 12:49:34 PM | 9:37:56 AM | 10:03:53 AM |
Pickup - Day of Month | 9 | 12 | 30 | 15 | 13 |
Pickup - Weekday (Mo = 1) | 5 | 5 | 2 | 5 | 1 |
Pickup - Time | 10:27:30 AM | 11:44:09 AM | 12:53:03 PM | 9:43:06 AM | 10:05:23 AM |
Arrival at Destination - Day of Month | 9 | 12 | 30 | 15 | 13 |
Arrival at Destination - Weekday (Mo = 1) | 5 | 5 | 2 | 5 | 1 |
Arrival at Destination - Time | 10:39:55 AM | 12:17:22 PM | 1:00:38 PM | 10:05:27 AM | 10:25:37 AM |
Distance (KM) | 1.60944 | 2.83321 | 1.38629 | 2.30259 | 2.30259 |
Temperature | 20.4 | 26.4 | 23.7 | 19.2 | 15.4 |
Precipitation in millimeters | NaN | NaN | NaN | NaN | NaN |
Pickup Lat | -1.31775 | -1.35145 | -1.30828 | -1.2813 | -1.2666 |
Pickup Long | 36.8304 | 36.8993 | 36.8434 | 36.8324 | 36.7921 |
Destination Lat | -1.30041 | -1.295 | -1.30092 | -1.25715 | -1.29504 |
Destination Long | 36.8297 | 36.8144 | 36.8282 | 36.7951 | 36.8098 |
Rider Id | Rider_Id_432 | Rider_Id_856 | Rider_Id_155 | Rider_Id_855 | Rider_Id_770 |
Time from Pickup to Arrival | 745 | 1993 | 455 | 1341 | 1214 |
speed | 0.00216032 | 0.00142158 | 0.0030468 | 0.00171707 | 0.00189669 |
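The tiny speed values above are kilometers per second, because Time from Pickup to Arrival is recorded in seconds. A small sketch of a more readable variant (the column name speed_kmh and the stand-in frame are ours, not from the dataset):

```python
import pandas as pd

# Tiny stand-in frame mirroring the two columns used above.
df = pd.DataFrame({
    'Distance (KM)': [1.60944, 2.83321],
    'Time from Pickup to Arrival': [745, 1993],  # seconds
})

# km / s * 3600 = km / h
df['speed_kmh'] = df['Distance (KM)'] / df['Time from Pickup to Arrival'] * 3600
```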
Polynomial (cross) features
Polynomial features create interactions between existing features, which helps capture relationships among them and can reduce model bias (as long as the model does not overfit).
Cross features can be built by adding, multiplying, or dividing features.
Here we use sklearn's preprocessing module to build crosses from loannumber, totaldue, and termdays in the loan previous dataset.
#Use sklearn's PolynomialFeatures
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures() #degree=2 by default; output includes a bias column of ones
to_cross = ['loannumber', 'totaldue', 'termdays']
crossed_feats = poly.fit_transform(loan_prev[to_cross].values)
#Convert to Pandas DataFrame and merge to original dataset
crossed_feats = pd.DataFrame(crossed_feats)
loan_prev = pd.concat([loan_prev, crossed_feats], axis=1)
loan_prev.head().T
0 | 1 | 2 | 3 | 4 | |
customerid | 8a2a81a74ce8c05d014cfb32a0da1049 | 8a2a81a74ce8c05d014cfb32a0da1049 | 8a2a81a74ce8c05d014cfb32a0da1049 | 8a8588f35438fe12015444567666018e | 8a85890754145ace015429211b513e16 |
systemloanid | 301682320 | 301883808 | 301831714 | 301861541 | 301941754 |
loannumber | 2 | 9 | 8 | 5 | 2 |
approveddate | 2016-08-15 18:22:40.000000 | 2017-04-28 18:39:07.000000 | 2017-03-05 10:56:25.000000 | 2017-04-09 18:25:55.000000 | 2017-06-17 09:29:57.000000 |
creationdate | 2016-08-15 17:22:32.000000 | 2017-04-28 17:38:53.000000 | 2017-03-05 09:56:19.000000 | 2017-04-09 17:25:42.000000 | 2017-06-17 08:29:50.000000 |
loanamount | 10000 | 10000 | 20000 | 10000 | 10000 |
totaldue | 13000 | 13000 | 23800 | 11500 | 11500 |
termdays | 30 | 30 | 30 | 15 | 15 |
closeddate | 2016-09-01 16:06:48.000000 | 2017-05-28 14:44:49.000000 | 2017-04-26 22:18:56.000000 | 2017-04-24 01:35:52.000000 | 2017-07-14 21:18:43.000000 |
referredby | NaN | NaN | NaN | NaN | NaN |
firstduedate | 2016-09-14 00:00:00.000000 | 2017-05-30 00:00:00.000000 | 2017-04-04 00:00:00.000000 | 2017-04-24 00:00:00.000000 | 2017-07-03 00:00:00.000000 |
firstrepaiddate | 2016-09-01 15:51:43.000000 | 2017-05-26 00:00:00.000000 | 2017-04-26 22:03:47.000000 | 2017-04-24 00:48:43.000000 | 2017-07-14 21:08:35.000000 |
(loannumber, count) | 11 | 11 | 11 | 6 | 2 |
0 | 1 | 1 | 1 | 1 | 1 |
1 | 2 | 9 | 8 | 5 | 2 |
2 | 13000 | 13000 | 23800 | 11500 | 11500 |
3 | 30 | 30 | 30 | 15 | 15 |
4 | 4 | 81 | 64 | 25 | 4 |
5 | 26000 | 117000 | 190400 | 57500 | 23000 |
6 | 60 | 270 | 240 | 75 | 30 |
7 | 1.69e+08 | 1.69e+08 | 5.6644e+08 | 1.3225e+08 | 1.3225e+08 |
8 | 390000 | 390000 | 714000 | 172500 | 172500 |
9 | 900 | 900 | 900 | 225 | 225 |
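Besides PolynomialFeatures, crosses can be hand-crafted with plain arithmetic. A minimal sketch on the same three columns (the stand-in frame and the new column names due_per_day and loannum_x_termdays are ours; the product cross reproduces column 6 of the table above):

```python
import pandas as pd

# Stand-in for loan_prev with the three columns being crossed.
loans = pd.DataFrame({
    'loannumber': [2, 9, 8],
    'totaldue': [13000, 13000, 23800],
    'termdays': [30, 30, 30],
})

loans['due_per_day'] = loans['totaldue'] / loans['termdays']           # division cross
loans['loannum_x_termdays'] = loans['loannumber'] * loans['termdays']  # product cross
```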
Feature scaling
Scaling rescales numeric features to a common range without distorting the relative differences between their values or losing information.
Scaling is essential for distance-based models such as KNN, and it also helps neural networks train faster.
Scalers provided by sklearn:
StandardScaler: removes the mean and scales to unit variance.
RobustScaler: scales features using statistics that are robust to outliers.
MinMaxScaler: scales each feature to a given range (the range is configurable).
Note: never fit a scaler on the test/validation set; fit it on the training set only and reuse it to transform the other sets.
sklearn's scalers are not robust to missing values, so impute missing values before scaling.
from sklearn.preprocessing import StandardScaler
feats = ['loannumber', 'totaldue', 'termdays']
sc = StandardScaler()
sc_data = sc.fit_transform(loan_prev[feats])
sc_data
array([[-0.67377132, -0.62877649, 0.30213166],
[ 1.48047238, -0.62877649, 0.30213166],
[ 1.17272328, 0.40432506, 0.30213166],
...,
[-0.05827312, -0.62877649, 0.30213166],
[-0.98152042, -0.62877649, 0.30213166],
[-0.67377132, -0.62877649, 0.30213166]])
sc_data.shape
(18183, 3)
from sklearn.preprocessing import RobustScaler
robsc = RobustScaler()
rb_data = robsc.fit_transform(loan_prev[feats])
rb_data
array([[-0.25 , 0. , 0. ],
[ 1.5 , 0. , 0. ],
[ 1.25 , 0.83076923, 0. ],
...,
[ 0.25 , 0. , 0. ],
[-0.5 , 0. , 0. ],
[-0.25 , 0. , 0. ]])
from sklearn.preprocessing import MinMaxScaler
minsc = MinMaxScaler(feature_range=(0,2))
minmax_data = minsc.fit_transform(loan_prev[feats])
minmax_data
array([[0.08 , 0.29543697, 0.4 ],
[0.64 , 0.29543697, 0.4 ],
[0.56 , 0.6295437 , 0.4 ],
...,
[0.24 , 0.29543697, 0.4 ],
[0. , 0.29543697, 0.4 ],
[0.08 , 0.29543697, 0.4 ]])
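To honor the note above about not fitting scalers on held-out data, the usual pattern is fit_transform on the training fold and transform only on the test fold. A minimal sketch with a toy matrix (the variable names are ours):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(12, dtype=float).reshape(-1, 1)  # toy single-feature matrix
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)  # statistics learned from the training fold only
X_test_sc = sc.transform(X_test)        # reused, never re-fit, on the held-out fold
```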
Date feature engineering
Timestamps support many derived features, such as time differences.
Below we compute the gap between the application time (creationdate) and the approval time (approveddate) in the loan performance dataset.
#First convert to pandas datetime format
loan_perf['approveddate'] = pd.to_datetime(loan_perf['approveddate'])
loan_perf['creationdate'] = pd.to_datetime(loan_perf['creationdate'])
loan_perf['date_elapsed_in_secs'] = (loan_perf['approveddate'] - loan_perf['creationdate']) / np.timedelta64(1,'s') #substitute 's' with 'm' or 'h' for minutes or hours
loan_perf.head()
customerid | systemloanid | loannumber | approveddate | creationdate | loanamount | totaldue | termdays | referredby | good_bad_flag | date_elapsed_in_secs | |
0 | 8a2a81a74ce8c05d014cfb32a0da1049 | 301994762 | 12 | 2017-07-25 08:22:56 | 2017-07-25 07:22:47 | 30000.0 | 34500.0 | 30 | NaN | 0 | 3609.0 |
1 | 8a85886e54beabf90154c0a29ae757c0 | 301965204 | 2 | 2017-07-05 17:04:41 | 2017-07-05 16:04:18 | 15000.0 | 17250.0 | 30 | NaN | 0 | 3623.0 |
2 | 8a8588f35438fe12015444567666018e | 301966580 | 7 | 2017-07-06 14:52:57 | 2017-07-06 13:52:51 | 20000.0 | 22250.0 | 15 | NaN | 0 | 3606.0 |
3 | 8a85890754145ace015429211b513e16 | 301999343 | 3 | 2017-07-27 19:00:41 | 2017-07-27 18:00:35 | 10000.0 | 11500.0 | 15 | NaN | 0 | 3606.0 |
4 | 8a858970548359cc0154883481981866 | 301962360 | 9 | 2017-07-03 23:42:45 | 2017-07-03 22:42:39 | 40000.0 | 44000.0 | 30 | NaN | 0 | 3606.0 |
The following shows how to extract date parts such as the day, week, and hour:
#First convert to pandas datetime format
loan_perf['approveddate'] = pd.to_datetime(loan_perf['approveddate'])
#use pandas built in functions
loan_perf['approved_day'] = loan_perf['approveddate'].dt.day
loan_perf['approved_week'] = loan_perf['approveddate'].dt.isocalendar().week #dt.week is deprecated in recent pandas
loan_perf['approved_hour'] = loan_perf['approveddate'].dt.hour
Pandas datetime accessors expose many other attributes worth exploring.
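A small sketch of a few of those accessors, shown on a stand-in series (the real column would be loan_perf['approveddate']):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2017-07-25 08:22:56', '2017-07-05 17:04:41']))

months = dates.dt.month            # calendar month
weekdays = dates.dt.dayofweek      # Monday=0 ... Sunday=6
month_end = dates.dt.is_month_end  # True on the last day of a month
```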
We can also bucket hours into periods of the day (morning, afternoon, evening):
def map_hours(x):
    if x <= 12:
        return 'morning'
    elif x <= 16:
        return 'afternoon'
    else:
        return 'evening'
loan_perf['period_of_day'] = loan_perf['approved_hour'].map(map_hours)
loan_perf.head()
customerid | systemloanid | loannumber | approveddate | creationdate | loanamount | totaldue | termdays | referredby | good_bad_flag | date_elapsed_in_secs | approved_day | approved_week | approved_hour | period_of_day | |
0 | 8a2a81a74ce8c05d014cfb32a0da1049 | 301994762 | 12 | 2017-07-25 08:22:56 | 2017-07-25 07:22:47 | 30000.0 | 34500.0 | 30 | NaN | 0 | 3609.0 | 25 | 30 | 8 | morning |
1 | 8a85886e54beabf90154c0a29ae757c0 | 301965204 | 2 | 2017-07-05 17:04:41 | 2017-07-05 16:04:18 | 15000.0 | 17250.0 | 30 | NaN | 0 | 3623.0 | 5 | 27 | 17 | evening |
2 | 8a8588f35438fe12015444567666018e | 301966580 | 7 | 2017-07-06 14:52:57 | 2017-07-06 13:52:51 | 20000.0 | 22250.0 | 15 | NaN | 0 | 3606.0 | 6 | 27 | 14 | afternoon |
3 | 8a85890754145ace015429211b513e16 | 301999343 | 3 | 2017-07-27 19:00:41 | 2017-07-27 18:00:35 | 10000.0 | 11500.0 | 15 | NaN | 0 | 3606.0 | 27 | 30 | 19 | evening |
4 | 8a858970548359cc0154883481981866 | 301962360 | 9 | 2017-07-03 23:42:45 | 2017-07-03 22:42:39 | 40000.0 | 44000.0 | 30 | NaN | 0 | 3606.0 | 3 | 27 | 23 | evening |
The datasist library ships more date-handling helpers; see: https://towardsdatascience.com/https-medium-com-risingdeveloper-easy-data-analysis-visualization-and-modeling-using-datasist-part1-8b26526dbe01
Geolocation feature engineering
Latitude/longitude pairs and addresses in a dataset are geolocation features.
Many transformations are possible; libraries such as GeoJSON or Geopy can reverse-geocode coordinates into street addresses.
Those lookups are slow, however, so this article shows simpler and faster alternatives.
The examples below come from Kaggle kernels.
Manhattan distance: the sum of the horizontal and vertical distances between two points. Computed directly on raw latitude/longitude, the result is in degrees rather than kilometers, but it still works as a relative distance feature.
#Manhattan distance
def manhattan_distance(lat1, lng1, lat2, lng2):
a = np.abs(lat2 -lat1)
b = np.abs(lng1 - lng2)
return a + b
sendy_data['manhattan_dist'] = manhattan_distance(sendy_data['Pickup Lat'].values, sendy_data['Pickup Long'].values,
sendy_data['Destination Lat'].values, sendy_data['Destination Long'].values)
sendy_data.head()
Order No | User Id | Vehicle Type | Platform Type | Personal or Business | Placement - Day of Month | Placement - Weekday (Mo = 1) | Placement - Time | Confirmation - Day of Month | Confirmation - Weekday (Mo = 1) | ... | Temperature | Precipitation in millimeters | Pickup Lat | Pickup Long | Destination Lat | Destination Long | Rider Id | Time from Pickup to Arrival | speed | manhattan_dist | |
0 | Order_No_4211 | User_Id_633 | Bike | 3 | Business | 9 | 5 | 9:35:46 AM | 9 | 5 | ... | 20.4 | NaN | -1.317755 | 36.830370 | -1.300406 | 36.829741 | Rider_Id_432 | 745.0 | 0.002160 | 0.017978 |
1 | Order_No_25375 | User_Id_2285 | Bike | 3 | Personal | 12 | 5 | 11:16:16 AM | 12 | 5 | ... | 26.4 | NaN | -1.351453 | 36.899315 | -1.295004 | 36.814358 | Rider_Id_856 | 1993.0 | 0.001422 | 0.141406 |
2 | Order_No_1899 | User_Id_265 | Bike | 3 | Business | 30 | 2 | 12:39:25 PM | 30 | 2 | ... | 23.7 | NaN | -1.308284 | 36.843419 | -1.300921 | 36.828195 | Rider_Id_155 | 455.0 | 0.003047 | 0.022588 |
3 | Order_No_9336 | User_Id_1402 | Bike | 3 | Business | 15 | 5 | 9:25:34 AM | 15 | 5 | ... | 19.2 | NaN | -1.281301 | 36.832396 | -1.257147 | 36.795063 | Rider_Id_855 | 1341.0 | 0.001717 | 0.061487 |
4 | Order_No_27883 | User_Id_1737 | Bike | 1 | Personal | 13 | 1 | 9:55:18 AM | 13 | 1 | ... | 15.4 | NaN | -1.266597 | 36.792118 | -1.295041 | 36.809817 | Rider_Id_770 | 1214.0 | 0.001897 | 0.046143 |
5 rows × 31 columns
Haversine distance: the great-circle distance between two points on a sphere, widely used in navigation.
#Haversine distance
def haversine_array(lat1, lng1, lat2, lng2):
lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
AVG_EARTH_RADIUS = 6371 # in km
lat = lat2 - lat1
lng = lng2 - lng1
d = np.sin(lat * 0.5) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(lng * 0.5) ** 2
h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
return h
sendy_data['haversine_dist'] = haversine_array(sendy_data['Pickup Lat'].values, sendy_data['Pickup Long'].values,
sendy_data['Destination Lat'].values, sendy_data['Destination Long'].values)
sendy_data.head()
Order No | User Id | Vehicle Type | Platform Type | Personal or Business | Placement - Day of Month | Placement - Weekday (Mo = 1) | Placement - Time | Confirmation - Day of Month | Confirmation - Weekday (Mo = 1) | ... | Precipitation in millimeters | Pickup Lat | Pickup Long | Destination Lat | Destination Long | Rider Id | Time from Pickup to Arrival | speed | manhattan_dist | haversine_dist | |
0 | Order_No_4211 | User_Id_633 | Bike | 3 | Business | 9 | 5 | 9:35:46 AM | 9 | 5 | ... | NaN | -1.317755 | 36.830370 | -1.300406 | 36.829741 | Rider_Id_432 | 745.0 | 0.002160 | 0.017978 | 1.930333 |
1 | Order_No_25375 | User_Id_2285 | Bike | 3 | Personal | 12 | 5 | 11:16:16 AM | 12 | 5 | ... | NaN | -1.351453 | 36.899315 | -1.295004 | 36.814358 | Rider_Id_856 | 1993.0 | 0.001422 | 0.141406 | 11.339849 |
2 | Order_No_1899 | User_Id_265 | Bike | 3 | Business | 30 | 2 | 12:39:25 PM | 30 | 2 | ... | NaN | -1.308284 | 36.843419 | -1.300921 | 36.828195 | Rider_Id_155 | 455.0 | 0.003047 | 0.022588 | 1.880079 |
3 | Order_No_9336 | User_Id_1402 | Bike | 3 | Business | 15 | 5 | 9:25:34 AM | 15 | 5 | ... | NaN | -1.281301 | 36.832396 | -1.257147 | 36.795063 | Rider_Id_855 | 1341.0 | 0.001717 | 0.061487 | 4.943458 |
4 | Order_No_27883 | User_Id_1737 | Bike | 1 | Personal | 13 | 1 | 9:55:18 AM | 13 | 1 | ... | NaN | -1.266597 | 36.792118 | -1.295041 | 36.809817 | Rider_Id_770 | 1214.0 | 0.001897 | 0.046143 | 3.724829 |
5 rows × 32 columns
Bearing: the direction of travel from the start point to the destination. Compass bearings are conventionally expressed in [0, 360) degrees, although np.arctan2 below returns values in (-180, 180].
#Bearing
def bearing_array(lat1, lng1, lat2, lng2):
lng_delta_rad = np.radians(lng2 - lng1)
lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
y = np.sin(lng_delta_rad) * np.cos(lat2)
x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(lng_delta_rad)
return np.degrees(np.arctan2(y, x))
sendy_data['bearing'] = bearing_array(sendy_data['Pickup Lat'].values, sendy_data['Pickup Long'].values,
sendy_data['Destination Lat'].values, sendy_data['Destination Long'].values)
sendy_data.head()
Order No | User Id | Vehicle Type | Platform Type | Personal or Business | Placement - Day of Month | Placement - Weekday (Mo = 1) | Placement - Time | Confirmation - Day of Month | Confirmation - Weekday (Mo = 1) | ... | Pickup Lat | Pickup Long | Destination Lat | Destination Long | Rider Id | Time from Pickup to Arrival | speed | manhattan_dist | haversine_dist | bearing | |
0 | Order_No_4211 | User_Id_633 | Bike | 3 | Business | 9 | 5 | 9:35:46 AM | 9 | 5 | ... | -1.317755 | 36.830370 | -1.300406 | 36.829741 | Rider_Id_432 | 745.0 | 0.002160 | 0.017978 | 1.930333 | -2.076903 |
1 | Order_No_25375 | User_Id_2285 | Bike | 3 | Personal | 12 | 5 | 11:16:16 AM | 12 | 5 | ... | -1.351453 | 36.899315 | -1.295004 | 36.814358 | Rider_Id_856 | 1993.0 | 0.001422 | 0.141406 | 11.339849 | -56.392163 |
2 | Order_No_1899 | User_Id_265 | Bike | 3 | Business | 30 | 2 | 12:39:25 PM | 30 | 2 | ... | -1.308284 | 36.843419 | -1.300921 | 36.828195 | Rider_Id_155 | 455.0 | 0.003047 | 0.022588 | 1.880079 | -64.183866 |
3 | Order_No_9336 | User_Id_1402 | Bike | 3 | Business | 15 | 5 | 9:25:34 AM | 15 | 5 | ... | -1.281301 | 36.832396 | -1.257147 | 36.795063 | Rider_Id_855 | 1341.0 | 0.001717 | 0.061487 | 4.943458 | -57.091553 |
4 | Order_No_27883 | User_Id_1737 | Bike | 1 | Personal | 13 | 1 | 9:55:18 AM | 13 | 1 | ... | -1.266597 | 36.792118 | -1.295041 | 36.809817 | Rider_Id_770 | 1214.0 | 0.001897 | 0.046143 | 3.724829 | 148.114398 |
5 rows × 33 columns
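The negative bearings in the table come from np.arctan2's (-180, 180] range; if a conventional [0, 360) compass bearing is preferred, a modulus shift fixes it (the sample values below are taken from the table above):

```python
import numpy as np

bearings = np.array([-2.076903, -56.392163, 148.114398])  # sample arctan2 outputs

# Shift (-180, 180] into the conventional [0, 360) compass range.
compass = (bearings + 360) % 360
```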
Center point: the midpoint of the pickup and destination coordinates (a simple average, which is a good approximation over short distances).
#Get center of lat and longitude
sendy_data['center_latitude'] = (sendy_data['Pickup Lat'].values + sendy_data['Destination Lat'].values) / 2
sendy_data['center_longitude'] = (sendy_data['Pickup Long'].values + sendy_data['Destination Long'].values) / 2
sendy_data.head()
Order No | User Id | Vehicle Type | Platform Type | Personal or Business | Placement - Day of Month | Placement - Weekday (Mo = 1) | Placement - Time | Confirmation - Day of Month | Confirmation - Weekday (Mo = 1) | ... | Destination Lat | Destination Long | Rider Id | Time from Pickup to Arrival | speed | manhattan_dist | haversine_dist | bearing | center_latitude | center_longitude | |
0 | Order_No_4211 | User_Id_633 | Bike | 3 | Business | 9 | 5 | 9:35:46 AM | 9 | 5 | ... | -1.300406 | 36.829741 | Rider_Id_432 | 745.0 | 0.002160 | 0.017978 | 1.930333 | -2.076903 | -1.309080 | 36.830056 |
1 | Order_No_25375 | User_Id_2285 | Bike | 3 | Personal | 12 | 5 | 11:16:16 AM | 12 | 5 | ... | -1.295004 | 36.814358 | Rider_Id_856 | 1993.0 | 0.001422 | 0.141406 | 11.339849 | -56.392163 | -1.323229 | 36.856837 |
2 | Order_No_1899 | User_Id_265 | Bike | 3 | Business | 30 | 2 | 12:39:25 PM | 30 | 2 | ... | -1.300921 | 36.828195 | Rider_Id_155 | 455.0 | 0.003047 | 0.022588 | 1.880079 | -64.183866 | -1.304603 | 36.835807 |
3 | Order_No_9336 | User_Id_1402 | Bike | 3 | Business | 15 | 5 | 9:25:34 AM | 15 | 5 | ... | -1.257147 | 36.795063 | Rider_Id_855 | 1341.0 | 0.001717 | 0.061487 | 4.943458 | -57.091553 | -1.269224 | 36.813730 |
4 | Order_No_27883 | User_Id_1737 | Bike | 1 | Personal | 13 | 1 | 9:55:18 AM | 13 | 1 | ... | -1.295041 | 36.809817 | Rider_Id_770 | 1214.0 | 0.001897 | 0.046143 | 3.724829 | 148.114398 | -1.280819 | 36.800968 |
5 rows × 35 columns
The datasist library offers more utilities along these lines.