目录

一、导入数据 

二、数据查看

可视化缺失值占比 

绘制所有变量的柱形图,查看数据

查看各特征与目标变量price的相关性

三、数据处理

 处理异常值

查看seller,offerType的取值

查看特征 notRepairedDamage 

 异常值截断

 填充缺失值 

 删除取值无变化的特征

查看目标变量price

对price做对数log变换 

 四、特征构造

构造新特征:计算某品牌的销售统计量 

  构造新特征:使用时间

对连续型特征数据进行分桶 

对数值型特征做归一化 

 匿名特征交叉

平均数编码 

五、特征筛选 

计算各列于交易价格的相关性 

对类别特征进行 OneEncoder 

切分特征和标签 

用lightgbm筛选特征 


免费的数据挖掘电子书_免费的数据挖掘电子书

 

一、导入数据 

import pandas as pd
import numpy as np
#coding:utf-8
#导入warnings包,利用过滤器来实现忽略警告语句。
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
#显示所有列
pd.set_option('display.max_columns',None)
# #显示所有行
# pd.set_option('display.max_rows',None)

Train_data = pd.read_csv("二手汽车价格预测/used_car_train_20200313.csv",sep=' ')
Test_data = pd.read_csv('二手汽车价格预测/used_car_testB_20200421.csv', sep=' ')
Train_data.shape,Test_data.shape#((150000, 31), (50000, 30))
Train_data.tail()
# Test_data.head()

二、数据查看

Train_data.info()

Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SaleID 150000 non-null int64 1 name 150000 non-null int64 2 regDate 150000 non-null int64 3 model 149999 non-null float64 4 brand 150000 non-null int64 5 bodyType 145494 non-null float64 6 fuelType 141320 non-null float64 7 gearbox 144019 non-null float64 8 power 150000 non-null int64 9 kilometer 150000 non-null float64 10 notRepairedDamage 150000 non-null object 11 regionCode 150000 non-null int64 12 seller 150000 non-null int64 13 offerType 150000 non-null int64 14 creatDate 150000 non-null int64 15 price 150000 non-null int64 16 v_0 150000 non-null float64 17 v_1 150000 non-null float64 18 v_2 150000 non-null float64 19 v_3 150000 non-null float64 20 v_4 150000 non-null float64 21 v_5 150000 non-null float64 22 v_6 150000 non-null float64 23 v_7 150000 non-null float64 24 v_8 150000 non-null float64 25 v_9 150000 non-null float64 26 v_10 150000 non-null float64 27 v_11 150000 non-null float64 28 v_12 150000 non-null float64 29 v_13 150000 non-null float64 30 v_14 150000 non-null float64 dtypes: float64(20), int64(10), object(1)

Train_data.duplicated().sum()#没有重复值
Train_data.isnull().sum()


SaleID 0 name 0 regDate 0 model 1 brand 0 bodyType 4506 fuelType 8680 gearbox 5981 power 0 kilometer 0 notRepairedDamage 0 regionCode 0 seller 0 offerType 0 creatDate 0 price 0 v_0 0 v_1 0 v_2 0 v_3 0 v_4 0 v_5 0 v_6 0 v_7 0 v_8 0 v_9 0 v_10 0 v_11 0 v_12 0 v_13 0 v_14 0 dtype: int64


bodyType , fuelType,gearbox,model,这几个特征存在缺失值。

可视化缺失值占比 

# nan可视化
missing = Train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()

免费的数据挖掘电子书_汽车_02

绘制所有变量的柱形图,查看数据

Train_data.hist(bins=50,figsize=(20,15))
plt.cla()  #清除axes

免费的数据挖掘电子书_人工智能_03

 图中可以看出,seller,offerType,creatDate这几个特征值分布不均匀,分别查看

查看各特征与目标变量price的相关性

#把字符串类型的变量、以及一些无关的变量去掉,获得需要的列名
numeric_columns=Train_data.select_dtypes(exclude='object').columns
columns=[col for col in numeric_columns if col not in ['SaleID', 'name']]
#根据列名提取数据
train_set=Train_data[columns]
#计算各列于交易价格的相关性
correlation=train_set.corr()
correlation['price'].sort_values(ascending = False)


price 1.000000 v_12 0.692823 v_8 0.685798 v_0 0.628397 regDate 0.611959 gearbox 0.329075 bodyType 0.241303 power 0.219834 fuelType 0.200536 v_5 0.164317 model 0.136983 v_2 0.085322 v_6 0.068970 v_1 0.060914 v_14 0.035911 regionCode 0.014036 creatDate 0.002955 seller -0.002004 v_13 -0.013993 brand -0.043799 v_7 -0.053024 v_4 -0.147085 v_9 -0.206205 v_10 -0.246175 v_11 -0.275320 kilometer -0.440519 v_3 -0.730946 offerType NaN Name: price, dtype: float64


f , ax = plt.subplots(figsize = (7, 7))

plt.title('Correlation of Numeric Features with Price',y=1,size=16)

sns.heatmap(correlation,square = True,  vmax=0.8)

免费的数据挖掘电子书_python_04

三、数据处理

 处理异常值

查看seller,offerType的取值

Train_data['seller'].value_counts()
#将seller其中的异常值1改为0
Train_data['seller'] = Train_data['seller'][Train_data['seller']==1]=0
Train_data['seller'].value_counts()


0 149999 1 1 Name: seller, dtype: int64


Train_data['offerType'].value_counts()


0 150000 Name: offerType, dtype: int64


可以看出,seller,offerType这两个特征的取值无变化,几乎倒向同一个值,可以删除。

查看特征 notRepairedDamage 

notRepairedDamage 中存在空缺值,但空缺值用“-”表示,所以数据查看发现不了空缺值,将“-”替换成NaN。

Train_data['notRepairedDamage'].value_counts()
Train_data['notRepairedDamage'].replace('-',np.nan,inplace = True)


0.0 111361 - 24324 1.0 14315 Name: notRepairedDamage, dtype: int64


Train_data['notRepairedDamage'].value_counts()


0.0 111361 1.0 14315 Name: notRepairedDamage, dtype: int64


 异常值截断

Train_data['power'].value_counts()


0 12829 75 9593 150 6495 60 6374 140 5963 ... 513 1 1993 1 19 1 751 1 549 1 Name: power, Length: 566, dtype: int64


 power在题目中要求范围

power

发动机功率:范围 [ 0, 600 ]

进行异常值截断 

#异常值截断
Train_data['power'][Train_data['power']>600]=600
Train_data['power'][Train_data['power']<1] = 1
Train_data['v_13'][Train_data['v_13']>6] = 6
Train_data['v_14'][Train_data['v_14']>4] = 4

 填充缺失值 

类别型特征用众数填充缺失值 

print(Train_data.bodyType.mode())
print(Train_data.fuelType.mode())
print(Train_data.gearbox.mode())
print(Train_data.model.mode())


#用众数填补空缺值
Train_data['bodyType']=Train_data['bodyType'].fillna(0)
Train_data['fuelType']=Train_data['fuelType'].fillna(0)
Train_data['gearbox']=Train_data['gearbox'].fillna(0)
Train_data['model']=Train_data['model'].fillna(0)

Train_data.isnull().sum()

 删除取值无变化的特征

'seller','offerType'

#删除取值没有变化的列
Train_data.head()
Train_data = Train_data.drop(['seller','offerType'],axis = 1)
Train_data.head()

查看目标变量price

# 查看目标变量的skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())#偏度
print("Kurtosis: %f" % Train_data['price'].kurt())#峰度

# Train_data.skew(), Train_data.kurt()


Skewness: 3.346487 Kurtosis: 18.995183


免费的数据挖掘电子书_python_05

## 查看目标变量的具体频数
## 绘制标签的统计图,查看标签分布
plt.hist(Train_data['price'], orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()

 

免费的数据挖掘电子书_数据挖掘_06

对price的长尾数据进行截取,做对数log变换 

np.log1p ( ) 
数据预处理时首先可以对偏度比较大的数据用log1p函数进行转化,使其更加服从高斯分布,此步处理可能会使我们后续的分类结果得到一个好的结果.

# 目标变量进行对数变换服从正态分布


Train_data['price'] = np.log1p(Train_data['price'])
plt.hist(Train_data['price'], orientation = 'vertical',histtype = 'bar', color ='red') 
plt.show()

sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())#偏度
print("Kurtosis: %f" % Train_data['price'].kurt())#峰度

免费的数据挖掘电子书_人工智能_07


Skewness: -0.261727 Kurtosis: -0.182127


免费的数据挖掘电子书_人工智能_08

 四、特征构造

4.1、构造新特征:计算某品牌的销售统计量 

# 计算某品牌的销售统计量
Train_gb = Train_data.groupby("brand")
all_info = {}
for kind, kind_data in Train_gb:
    info = {}
    kind_data = kind_data[kind_data['price'] > 0]
    info['brand_amount'] = len(kind_data)
    info['brand_price_max'] = kind_data.price.max()
    info['brand_price_median'] = kind_data.price.median()
    info['brand_price_min'] = kind_data.price.min()
    info['brand_price_sum'] = kind_data.price.sum()
    info['brand_price_std'] = kind_data.price.std()
    info['brand_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
    all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "brand"})
Train_data = Train_data.merge(brand_fe, how='left', on='brand')

 4.2、构造新特征:使用时间

一般来说汽车价格与使用时间成反比

# 使用时间:
Train_data['creatDate'] - Train_data['regDate']#一般来说汽车价格与使用时间成反比
# 数据里有时间出错的格式,errors='coerce',遇到不能转换的数据赋值为nan
Train_data['used_time'] = (pd.to_datetime(Train_data['creatDate'], format='%Y%m%d', errors='coerce') - 
                            pd.to_datetime(Train_data['regDate'], format='%Y%m%d', errors='coerce')).dt.days

Train_data['used_time'].isnull().sum()
Train_data['used_time'].mean()#4432.082407160321

#用平均数或众数填充缺失值
Train_data['used_time'].fillna(4432,inplace = True)
Train_data['used_time'].isnull().sum()

4.3、对连续型特征数据进行分桶 

#对连续型数据进行分桶
#对power进行分桶
bin = [i*10 for i in range(31)]#分成30个桶
Train_data['power_bin'] = pd.cut(Train_data['power'], bin, labels=False)
Train_data[['power_bin', 'power']].head()

 kilometer已经分桶了

plt.hist(Train_data['kilometer'])
# 删除不需要的数据
Train_data = Train_data.drop(['name','SaleID', 'regionCode'], axis=1)
Train_data.head()
  • 目前的数据其实已经可以给树模型使用了,所以我们导出一下

Train_data.to_csv('data_for_tree.csv', index=0)

4.5、对数值型特征做归一化 

# 我们可以再构造一份特征给 LR NN 之类的模型用
# 之所以分开构造是因为,不同模型对数据集的要求不同
# 我们看下数据分布:

Train_data['power'].plot.hist()

 

免费的数据挖掘电子书_人工智能_09

# 我们对其取 log,在做归一化
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
Train_data['power'] = np.log1p(Train_data['power'] + 1) 
Train_data['power'] = Train_data['power'] = max_min(Train_data['power'])
Train_data['power'].plot.hist()

免费的数据挖掘电子书_免费的数据挖掘电子书_10

# kilometer做过分桶处理了,所以我们可以直接做归一化
Train_data['kilometer'] =  max_min(Train_data['kilometer'])
Train_data['kilometer'].plot.hist()

免费的数据挖掘电子书_数据挖掘_11

# 对之前构造的以下特征进行归一化
# 'brand_amount', 'brand_price_average', 'brand_price_max',
# 'brand_price_median', 'brand_price_min', 'brand_price_std',
# 'brand_price_sum'
# 这里不再一一举例分析了,直接做变换,
def max_min(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

# Train_data['brand_amount'] = max_min(Train_data['brand_amount'])
Train_data['brand_price_average'] =  max_min(Train_data['brand_price_average'] )
Train_data['brand_price_max'] =  max_min(Train_data['brand_price_max'])
Train_data['brand_price_median'] =  max_min(Train_data['brand_price_max'])
Train_data['brand_price_min'] =  max_min(Train_data['brand_price_min'])
Train_data['brand_price_std'] =  max_min(Train_data['brand_price_std'])
Train_data['brand_price_sum'] =  max_min(Train_data['brand_price_sum'] )
Train_data.head()

 4.6、匿名特征交叉

#匿名特征交叉
num_cols = [0,2,3,6,8,10,12,14]
for index, value in enumerate(num_cols):
    for j in num_cols[index+1:]:
        Train_data['new'+str(value)+'*'+str(j)]=Train_data['v_'+str(value)]*Train_data['v_'+str(j)]
        Train_data['new'+str(value)+'+'+str(j)]=Train_data['v_'+str(value)]+Train_data['v_'+str(j)]
        Train_data['new'+str(value)+'-'+str(j)]=Train_data['v_'+str(value)]-Train_data['v_'+str(j)]
num_cols1 = [3,5,1,11]

for index, value in enumerate(num_cols1):
    for j in num_cols1[index+1:]:
        Train_data['new'+str(value)+'-'+str(j)]=Train_data['v_'+str(value)]-Train_data['v_'+str(j)]
 

for i in range(15):
    Train_data['new'+str(i)+'*year']=Train_data['v_'+str(i)] * Train_data['used_time']
# 这份数据可以给 LR 用
Train_data.to_csv('Train_data_for_lr.csv', index=0)
Train_data.head()

五、特征筛选 

5.1、查看各列于交易价格的相关性

correlation=Train_data.corr()
x=correlation['price'].sort_values(ascending = False)
y = np.abs(x)>=0.01

5.2、对类别特征进行 OneEncoder 

data = pd.get_dummies(data, columns=['model', 'brand', 'bodyType', 'fuelType',
                                     'gearbox', 'notRepairedDamage', 'power_bin'])
print(data.shape)
data.columns


(200000, 364)


Index(['SaleID', 'name', 'regDate', 'power', 'kilometer', 'regionCode', 'creatDate', 'price', 'v_0', 'v_1', ... 'power_bin_20.0', 'power_bin_21.0', 'power_bin_22.0', 'power_bin_23.0', 'power_bin_24.0', 'power_bin_25.0', 'power_bin_26.0', 'power_bin_27.0', 'power_bin_28.0', 'power_bin_29.0'], dtype='object', length=364)


5.3、切分特征和标签 

#切分特征和标签
train_set=Train_data.copy()
y_train=train_set['price']

x_train=train_set.drop(['price','regDate','creatDate'],axis = 1
                      )
x_train.head()

用lightgbm筛选特征 

import lightgbm as lgb
from sklearn.model_selection import train_test_split 
im
from sklearn.metrics import mean_squared_error as MSE


features = pd.get_dummies(x_train)
feature_names = list(features.columns)
features = np.array(features)
labels = np.array(y_train).reshape((-1, ))
feature_importance_values = np.zeros(len(feature_names))
task='regression'
early_stopping=True
eval_metric= 'l2'
n_iterations=10

for _ in range(n_iterations):
    if task == 'classification':
        model = lgb.LGBMClassifier(n_estimators=1000, learning_rate = 0.05, verbose = -1)
    if task =='regression':
        model = lgb.LGBMRegressor(n_estimators=1000, learning_rate = 0.05, verbose = -1)
    else:
        raise ValueError('Task must be either "classification" or "regression"')
    #提前终止训练,需要验证集
    if early_stopping:
        train_features, valid_features, train_labels, valid_labels = train_test_split(features, labels, test_size = 0.15)
  # Train the model with early stopping
        model.fit(train_features, train_labels, eval_metric = eval_metric,eval_set = [(valid_features, valid_labels)],early_stopping_rounds = 100, verbose = -1)
        gc.enable()
        del train_features, train_labels, valid_features, valid_labels
        gc.collect()
  
    else:
        model.fit(features, labels)
  # Record the feature importances
    feature_importance_values += model.feature_importances_ / n_iterations
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})
#按照重要性大小对特征进行排序
feature_importances = feature_importances.sort_values('importance', ascending = False).reset_index(drop = True)

#计算特征的相对重要性,全部特征的相对重要性之和为1
feature_importances['normalized_importance'] = feature_importances['importance'] / feature_importances['importance'].sum()

#计算特征的累计重要性
#cutsum :返回给定 axis 上的累计和
feature_importances['cumulative_importance'] = np.cumsum(feature_importances['normalized_importance'])

#选取累计重要性大于0.99的特征,这些特征将会被删除掉。
drop_columns=list(feature_importances.query('cumulative_importance>0.99')['feature'])
#去掉重要度低的列
x_set=x_train.copy()
x_set.drop(drop_columns,axis=1,inplace=True)

#对数据集总体概览
#显示所有行
pd.set_option("display.max_info_columns", 300)   # 设置info中信息显示数量为200
x_set.info()

六、建模调参

# 构建模型拟合的评价指标
from sklearn.metrics import mean_squared_error,mean_absolute_error

def model_goodness(model,x,y):
    prediction=model.predict(x)
    mae=mean_absolute_error(y,prediction)
    mse=mean_squared_error(y,prediction)
    rmse=np.sqrt(mse)

    print('MAE:',mae)#绝对平均误差
    print('MSE:',mse)#均方差
    print('RMSE:',rmse)#均方根
# 定义模型泛化能力的指标计算函数:
from sklearn.model_selection import cross_val_score
def display_scores(scores):
        print("Scores:", scores)    
        print("Mean:", scores.mean())
        print("Standard deviation:", scores.std())
#先用简单线性回归模型拟合
from sklearn.linear_model import LinearRegression
lin_reg=LinearRegression()
lin_reg.fit(x_set,y_train)

model_goodness(lin_reg,x_set,y_train)
'''MAE: 0.17541397968387218
MSE: 0.07846792179703589
RMSE: 0.28012126266500353'''

随机森林 

from sklearn.ensemble import RandomForestRegressor
forest_reg=RandomForestRegressor()0

forest_reg.fit(x_set,y_train)

model_goodness(forest_reg,x_set,y_train)


# 采用10折交叉验证的方法来验证模型的泛化能力
scores=cross_val_score(forest_reg,x_set,y_train,scoring='neg_mean_absolute_error',cv=10)
mae_scores=np.abs(-scores)
display_scores(mae_scores)

''MAE: 0.047468466346616035
MSE: 0.008013848284210116
RMSE: 0.08952009988941095'''

存在过拟合,

'''Scores: [0.1294032  0.12707153 0.12940989 0.12829302 0.13042102 0.1285104
 0.12762524 0.12703461 0.1289176  0.12968754]
Mean: 0.12863740448866307
Standard deviation: 0.0010828607409916612'''

 GBDT 

# GBDT
from sklearn.ensemble import GradientBoostingRegressor
gbrt=GradientBoostingRegressor()
gbrt.fit(x_set,y_train)

model_goodness(gbrt,x_set,y_train)


scores=cross_val_score(gbrt,x_set,y_train,scoring='neg_mean_absolute_error',cv=10)
mae_scores=np.abs(scores)
display_scores(mae_scores)

MAE: 0.1579591089700307
MSE: 0.06534997589709124
RMSE: 0.2556364134803398

Scores: [0.16032467 0.15964983 0.16159922 0.15899314 0.16286916 0.16034439
 0.15793287 0.1580428  0.15949101 0.16185252]
Mean: 0.16010996168246888
Standard deviation: 0.0015434916175588425

XGBoost

# XGBoost
import lightgbm as lgb
import xgboost as xgb
xgb_reg= xgb.XGBRegressor()
xgb_reg.fit(x_set,y_train)

model_goodness(xgb_reg,x_set,y_train)
scores=cross_val_score(xgb_reg,x_set,y_train,scoring='neg_mean_absolute_error',cv=10)
mae_scores=np.abs(scores)
display_scores(mae_scores)

'''
MAE: 0.11684430449593118
MSE: 0.03652492452344296
RMSE: 0.1911149510724971
Scores: [0.13500033 0.1333282  0.13477914 0.13414655 0.1365417  0.13534464
 0.13483075 0.13339024 0.1352027  0.13584453]
Mean: 0.1348408781266727
Standard deviation: 0.000958580534103817''' 

LightGBM 

#LightGBM
lgb_reg=lgb.LGBMRegressor()
lgb_reg.fit(x_set,y_train)


model_goodness(lgb_reg,x_set,y_train)

scores=cross_val_score(lgb_reg,x_set,y_train,scoring='neg_mean_absolute_error',cv=10)
mae_scores=np.abs(scores)
display_scores(mae_scores)

'''
MAE: 0.1307250662409778
MSE: 0.049472769306324126
RMSE: 0.22242474976118132
Scores: [0.13610695 0.13486826 0.13710767 0.13597915 0.13788547 0.13687976
 0.13471174 0.13481778 0.13525209 0.13684043]
Mean: 0.13604493148788416
Standard deviation: 0.0010560012820324028'''

还缺个模型调参和模型融合,回头补

调参

1.利用随机搜索对随机森林模型进行调优

利用sklearn.model_selection模块中的RandomizedSearchCV来进行随机搜索,搜索的超参数包括bootstrap,最大特征数max_features,树的最大深度max_depth,n_estimators。

from sklearn.model_selection import RandomizedSearchCV

#2.设置参数空间
from hyperopt import hp
space_forest = {
    'bootstrap':[True,False],
    'max_features':list(range(0,25,1)),
    'max_depth': list(range(0, 100, 1)),
    'n_estimators': list(range(30, 150, 1))
}
#随机搜索,利用5折交叉验证得分来作为模型优劣的判断标准
forest_reg=RandomForestRegressor()
random_search=RandomizedSearchCV(forest_reg, space_forest,cv=5,scoring='neg_mean_squared_error')

#得到最优参数
random_search.best_params_

2.利用贝叶斯方法对LightBoost进行调优
python中的hypreopt包可以进行贝叶斯方法的调优,这篇文章里Python 环境下的自动化机器学习超参数调优,有详细的介绍。 

# 贝叶斯方法对LightBoost进行调优

#2.定义参数空间
from hyperopt import hp
space = {
    'num_leaves': hp.quniform('num_leaves', 30, 150, 1),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),
    'subsample_for_bin': hp.quniform('subsample_for_bin', 20000, 300000, 20000),
    'max_depth': hp.quniform('max_depth', 0, 100, 1),
    'n_estimators': hp.quniform('n_estimators', 30, 150, 1)
}

#定义优化函数,即为5折交叉验证的得分
from sklearn.model_selection import cross_val_score
def objective(params, n_folds=5):
    num_leaf=int(params['num_leaves'])
    estimator=int(params['n_estimators'])
    rate=params['learning_rate']
    sub_for_bin=int(params['subsample_for_bin'])
    max_dep=int(params['max_depth'])
    lgb_reg=lgb.LGBMRegressor(num_leaves=num_leaf,n_estimators = estimator,learning_rate=rate,subsample_for_bin=sub_for_bin,max_depth=max_dep)
    lgb_reg.fit(x_set,y_train)
    scores=cross_val_score(lgb_reg,x_set,y_train,scoring='neg_mean_absolute_error',cv=5)
    mae_scores=np.abs(scores)
    loss=mae_scores.mean()
    return loss

#寻找到使优化函数最小超参数组合,利用hyperopt中的fmin来求最小化
from hyperopt import Trials,fmin,tpe
best = fmin(fn = objective, space = space, algo = tpe.suggest, max_evals = 500)