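Kaggle Bike Sharing Demand: A Case Study

Bike-sharing systems are a way of renting bicycles: people can pick up a bike at one location and return it when they reach their destination. These systems explicitly record trip duration, departure location, arrival location, and time, so the data can be used to study mobility within a city. This project asks us to combine historical usage patterns with weather data to forecast bike rental demand in Washington, D.C.

The data contains hourly rental records spanning two years, together with weather and date information. The training set consists of the first 19 days of each month; the test set covers the 20th through the end of each month.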
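The split rule described above amounts to a simple check on the day of the month. The sketch below is illustrative only; `in_training_period` is a hypothetical helper and not part of the competition files or the notebook.

```python
from datetime import datetime

def in_training_period(timestamp: str) -> bool:
    """Return True if the timestamp falls on days 1-19 of its month."""
    day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").day
    return day <= 19

print(in_training_period("2011-01-19 23:00:00"))  # last training hour of the month
print(in_training_period("2011-01-20 00:00:00"))  # first test hour of the month
```

One consequence of this layout is that every test period is surrounded by training data from the same month, so month and year are usable features.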

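Variables:

  • datetime - year, month, day, and hour of the day
  • season - 1 = spring, 2 = summer, 3 = autumn, 4 = winter
  • holiday - whether the day is a holiday
  • workingday - whether the day is a working day
  • weather - weather grade: 1. clear, few clouds, partly cloudy; 2. mist + cloudy, mist + broken clouds, mist + few clouds, mist; 3. light snow, light rain + thunderstorm + scattered clouds, light rain + clouds; 4. heavy rain + hail + thunderstorm + mist, snow + fog
  • temp - temperature
  • atemp - "feels like" temperature
  • humidity - relative humidity
  • windspeed - wind speed
  • casual - number of rentals by non-registered (casual) users
  • registered - number of rentals by registered users
  • count - total number of rentals

Data exploration

  • Check for missing values
  • Check for outliers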
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from datetime import datetime

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
data_train = pd.read_csv('./data/train.csv')
data_test  = pd.read_csv('./data/test.csv')

data_train.info()
print('-'*40)
data_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime      10886 non-null object
season        10886 non-null int64
holiday       10886 non-null int64
workingday    10886 non-null int64
weather       10886 non-null int64
temp          10886 non-null float64
atemp         10886 non-null float64
humidity      10886 non-null int64
windspeed     10886 non-null float64
casual        10886 non-null int64
registered    10886 non-null int64
count         10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 9 columns):
datetime      6493 non-null object
season        6493 non-null int64
holiday       6493 non-null int64
workingday    6493 non-null int64
weather       6493 non-null int64
temp          6493 non-null float64
atemp         6493 non-null float64
humidity      6493 non-null int64
windspeed     6493 non-null float64
dtypes: float64(3), int64(5), object(1)
memory usage: 456.6+ KB

The data has no missing values.

data_train.head()



              datetime  season  holiday  workingday  weather  temp   atemp  humidity  windspeed  casual  registered  count
0  2011-01-01 00:00:00       1        0           0        1  9.84  14.395        81        0.0       3          13     16
1  2011-01-01 01:00:00       1        0           0        1  9.02  13.635        80        0.0       8          32     40
2  2011-01-01 02:00:00       1        0           0        1  9.02  13.635        80        0.0       5          27     32
3  2011-01-01 03:00:00       1        0           0        1  9.84  14.395        75        0.0       3          10     13
4  2011-01-01 04:00:00       1        0           0        1  9.84  14.395        75        0.0       0           1      1

data_test.head()



              datetime  season  holiday  workingday  weather   temp   atemp  humidity  windspeed
0  2011-01-20 00:00:00       1        0           1        1  10.66  11.365        56    26.0027
1  2011-01-20 01:00:00       1        0           1        1  10.66  13.635        56     0.0000
2  2011-01-20 02:00:00       1        0           1        1  10.66  13.635        56     0.0000
3  2011-01-20 03:00:00       1        0           1        1  10.66  12.880        56    11.0014
4  2011-01-20 04:00:00       1        0           1        1  10.66  12.880        56    11.0014

# Summary statistics
data_train.describe().T



              count        mean         std   min      25%      50%       75%       max
season      10886.0    2.506614    1.116174  1.00   2.0000    3.000    4.0000    4.0000
holiday     10886.0    0.028569    0.166599  0.00   0.0000    0.000    0.0000    1.0000
workingday  10886.0    0.680875    0.466159  0.00   0.0000    1.000    1.0000    1.0000
weather     10886.0    1.418427    0.633839  1.00   1.0000    1.000    2.0000    4.0000
temp        10886.0   20.230860    7.791590  0.82  13.9400   20.500   26.2400   41.0000
atemp       10886.0   23.655084    8.474601  0.76  16.6650   24.240   31.0600   45.4550
humidity    10886.0   61.886460   19.245033  0.00  47.0000   62.000   77.0000  100.0000
windspeed   10886.0   12.799395    8.164537  0.00   7.0015   12.998   16.9979   56.9969
casual      10886.0   36.021955   49.960477  0.00   4.0000   17.000   49.0000  367.0000
registered  10886.0  155.552177  151.039033  0.00  36.0000  118.000  222.0000  886.0000
count       10886.0  191.574132  181.144454  1.00  42.0000  145.000  284.0000  977.0000

Outlier check

  • count
  • casual
  • registered
# Check whether the distributions look Gaussian
fig,axes = plt.subplots(1,3)
# Set the figure size in inches (1 inch = 2.54 cm)
fig.set_size_inches(18,5)

sns.distplot(data_train['count'],bins=100,ax=axes[0])
sns.distplot(data_train['casual'],bins=100,ax=axes[1])
sns.distplot(data_train['registered'],bins=100,ax=axes[2])

data_train[['count','casual','registered']].describe().T



              count        mean         std  min   25%    50%    75%    max
count       10886.0  191.574132  181.144454  1.0  42.0  145.0  284.0  977.0
casual      10886.0   36.021955   49.960477  0.0   4.0   17.0   49.0  367.0
registered  10886.0  155.552177  151.039033  0.0  36.0  118.0  222.0  886.0

fig,axes = plt.subplots(1,3)
fig.set_size_inches(12,6)

sns.boxplot(data = data_train['count'],ax=axes[0])
axes[0].set(xlabel='count')
sns.boxplot(data = data_train['casual'], ax=axes[1])
axes[1].set(xlabel='casual')
sns.boxplot(data = data_train['registered'], ax=axes[2])
axes[2].set(xlabel='registered')

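count: the mean is 191 and the standard deviation is 181; the median is 145, the 75th percentile is 284, and the maximum is 977, so the distribution has a long right tail. We therefore remove the outliers, apply a log transform, and look at the result.

Note that count = casual + registered.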
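As a worked example of the 3σ rule applied in this section, the cutoff for count can be computed directly from the summary statistics printed earlier. The lower limit comes out negative, and rentals cannot be negative, so in practice only the right tail is trimmed. `three_sigma_bounds` is a hypothetical helper name used for illustration.

```python
import math

def three_sigma_bounds(mean: float, std: float) -> tuple:
    """Return the (lower, upper) limits of the mean ± 3·std band."""
    return mean - 3 * std, mean + 3 * std

# Mean and standard deviation of count, taken from the describe() output
lower, upper = three_sigma_bounds(191.574132, 181.144454)
print(round(lower, 2), round(upper, 2))

# log1p(x) = log(1 + x) stays finite at x = 0, unlike log(x)
print(round(math.log1p(16), 6))  # count = 16 in the first training row
```

The second print reproduces the count_log value that appears for the first row once the log column has been added.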

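# Remove outliers: drop rows whose value lies 3σ or more away from the mean
def drop_outlier(data,col):
    mask = np.abs(data[col]-data[col].mean())<(3*data[col].std())
    # .copy() so the new column is added to an independent frame, not a view
    data = data.loc[mask].copy()
    # Plot col and log1p(col) after the outliers have been dropped
    data[col+'_log'] = np.log1p(data[col])
    f, [ax1, ax2] = plt.subplots(1,2, figsize=(15,6))

    sns.distplot(data[col], ax=ax1)
    ax1.set_title(col+' distribution')

    sns.distplot(data[col+'_log'], ax=ax2)
    ax2.set_title(col+'_log distribution')
    return data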
data_train = drop_outlier(data_train,'count')

data_train = drop_outlier(data_train,'casual')

data_train = drop_outlier(data_train,'registered')

Feature decomposition

Split the datetime feature into date, day of week, year, month, day, and hour.

def split_datetime(data):
    data['date'] = data['datetime'].apply(lambda x:x.split()[0])
    data['weekday'] =data['date'].apply(lambda x:datetime.strptime(x,'%Y-%m-%d').isoweekday())
    data['year'] = data['date'].apply(lambda x:x.split('-')[0]).astype('int')
    data['month'] = data['date'].apply(lambda x:x.split('-')[1]).astype('int')
    data['day'] = data['date'].apply(lambda x:x.split('-')[2]).astype('int')
    data['hour'] = data['datetime'].apply(lambda x:x.split()[1].split(':')[0]).astype('int')
    return data
data_train = split_datetime(data_train)
data_train.head()



              datetime  season  holiday  workingday  weather  temp   atemp  humidity  windspeed  casual  registered  count  count_log        date  weekday  year  month  day  hour
0  2011-01-01 00:00:00       1        0           0        1  9.84  14.395        81        0.0       3          13     16   2.833213  2011-01-01        6  2011      1    1     0
1  2011-01-01 01:00:00       1        0           0        1  9.02  13.635        80        0.0       8          32     40   3.713572  2011-01-01        6  2011      1    1     1
2  2011-01-01 02:00:00       1        0           0        1  9.02  13.635        80        0.0       5          27     32   3.496508  2011-01-01        6  2011      1    1     2
3  2011-01-01 03:00:00       1        0           0        1  9.84  14.395        75        0.0       3          10     13   2.639057  2011-01-01        6  2011      1    1     3
4  2011-01-01 04:00:00       1        0           0        1  9.84  14.395        75        0.0       0           1      1   0.693147  2011-01-01        6  2011      1    1     4
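The datetime split can be cross-checked with the standard library alone. This is a minimal sketch rather than a replacement for `split_datetime`; `split_timestamp` is a hypothetical helper that parses the string once instead of slicing it several times.

```python
from datetime import datetime

def split_timestamp(timestamp: str) -> dict:
    """Break a 'YYYY-MM-DD HH:MM:SS' string into calendar features."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "date": parsed.strftime("%Y-%m-%d"),
        "weekday": parsed.isoweekday(),  # Monday = 1 ... Sunday = 7
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "hour": parsed.hour,
    }

print(split_timestamp("2011-01-01 00:00:00"))
```

The result for the first training row matches the table above: 1 January 2011 was a Saturday, so weekday is 6.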

Visual analysis

  • Distributions of the numeric features
  • Box plots of the categorical features
data_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10739 entries, 0 to 10885
Data columns (total 19 columns):
datetime      10739 non-null object
season        10739 non-null int64
holiday       10739 non-null int64
workingday    10739 non-null int64
weather       10739 non-null int64
temp          10739 non-null float64
atemp         10739 non-null float64
humidity      10739 non-null int64
windspeed     10739 non-null float64
casual        10739 non-null int64
registered    10739 non-null int64
count         10739 non-null int64
count_log     10739 non-null float64
date          10739 non-null object
weekday       10739 non-null int64
year          10739 non-null int64
month         10739 non-null int64
day           10739 non-null int64
hour          10739 non-null int64
dtypes: float64(4), int64(13), object(2)
memory usage: 2.0+ MB
fig,axes = plt.subplots(2,2)
fig.set_size_inches(16,14)

sns.distplot(data_train['temp'],bins=60,ax=axes[0,0])
sns.distplot(data_train['atemp'],bins=60,ax=axes[0,1])
sns.distplot(data_train['humidity'],bins=60,ax=axes[1,0])
sns.distplot(data_train['windspeed'],bins=60,ax=axes[1,1])

fig,axes = plt.subplots(2,2)
fig.set_size_inches(15,12)

sns.boxplot(x='season', y='count', data = data_train, orient='v', width=0.6, ax=axes[0,0])
sns.boxplot(x='holiday', y='count', data = data_train, orient='v', width=0.6, ax=axes[0,1])
sns.boxplot(x='workingday', y='count', data = data_train, orient='v', width=0.6, ax=axes[1,0])
sns.boxplot(x='weather',y='count',data=data_train,orient='v',width=0.6,ax=axes[1,1])

data_train['windspeed'].describe()
count    10739.000000
mean        12.787706
std          8.171075
min          0.000000
25%          7.001500
50%         12.998000
75%         16.997900
max         56.996900
Name: windspeed, dtype: float64
data_train.boxplot(['windspeed'])

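The plots above show that many records have a wind speed of 0. These are probably missing values that were recorded as 0, so here we use a random forest to impute the zero wind-speed entries.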

np.sum(data_train['windspeed'] == 0),data_train['windspeed'].shape[0]
(1253, 10264)
# Impute wind speed with a random forest
from sklearn.ensemble import RandomForestRegressor

def RFG_windspeed(data):
    # Split the data into rows where windspeed is 0 and rows where it is not
    mask = data['windspeed'] == 0
    wind_0 = data[mask].copy()  # copy so the assignment below does not hit a view
    wind_1 = data[~mask]
    
    if len(wind_0.index)==0:
        return data

    Model_wind = RandomForestRegressor(n_estimators=1000,random_state=42)

    # Features used to predict wind speed
    cols = ["season","weather","humidity","month","temp","year","atemp"]
    windspeed_X = wind_1[cols]
    # Target: the observed non-zero wind speed
    windspeed_y = wind_1['windspeed']

    windspeedpre_X = wind_0[cols]

    Model_wind.fit(windspeed_X,windspeed_y)

    # Predict wind speed for the zero rows
    wind_0Values = Model_wind.predict(X=windspeedpre_X)

    # Fill in the predictions; pd.concat is used because DataFrame.append was removed in pandas 2.0
    wind_0.loc[:,'windspeed'] = wind_0Values
    data = pd.concat([wind_1, wind_0]).reset_index()
    data.drop('index',inplace=True,axis=1)
    return data
data_train['windspeed'].head()
0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: windspeed, dtype: float64
data_train = RFG_windspeed(data_train)
data_train['windspeed'].head()
0     6.0032
1    16.9979
2    19.0012
3    19.0012
4    19.9995
Name: windspeed, dtype: float64

Now look again at the density distributions of these four features.

fig,axes = plt.subplots(2,2)
fig.set_size_inches(16,14)

sns.distplot(data_train['temp'],ax=axes[0,0])
axes[0,0].set(xlabel='temp')

sns.distplot(data_train['atemp'],ax=axes[0,1])
axes[0,1].set(xlabel='atemp')

sns.distplot(data_train['humidity'],ax=axes[1,0])
axes[1,0].set(xlabel='humidity')

sns.distplot(data_train['windspeed'],ax=axes[1,1])
axes[1,1].set(xlabel='windspeed')
[Text(0.5,0,'windspeed')]

Next, an overall look at how the three rental counts relate to the other features.

# seaborn pair plot of the relationships
cols =['season','holiday','workingday','weekday','weather','temp',
       'atemp','humidity','windspeed','hour']

sns.pairplot(data_train ,x_vars=cols,
             y_vars=['casual','registered','count'], 
             plot_kws={'alpha': 0.2})

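  • season - 1 = spring, 2 = summer, 3 = autumn, 4 = winter
  • holiday - holiday flag
  • workingday - working-day flag
  • weather - weather grade
  • temp - temperature
  • atemp - "feels like" temperature
  • humidity - relative humidity
  • windspeed - wind speed
  • casual - rentals by non-registered (casual) users
  • registered - rentals by registered users
  • count - total rentals

We can observe that:

  1. Rentals are lower overall in season 1 (spring)
  2. Total rentals on non-holidays are higher than on holidays
  3. Registered users ride more on working days and less on holidays, while casual users show the opposite pattern
  4. Rentals fall as the weather grade rises
  5. Temperature and humidity affect casual users strongly and registered users less
  6. The hour of day has a clear effect: registered users show two peaks, while casual users show a single bell-shaped peak

Correlation of each feature with count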

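# Restrict to numeric columns; datetime and date are strings and cannot be correlated
corr = data_train.select_dtypes(include='number').corr()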
plt.subplots(figsize=(14,14))
sns.heatmap(corr,annot=True,vmax=1,cmap='YlGnBu')

# Sort by absolute correlation, descending
np.abs(corr['count']).sort_values(ascending=False)
count             1.000000
registered        0.977642
count_log         0.845294
registered_log    0.839080
casual_log        0.758863
casual            0.711105
hour              0.442659
temp              0.376656
atemp             0.372705
humidity          0.307362
month             0.176115
season            0.173657
year              0.171519
weather           0.123578
windspeed         0.116767
workingday        0.046246
weekday           0.022100
day               0.015243
holiday           0.002421
Name: count, dtype: float64

Ranked by absolute correlation with count, the features are:
hour > temp > atemp > humidity > month > season > year
> weather > windspeed > workingday > weekday > day > holiday
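One caveat on this ranking: the Pearson coefficient only measures linear association, so a feature with a strong but non-monotonic effect (hour, with its commute peaks, is a likely candidate) can score lower than its real influence. The toy data below is made up purely to illustrate the point, and `pearson` is a hypothetical helper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [0, 1, 2, 3, 4, 5, 6]
linear = [2 * x + 1 for x in xs]       # perfectly linear in x
u_shaped = [(x - 3) ** 2 for x in xs]  # fully determined by x, but symmetric

print(pearson(xs, linear))    # perfect linear relationship
print(pearson(xs, u_shaped))  # no linear trend despite full dependence
```

A tree-based model such as the random forest used later can still exploit features of the second kind, so a low correlation alone is not a reason to drop a feature.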

hour and count

# Overall trend by hour
date = data_train.groupby(['hour'], as_index=False).agg({'count':'mean',
                                                   'registered':'mean',  
                                                   'casual':'mean'})
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)
# Total rentals
plt.plot(date['hour'], date['count'], linewidth=1.3, label='count')
# Registered users
plt.plot(date['hour'], date['registered'], linewidth=1.3, label='registered')
# Casual users
plt.plot(date['hour'], date['casual'], linewidth=1.3, label='casual')
plt.legend()

# Relationship between hour and count on working days and non-working days
date = data_train.groupby(['workingday','hour'], as_index=False).agg({'count':'mean',
                                                   'registered':'mean',  
                                                   'casual':'mean'})

mask = date['workingday'] == 1

workingday_date= date[mask].drop(['workingday','hour'],axis=1).reset_index(drop=True)
nworkingday_date = date[~mask].drop(['workingday','hour'],axis=1).reset_index(drop=True)

fig, axes = plt.subplots(1,2,sharey = True)
workingday_date.plot(figsize=(15,5),title ='working day',ax=axes[0])
axes[0].set(xlabel='hour')
nworkingday_date.plot(figsize=(15,5),title ='nonworkdays',ax=axes[1])
axes[1].set(xlabel='hour')

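We can see that:

  • Working days
  1. Registered users have two peaks, at the morning and evening commute, plus a smaller peak at midday, possibly people going out for lunch.
  2. Casual usage varies more gently and peaks at around 17:00.
  3. Registered users rent far more bikes than casual users.
  • Non-working days
  1. Rentals (count) follow a roughly bell-shaped curve over the day, peaking at around 12:00 and bottoming out at around 04:00, with a fairly smooth shape.

Temperature and count

Overall temperature trend over the two years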

# Aggregate by day, taking the median temperature of each day
temp_df = data_train.groupby(['date','weekday'],as_index=False).agg({'year':'mean',
                                                                     'month':'mean',
                                                                     'temp':'median'})
# Drop missing data
# temp_df.dropna (axis=0,how ='any',inplace=True)

# Daily values still fluctuate a lot, so also take the monthly median of the daily values
temp_month = temp_df.groupby(['year','month'],as_index=False).agg({'weekday':'min',
                                                                    'temp':'median'})

# Convert the date column of the daily data to datetime
temp_df['date']=pd.to_datetime(temp_df['date'])

# Build a date column for the monthly data (the minimum weekday is reused as a placeholder day of month)
temp_month.rename(columns={'weekday':'day'},inplace=True)
temp_month['date']=pd.to_datetime(temp_month[['year','month','day']])


# Set the figure size
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)

# Line plot of temperature over time
plt.plot(temp_df['date'] , temp_df['temp'], linewidth=1.3, label='daily median')
ax.set_title('Daily temperature trend over the two years')
plt.plot(temp_month['date'] , temp_month['temp'], marker='o',
         linewidth=1.3,label='monthly median')
ax.legend()

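The temperature follows the same pattern in both years, highest in July and lowest in January. Next, look at how the average hourly rental count changes with temperature.

# Average by temperature
temp = data_train.groupby(['temp'], as_index=True).agg({'count':'mean',
                                                   'registered':'mean',  
                                                   'casual':'mean'})
temp.plot(figsize=(10,5),title='count vs. temperature')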

count is lowest at a temperature of about 4 degrees, then rises overall as the temperature increases, and starts to fall again once the temperature exceeds 35.

Humidity and count

# Overall humidity trend over the two years
humidity_df = data_train.groupby(['date'],as_index=False).agg({'humidity':'mean'})
humidity_df['date']=pd.to_datetime(humidity_df['date'])

# Use the date as a time index
humidity_df = humidity_df.set_index('date')

humidity_month = data_train.groupby(['year','month'],as_index=False).agg({'weekday':'min',
                                                                         'humidity':'mean'})

# Build a date column for the monthly data (the minimum weekday is reused as a placeholder day of month)
humidity_month.rename(columns={'weekday':'day'},inplace=True)
humidity_month['date']=pd.to_datetime(humidity_month[['year','month','day']])

# Set the figure size
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)

# Line plot of humidity over time
ax.set_title('Daily humidity trend over the two years')
plt.plot(humidity_df.index,humidity_df['humidity'], linewidth=1.3, label='daily mean')
plt.plot(humidity_month['date'],humidity_month['humidity'], marker='o',
         linewidth=1.3,label='monthly mean')
plt.grid()
ax.legend()

Next, look at how rentals change with humidity by averaging the rental counts at each humidity value.

# Humidity
humidity = data_train.groupby(['humidity'], as_index=True).agg({'count':'mean',
                                                   'registered':'mean',  
                                                   'casual':'mean'})
humidity.plot(figsize=(10,5),title='count vs. humidity')

Rentals climb quickly to a peak at a humidity of around 20 and then decline slowly.

year, month and count

# First look at how total rentals change over the two years
count_df = data_train.groupby(['date','weekday'], as_index=False).agg({'year':'mean',
                                                                      'month':'mean',
                                                                      'casual':'sum',
                                                                      'registered':'sum',
                                                                       'count':'sum'})

# Daily totals still fluctuate a lot, so also take the monthly mean of the daily totals
count_month = count_df.groupby(['year','month'], as_index=False).agg({'weekday':'min',
                                                                      'casual':'mean', 
                                                                      'registered':'mean',
                                                                      'count':'mean'})

# Convert the date column of the daily data to datetime
count_df['date']=pd.to_datetime(count_df['date'])

# Build a date column for the monthly data (the minimum weekday is reused as a placeholder day of month)
count_month.rename(columns={'weekday':'day'},inplace=True)
count_month['date']=pd.to_datetime(count_month[['year','month','day']])

# Set the figure size
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)

# Line plot of total rentals (count) over time
ax.set_title('Overall trend of count over the two years')
plt.plot(count_df['date'],count_df['count'],linewidth=1.3,label='daily total')

plt.plot(count_month['date'],count_month['count'],marker='o',
         linewidth=1.3,label='monthly mean of daily totals')
plt.grid()
ax.legend()

We can see that:

  1. Rentals were higher overall in 2012 than in 2011;
  2. Rentals vary markedly from month to month;
  3. The data fluctuates sharply from September to December 2011 and from March to September 2012;
  4. There are many local troughs.
# Monthly usage trend
date = data_train.groupby(['month'], as_index=False).agg({'count':'mean',
                                                   'registered':'mean',  
                                                   'casual':'mean'})

fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)
plt.plot(date['month'], date['count'] , linewidth=1.3 , label = 'total' )
plt.plot(date['month'], date['registered'] , linewidth=1.3 , label = 'registered' )
plt.plot(date['month'], date['casual'] , linewidth=1.3 , label = 'casual' )
plt.legend()

Season and count

day_df=data_train.groupby('date').agg({'year':'mean','season':'mean',
                                      'casual':'sum', 'registered':'sum'
                                      ,'count':'sum','temp':'mean',
                                      'atemp':'mean'})
season_df = day_df.groupby(['year','season'], as_index=True).agg({'casual':'mean', 
                                                                  'registered':'mean',
                                                                  'count':'mean'})
temp_df = day_df.groupby(['year','season'], as_index=True).agg({'temp':'mean', 
                                                                'atemp':'mean'})

fig = plt.figure(figsize=(10,10))
xlables = season_df.index.map(lambda x:str(x))

ax1 = fig.add_subplot(2,1,1)
ax1.set_title('count by season over the two years')
plt.plot(xlables,season_df)
plt.legend(['casual','registered','count'])

ax2 = fig.add_subplot(2,1,2)
ax2.set_title('Temperature by season over the two years')
plt.plot(xlables,temp_df)

plt.legend(['temp','atemp'])

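Both casual and registered rentals peak in autumn and are lowest in spring.

Weather and count

Different weather grades occur with very different frequency; very bad weather (grade 4), for example, is rare. So first check how many records each weather grade has, then take the hourly mean of rentals for each grade.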

count_weather = data_train.groupby('weather')
count_weather[['casual','registered','count']].count()



         casual  registered  count
weather                           
1          6719        6719   6719
2          2705        2705   2705
3           839         839    839
4             1           1      1

weather_df = data_train.groupby('weather',as_index=True).agg({'casual':'mean',
                                                              'registered':'mean'})
weather_df.plot.bar(stacked=True)

Rentals are surprisingly high for weather grade 4, which seems counterintuitive, so print the corresponding records and take a look.

data_train[data_train['weather']==4].T



                               4863
datetime        2012-01-09 18:00:00
season                            1
holiday                           0
workingday                        1
weather                           4
temp                            8.2
atemp                        11.365
humidity                         86
windspeed                    6.0032
casual                            6
registered                      158
count                           164
count_log                   5.10595
casual_log                  1.94591
registered_log               5.0689
date                     2012-01-09
weekday                           1
year                           2012
month                             1
day                               9
hour                             18

The only grade-4 record falls in the Monday evening rush hour, so it is an anomalous data point rather than a sign that bad weather increases rentals.

windspeed and count

# Overall wind-speed trend over the two years
windspeed_df = data_train.groupby('date',as_index=False).agg({'windspeed':'mean'})
windspeed_df['date'] = pd.to_datetime(windspeed_df['date'])

# Use the date as a time index
windspeed_df = windspeed_df.set_index('date')


windspeed_month = data_train.groupby(['year','month'], as_index=False).agg({'weekday':'min',
                                                                           'windspeed':'mean'})
windspeed_month.rename(columns={'weekday':'day'},inplace=True)
windspeed_month['date']=pd.to_datetime(windspeed_month[['year','month','day']])

fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)
plt.plot(windspeed_df.index, windspeed_df['windspeed'] , linewidth=1.3,label='daily mean')
plt.plot(windspeed_month['date'], windspeed_month['windspeed'],
         marker='o', linewidth=1.3,label='monthly mean')
ax.legend()
ax.set_title('Overall wind-speed trend over the two years')

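Wind speed fluctuates strongly in September 2011 and from December 2011 to March 2012. Next, look at how rentals change with wind speed. Very high wind speeds are rare, so the statistics at the high end rest on only a few records and can look anomalous. The code below groups the records by integer wind speed and takes the mean of each group.

# Wind speed
# Convert to integers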
data_train['windspeed'] = data_train['windspeed'].astype(int)
windspeed = data_train.groupby(['windspeed'], as_index=True).agg({'count':'mean',
                                                                  'registered':'mean',  
                                                                  'casual':'mean'})
windspeed.plot(figsize=(10,8))

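Rentals fall as wind speed increases and drop noticeably once wind speed exceeds 18, yet there is a rebound at a wind speed of around 20. As with the weather grade, this is probably caused by anomalous records, so print them for inspection.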

df2=data_train[data_train['windspeed']>40]
df2=df2[df2['count']>150]
df2



                 datetime  season  holiday  workingday  weather   temp   atemp  humidity  windspeed  casual  ...  count  count_log  casual_log  registered_log        date  weekday  year  month  day  hour
760   2011-02-19 14:00:00       1        0           0        1  18.86  22.725        15         43     102  ...    196   5.283204    4.634729        4.553877  2011-02-19        6  2011      2   19    14
761   2011-02-19 15:00:00       1        0           0        1  18.04  21.970        16         50      84  ...    171   5.147494    4.442651        4.477337  2011-02-19        6  2011      2   19    15
2447  2011-07-03 17:00:00       3        0           0        3  32.80  37.120        49         56     181  ...    358   5.883322    5.204007        5.181784  2011-07-03        7  2011      7    3    17
2448  2011-07-03 18:00:00       3        0           0        3  32.80  37.120        49         56      74  ...    181   5.204007    4.317488        4.682131  2011-07-03        7  2011      7    3    18
2941  2011-08-07 17:00:00       3        0           0        3  30.34  35.605        74         43      63  ...    194   5.273000    4.158883        4.882802  2011-08-07        7  2011      8    7    17
5590  2012-03-05 18:00:00       1        0           1        3  11.48  11.365        55         43      12  ...    375   5.929589    2.564949        5.897154  2012-03-05        1  2012      3    5    18
5652  2012-03-08 13:00:00       1        0           1        2  24.60  31.060        49         43      35  ...    233   5.455321    3.583519        5.293305  2012-03-08        4  2012      3    8    13
5653  2012-03-08 14:00:00       1        0           1        2  25.42  31.060        43         43      48  ...    203   5.318120    3.891820        5.049856  2012-03-08        4  2012      3    8    14
5654  2012-03-08 15:00:00       1        0           1        1  26.24  31.060        38         46      24  ...    185   5.225747    3.218876        5.087596  2012-03-08        4  2012      3    8    15
5655  2012-03-08 16:00:00       1        0           1        2  25.42  31.060        41         43      37  ...    342   5.837730    3.637586        5.723585  2012-03-08        4  2012      3    8    16
5656  2012-03-08 17:00:00       1        0           1        1  25.42  31.060        38         43      52  ...    597   6.393591    3.970292        6.302619  2012-03-08        4  2012      3    8    17
6015  2012-04-09 12:00:00       2        0           1        1  22.14  25.760        28         47      94  ...    280   5.638355    4.553877        5.231109  2012-04-09        1  2012      4    9    12
7880  2012-09-18 10:00:00       3        0           1        3  27.88  31.820        79         43      30  ...    160   5.081404    3.433987        4.875197  2012-09-18        2  2012      9   18    10
7881  2012-09-18 11:00:00       3        0           1        2  27.88  31.820        79         43      36  ...    151   5.023881    3.610918        4.753590  2012-09-18        2  2012      9   18    11

14 rows × 21 columns

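Effect of the date on rentals

Attributes such as working day, day of week, and year are identical for all records on the same date, so the rental counts are summed per day and the date attributes are averaged.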

day_df = data_train.groupby(['date'], as_index=False).agg({'casual':'sum','registered':'sum',
                                                          'count':'sum', 'workingday':'mean',
                                                          'weekday':'mean','holiday':'mean',
                                                          'year':'mean'})
day_df.head()



         date  casual  registered  count  workingday  weekday  holiday  year
0  2011-01-01     331         654    985           0        6        0  2011
1  2011-01-02     131         670    801           0        7        0  2011
2  2011-01-03     120        1229   1349           1        1        0  2011
3  2011-01-04     108        1454   1562           1        2        0  2011
4  2011-01-05      82        1518   1600           1        3        0  2011

number_pei=day_df[['casual','registered']].mean()
number_pei
casual         657.543860
registered    3040.800439
dtype: float64
# Use equal axis scaling so the pie is drawn as a circle rather than an ellipse
plt.axes(aspect='equal')
plt.pie(number_pei, labels=['casual','registered'], autopct='%1.1f%%', 
        pctdistance=0.6 , labeldistance=1.05 , radius=1)  
plt.title('Casual or registered in the total lease')
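The percentages drawn on the pie chart follow directly from the two daily averages in the number_pei output. A quick check of the arithmetic, using only those two printed values:

```python
# Daily averages taken from the number_pei output
casual_mean = 657.543860
registered_mean = 3040.800439

total = casual_mean + registered_mean
casual_share = casual_mean / total
registered_share = registered_mean / total

print(f"casual: {casual_share:.1%}, registered: {registered_share:.1%}")
```

Registered users therefore account for a little over four fifths of an average day's rentals.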

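Because the numbers of working and non-working days differ, rentals are averaged per day for working and non-working days, and likewise averaged per day for each day of the week.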

workingday_df=day_df.groupby(['workingday'], as_index=True).agg({'casual':'mean', 
                                                                 'registered':'mean'})
workingday_df_0 = workingday_df.loc[0]
workingday_df_1 = workingday_df.loc[1]

# plt.axes(aspect='equal')
fig = plt.figure(figsize=(8,6)) 
plt.subplots_adjust(hspace=0.5, wspace=0.2)     # spacing between subplots
grid = plt.GridSpec(2, 2, wspace=0.5, hspace=0.5)   # grid used to align the subplot axes

plt.subplot2grid((2,2),(1,0), rowspan=2)
width = 0.3       # bar width

p1 = plt.bar(workingday_df.index,workingday_df['casual'], width)
p2 = plt.bar(workingday_df.index,workingday_df['registered'], 
             width,bottom=workingday_df['casual'])
plt.title('Average number of rentals initiated per day')
plt.xticks([0,1], ('nonworking day', 'working day'),rotation=20)
plt.legend((p1[0], p2[0]), ('casual', 'registered'))

plt.subplot2grid((2,2),(0,0))
plt.pie(workingday_df_0, labels=['casual','registered'], autopct='%1.1f%%', 
        pctdistance=0.6 , labeldistance=1.35 , radius=1.3)
plt.axis('equal') 
plt.title('nonworking day')

plt.subplot2grid((2,2),(0,1))
plt.pie(workingday_df_1, labels=['casual','registered'], autopct='%1.1f%%', 
        pctdistance=0.6 , labeldistance=1.35 , radius=1.3)
plt.title('working day')
plt.axis('equal')
(-1.438451504893538,
 1.4304024814759062,
 -1.4388335098293494,
 1.4343901442970892)

weekday_df= day_df.groupby(['weekday'], as_index=True).agg({'casual':'mean', 'registered':'mean'})
weekday_df.plot.bar(stacked=True , title = 'Average number of rentals initiated per day by weekday')

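1. On working days, registered users rent more and casual users rent less;
2. At weekends, registered rentals fall and casual rentals rise.

Holidays
Holidays make up a very small share of the year, so first check how many holidays each year contains.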

holiday_coun=day_df.groupby('year', as_index=True).agg({'holiday':'sum'})
holiday_coun



      holiday
year         
2011        6
2012        7

Holidays are a very small fraction of the days in a year, so compare daily averages for holidays and non-holidays.

holiday_df = day_df.groupby('holiday', as_index=True).agg({'casual':'mean', 'registered':'mean'})
holiday_df.plot.bar(stacked=True , title = 'Average number of rentals initiated ')

Feature engineering

import numpy as np
import pandas as pd
import seaborn as sns
from datetime import datetime

train = pd.read_csv('./data/data31405/train.csv')
test  = pd.read_csv('./data/data31405/test.csv')
# Drop training rows whose count lies more than 3 standard deviations from the mean
train_std = train[np.abs(train['count']-train['count'].mean())<=(3*train['count'].std())]

train_std.reset_index(drop=True,inplace=True)
train_std.shape
(10739, 12)
# Distribution of the target after a log transform
ylabels = train_std['count']
ylabels_log = np.log(ylabels)
sns.distplot(ylabels_log)

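# Concatenate train_std and test so both receive the same transformations

# The row indices carry no meaning, so use ignore_index
# pd.concat is used because DataFrame.append was removed in pandas 2.0
combine_train_test = pd.concat([train_std, test], ignore_index=True)
datetimecol = test['datetime']
print ('Combined dataset:',combine_train_test.shape)
Combined dataset: (17232, 12)
# Record the number of rows (shape[0] is rows, shape[1] is columns)
row_train = train_std.shape[0]
row_test = test.shape[0]
print('Training rows:',row_train,'\nTest rows:',row_test)
Training rows: 10739 
Test rows: 6493
# Split the datetime feature
combine_train_test = split_datetime(combine_train_test)
# Impute wind speed; note that this changes the row order
combine_train_test = RFG_windspeed(combine_train_test)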

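Based on the observations above, we use 11 features: hour, temp, humidity, year, month, season, weather, windspeed, weekday, workingday, and holiday. Because CART decision trees make binary splits, the multi-class categorical features are one-hot encoded into several binary indicators. Note that the column list below, as written, selects weather twice and leaves out holiday, which is why the encoded frame further down contains duplicate weather_* columns.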

combine_feature = combine_train_test[['temp','humidity','weather','season','year','weather',
                                      'month','weekday','hour','workingday','windspeed','count']]

combine_feature.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17232 entries, 0 to 17231
Data columns (total 12 columns):
temp          17232 non-null float64
humidity      17232 non-null int64
weather       17232 non-null int64
season        17232 non-null int64
year          17232 non-null int64
weather       17232 non-null int64
month         17232 non-null int64
weekday       17232 non-null int64
hour          17232 non-null int64
workingday    17232 non-null int64
windspeed     17232 non-null float64
count         10739 non-null float64
dtypes: float64(3), int64(9)
memory usage: 1.6 MB
# One-hot encode the multi-class categorical columns into binary indicators
cols = ['month','season','weather','year']
combine_feature = pd.get_dummies(combine_feature,columns=cols,prefix_sep='_')

combine_feature.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17232 entries, 0 to 17231
Data columns (total 33 columns):
temp          17232 non-null float64
humidity      17232 non-null int64
weekday       17232 non-null int64
hour          17232 non-null int64
workingday    17232 non-null int64
windspeed     17232 non-null float64
count         10739 non-null float64
month_1       17232 non-null uint8
month_2       17232 non-null uint8
month_3       17232 non-null uint8
month_4       17232 non-null uint8
month_5       17232 non-null uint8
month_6       17232 non-null uint8
month_7       17232 non-null uint8
month_8       17232 non-null uint8
month_9       17232 non-null uint8
month_10      17232 non-null uint8
month_11      17232 non-null uint8
month_12      17232 non-null uint8
season_1      17232 non-null uint8
season_2      17232 non-null uint8
season_3      17232 non-null uint8
season_4      17232 non-null uint8
weather_1     17232 non-null uint8
weather_2     17232 non-null uint8
weather_3     17232 non-null uint8
weather_4     17232 non-null uint8
weather_1     17232 non-null uint8
weather_2     17232 non-null uint8
weather_3     17232 non-null uint8
weather_4     17232 non-null uint8
year_2011     17232 non-null uint8
year_2012     17232 non-null uint8
dtypes: float64(3), int64(4), uint8(26)
memory usage: 1.3 MB
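To make the transformation concrete, here is a minimal pure-Python sketch of what one-hot encoding does to a single categorical column. It mirrors the prefix_value column naming seen in the output above, but it is not how pandas implements `get_dummies`; `one_hot` and the sample values are illustrative.

```python
def one_hot(values, prefix):
    """Expand a list of category codes into 0/1 indicator columns."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

seasons = [1, 1, 2, 3, 4, 2]  # made-up sample of season codes
encoded = one_hot(seasons, "season")
for name, column in encoded.items():
    print(name, column)
```

Each row ends up with exactly one 1 across the indicator columns, which is what lets a binary split isolate a single category.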

Building the model

# Split the data back into training and test sets; note that the random-forest wind-speed imputation changed the row order
mask = pd.notnull(combine_feature['count'])
train_data = combine_feature[mask]
test_data = combine_feature[~mask]

train_data.shape,test_data.shape
((10739, 33), (6493, 33))
# Training features
source_X = train_data.drop(['count'],axis = 1)

# Training labels (log1p-transformed count)
source_y  = np.log1p(train_data['count'])

# Test-set features
pred_X = test_data.drop(['count'],axis = 1)
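Training on log1p(count) means that predictions come out on the log scale and need `expm1` to be turned back into rental counts. It also lines the training target up with the competition metric, RMSLE (root mean squared logarithmic error), since squared error on log1p values is exactly what RMSLE measures. A small sketch with made-up predictions; `rmsle` is a hypothetical helper.

```python
import math

def rmsle(actual, predicted):
    """Root mean squared logarithmic error between two sequences of counts."""
    squared = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared) / len(squared))

actual_counts = [16, 40, 32]  # first three training rows
# Pretend model output on the log1p scale (made-up values)
log_predictions = [math.log1p(c) for c in (20, 35, 32)]

# Undo the log1p transform before comparing with real counts
predicted_counts = [math.expm1(v) for v in log_predictions]
print([round(c, 6) for c in predicted_counts])
print(round(rmsle(actual_counts, predicted_counts), 4))
```

Using log1p and expm1 as a pair, rather than log and exp, keeps the round trip well defined when a count is 0.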

Model

from sklearn.model_selection import GridSearchCV

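# Helper that runs a grid search and reports the results
def get_best_model_and_accuracy(model, params, X, y):
    grid = GridSearchCV(model, # model to search over
                        params, # parameter grid to try
                        n_jobs=-1,
                        error_score=0.) # if a fit raises an error, score it as 0
    grid.fit(X, y) # fit the model for every parameter combination
    # Best mean cross-validated score (R² for a regressor, despite the "Accuracy" label)
    print("Best Accuracy: {}".format(grid.best_score_))
    # Parameters that produced the best score
    print("Best Parameters: {}".format(grid.best_params_))
    # Average time to fit (s)
    print("Average Time to Fit (s): {}".format(round(grid.cv_results_['mean_fit_time'].mean(), 3)))
    # Average time to score (s)
    # This gives a sense of how the model would perform in real-world use
    print("Average Time to Score (s): {}".format(round(grid.cv_results_['mean_score_time'].mean(), 3)))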
    return grid
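With no `scoring` argument, `GridSearchCV` falls back to the estimator's own `score` method, which for a regressor is R^2 rather than a classification accuracy. If a different criterion is wanted, one option is to pass `scoring` explicitly. A sketch on synthetic data (the data, the small grid and the choice of `neg_mean_squared_error` are assumptions for illustration, not the settings used in this post):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data: y depends mostly on the first feature.
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

grid = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    {"max_depth": [2, 4]},
    scoring="neg_mean_squared_error",  # scikit-learn maximises, so MSE is negated
    cv=3,
)
grid.fit(X, y)

# best_score_ is the negated MSE; flip the sign to read it as an error.
print(grid.best_params_, -grid.best_score_)
```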
from sklearn.model_selection import train_test_split 

# Split the labelled data
train_X, test_X, train_y, test_y = train_test_split(source_X,
                                                    source_y,
                                                    train_size = 0.80)

# Print the dataset sizes
print ('Source features:',source_X.shape, 'Training features:',train_X.shape,'Test features:',test_X.shape)

print ('Source labels:',source_y.shape, 'Training labels:',train_y.shape,'Test labels:',test_y.shape)
Source features: (10739, 32) Training features: (8591, 32) Test features: (2148, 32)
Source labels: (10739,) Training labels: (8591,) Test labels: (2148,)

Random forest

from sklearn.ensemble import RandomForestRegressor


# Parameter grid
forest_parmas = {'n_estimators':[1300,1500,1700], 'max_depth':range(20,30,4)}

Model = RandomForestRegressor(oob_score=True,n_jobs=-1,random_state = 42)

Model = get_best_model_and_accuracy(Model,forest_parmas ,train_X, train_y)
Best Accuracy: 0.9495761615636888
Best Parameters: {'max_depth': 24, 'n_estimators': 1500}
Average Time to Fit (s): 33.625
Average Time to Score (s): 1.056
Model=Model.best_estimator_
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=24,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=1500, n_jobs=-1,
           oob_score=True, random_state=42, verbose=0, warm_start=False)
# This is a regression problem, so score returns the coefficient of determination (R^2)
Model.score(test_X,test_y)
0.9506555663644386
# Out-of-bag score
Model.oob_score_
0.9544882922113979
# Save the model
import joblib

joblib.dump(Model, "rf.pkl", compress=9)
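A model saved this way can be read back with `joblib.load`. A minimal round-trip sketch, using a small stand-in model and a temporary directory rather than the `rf.pkl` file (both are assumptions for illustration):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model fitted on a noiseless line.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    joblib.dump(model, path, compress=9)   # same compression level as above
    restored = joblib.load(path)

# The reloaded model should reproduce the original predictions.
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```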

xgboost

import xgboost as xg

# Parameter grid; subsample: fraction of rows randomly sampled for each tree
xg_parmas = {'subsample':[i/10.0 for i in range(6,10)],
            'colsample_bytree':[i/10.0 for i in range(6,10)]} # fraction of columns randomly sampled for each tree

xg_model = xg.XGBRegressor(max_depth=8,min_child_weight=6,gamma=0.4)

xg_model = get_best_model_and_accuracy(xg_model,xg_parmas,train_X.values, train_y.values)
Best Accuracy: 0.9519465121003385
Best Parameters: {'colsample_bytree': 0.9, 'subsample': 0.9}
Average Time to Fit (s): 1.136
Average Time to Score (s): 0.019
xg_model=xg_model.best_estimator_
xg_model
XGBRegressor(base_score=0.5, booster=None, colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.9, gamma=0.4, gpu_id=-1,
       importance_type='gain', interaction_constraints=None,
       learning_rate=0.300000012, max_delta_step=0, max_depth=8,
       min_child_weight=6, missing=nan, monotone_constraints=None,
       n_estimators=100, n_jobs=0, num_parallel_tree=1,
       objective='reg:squarederror', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, subsample=0.9, tree_method=None,
       validate_parameters=False, verbosity=None)
from sklearn.metrics import mean_absolute_error

pre_y = xg_model.predict(test_X.values)
# MAE on the log1p scale (arguments are y_true, y_pred)
mean_absolute_error(test_y.values, pre_y)
0.20327569987981398
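This MAE is measured on the log1p scale, so it is not an error in bikes. The Kaggle competition is scored with RMSLE, which is the RMSE between log1p-transformed values, so it can be computed directly from predictions that are already in log space. A sketch with made-up numbers (the helper name `rmsle_from_log` is hypothetical):

```python
import numpy as np

def rmsle_from_log(log_true, log_pred):
    """RMSLE when both inputs are already log1p-transformed counts."""
    log_true = np.asarray(log_true, dtype=float)
    log_pred = np.asarray(log_pred, dtype=float)
    return float(np.sqrt(np.mean((log_pred - log_true) ** 2)))

# Made-up true and predicted hourly counts.
true_counts = np.array([10, 3, 37, 90])
pred_counts = np.array([12, 2, 30, 100])

score = rmsle_from_log(np.log1p(true_counts), np.log1p(pred_counts))

# The same quantity written out from the definition on raw counts.
direct = np.sqrt(np.mean((np.log(pred_counts + 1) - np.log(true_counts + 1)) ** 2))
print(score, direct)
```

For the models in this post, the inputs would be `test_y.values` and the model's predictions on `test_X`, since both are already on the log1p scale.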

learning curve

from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=-1, 
                        train_sizes=np.linspace(.05, 1., 20), verbose=0, plot=True):
    """
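    Plot the learning curve of a model on the given data.
    Parameters
    ----------
    estimator : the estimator to evaluate.
    title : title of the figure.
    X : input features (array-like, e.g. a numpy array)
    y : target vector
    ylim : tuple (ymin, ymax) setting the lower and upper limits of the y-axis
    cv : number of cross-validation folds; one fold is held out for validation and the remaining n-1 are used for training (None uses scikit-learn's default)
    n_jobs : number of parallel jobs (default -1, i.e. all available cores)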
    """
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    if plot:
        plt.figure()
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)
        plt.xlabel(u"Number of training samples")
        plt.ylabel(u"Score")
        plt.gca().invert_yaxis()
        plt.grid() # show grid lines
    
        plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, 
                         alpha=0.1, color="b") # shade the band of one standard deviation around the mean
        plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, 
                         alpha=0.1, color="r")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="b", label=u"train score")
        plt.plot(train_sizes, test_scores_mean, 'o-', color="r", label=u"test score")
    
        plt.legend(loc="best")
        
        plt.draw()
        plt.gca().invert_yaxis()
        plt.show()
    
    midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) + (test_scores_mean[-1] - test_scores_std[-1])) / 2
    diff = (train_scores_mean[-1] + train_scores_std[-1]) - (test_scores_mean[-1] - test_scores_std[-1])
    return midpoint, diff
plot_learning_curve(Model, u"Learning curve",train_X,train_y)

[Figure: learning curve of the random forest model, showing train score and test score against the number of training samples]

(0.969214095199126, 0.048081832611191144)
# Predict on the test set
pred_value = Model.predict(pred_X)
# Invert the log1p transform applied to the training labels
pred_value = np.expm1(pred_value)
submission = pd.DataFrame({'datetime':datetimecol, 'count':pred_value})
submission['count'] = submission['count'].astype(int)
submission.to_csv('bike_predictions.csv',index = False)
submission.head()



              datetime  count
0  2011-01-20 00:00:00     10
1  2011-01-20 01:00:00      3
2  2011-01-20 02:00:00      3
3  2011-01-20 03:00:00      6
4  2011-01-20 04:00:00     37
5  2011-01-20 05:00:00     90