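Bike Sharing Demand Analysis in Python

Repost

mob64ca14106f2f 2024-06-14 21:58:06

Tags: bike sharing, Python, data analysis, data visualization, sklearn, machine learning, random forest. Category: Python, backend development

Kaggle Bike Sharing Demand: A Detailed Project Walkthrough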

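1. Exploring the data
2. Data preprocessing
3. Analyzing the data

3.1 Effect of hour of day on rentals
3.2 Effect of temperature on rentals
3.3 Effect of humidity on rentals
3.4 Effect of year and month on rentals
3.5 Effect of weather on rentals
3.6 Effect of wind speed on rentals
3.7 Rentals at different wind speeds
3.8 Inspecting the wind speed anomaly
3.9 Effect of date on rentals
Share of casual and registered rentals per day
3.10 Rentals on working and non-working days
3.11 Effect of weekday on rentals
3.12 Rentals on holidays

4 Building the machine learning models

4.1 Converting multi-category features to binary features
4.2 Dropping unneeded columns, keeping only the binary feature columns
4.3 Choosing and training the model
4.4 Predicting on the test set
Save the results
4.5 Predicting count with a logistic regression model
Save the results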

1. Exploring the data

Import libraries

%matplotlib inline

import numpy as np
import pandas as pd 
from datetime import datetime

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt

import seaborn as sns
sns.set(style='whitegrid' , palette='tab10')

Load the data

# Load the training data
train=pd.read_csv('E:/PythonData/Kaggle_Data/train.csv')

View the data

train.head()

Check the shape of the data

train.shape

# Check the training set for missing values
train.info()

Analysis: the output above shows that the training set contains no missing values.
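The `info()` output has to be read column by column. A more direct check is to count the nulls explicitly. The sketch below uses a small made-up frame (`demo` and its values are illustrative, not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up)
demo = pd.DataFrame({
    "temp": [9.84, 9.02, np.nan],
    "humidity": [81, 80, 75],
    "count": [16, 40, 32],
})

# Number of missing values in each column
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# True only when no column contains a missing value
has_no_missing = not demo.isnull().values.any()
print(has_no_missing)
```

Applied to `train`, the same two expressions should report zero for every column if the conclusion above holds.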

Summary statistics

# Descriptive statistics for the training set
train.describe()

Distribution of the rental count

# Plot the distribution of rental counts
sns.distplot(train['count'])

# Add a title
plt.title('Distribution of count')
plt.show()

2. Data preprocessing

# Drop rows whose rental count lies more than 3 standard deviations from the mean
train_WithoutOutliers = train[np.abs(train['count']-
                        train['count'].mean())<=(3*train['count'].std())] 

train_WithoutOutliers.shape

print('Removed %s rows in total' % (len(train) - len(train_WithoutOutliers)))

Removed 147 rows in total
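To make the 3-sigma rule concrete, here is a minimal sketch on made-up data. The helper name `remove_outliers_3sigma` and the sample values are assumptions for illustration; the filtering logic mirrors the expression used above.

```python
import numpy as np
import pandas as pd

def remove_outliers_3sigma(df, column):
    """Keep rows whose value in `column` is within 3 standard deviations of the mean."""
    mean = df[column].mean()
    std = df[column].std()
    mask = (df[column] - mean).abs() <= 3 * std
    return df[mask]

# Made-up counts: 50 ordinary values plus one extreme value
demo = pd.DataFrame({"count": [10] * 25 + [12] * 25 + [1000]})
cleaned = remove_outliers_3sigma(demo, "count")

# Number of rows removed by the filter
print(len(demo) - len(cleaned))
```

Note that the extreme value itself inflates the mean and standard deviation, so the rule only removes points that stand out even against that inflated spread.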

# Summary statistics of count after removing the outliers
train_WithoutOutliers['count'].describe()

Compare the distribution of the total rental count before and after removing the 3-sigma outliers

# Distribution of count in the original data
sns.distplot(train['count'])
plt.title('Distribution of count')
plt.show()

# Distribution of count after removing the outliers
sns.distplot(train_WithoutOutliers['count'])
plt.title('Distribution of count(train_WithoutOutliers)')
plt.show()

Log-transform the target

# count varies widely, so convert it to its natural logarithm
yLabels=train_WithoutOutliers['count']
yLabels_log=np.log(yLabels)
sns.distplot(yLabels_log)
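The log transform compresses the long right tail, and `np.exp` undoes it when predictions are converted back to counts. A small sketch with made-up values (the `counts` array is illustrative only):

```python
import numpy as np

# Made-up right-skewed counts (illustrative values only)
counts = np.array([1, 3, 8, 20, 55, 150, 400, 977], dtype=float)

# Forward transform, as applied to the training target
log_counts = np.log(counts)

# Inverse transform, as applied to model output before submission
restored = np.exp(log_counts)

# The round trip recovers the original values
print(np.allclose(restored, counts))

# The range shrinks from a ratio of hundreds to a difference below 7
print(counts.max() / counts.min(), log_counts.max() - log_counts.min())
```

`np.log` is undefined at 0. If a target column could contain zeros, `np.log1p` with `np.expm1` as its inverse would be one possible substitute; that is a suggestion here, not something the walkthrough itself does.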

Load the test set

test=pd.read_csv('E:/PythonData/Kaggle_Data/test.csv')

View the data

test.head()


test.info()

Combine the datasets

Bike_data=pd.concat([train_WithoutOutliers,test],ignore_index=True)
Bike_data.head()


Bike_data.tail()

Shape of the combined data

# Check the shape of the combined data
Bike_data.shape

Convert data types

Split the datetime string into date, hour, year, month, and weekday

# Derive date, hour, year, month, and weekday from the datetime string
Bike_data['date']=Bike_data.datetime.apply( lambda c : c.split( )[0])
Bike_data['hour']=Bike_data.datetime.apply( lambda c : c.split( )[1].split(':')[0]).astype('int')
Bike_data['year']=Bike_data.datetime.apply( lambda c : c.split( )[0].split('-')[0]).astype('int')
Bike_data['month']=Bike_data.datetime.apply( lambda c : c.split( )[0].split('-')[1]).astype('int')
Bike_data['weekday']=Bike_data.date.apply( lambda c : datetime.strptime(c,'%Y-%m-%d').isoweekday())
Bike_data.head()
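The string splitting above works, but the same columns can also be derived with pandas' datetime accessor. This is an alternative sketch, not the original author's method; the two timestamps in `demo` are made up.

```python
import pandas as pd

# Two made-up timestamps in the same format as the datetime column
demo = pd.DataFrame({"datetime": ["2011-01-01 05:00:00", "2012-12-19 23:00:00"]})

parsed = pd.to_datetime(demo["datetime"])
demo["date"] = parsed.dt.strftime("%Y-%m-%d")   # kept as a string, as in the code above
demo["hour"] = parsed.dt.hour
demo["year"] = parsed.dt.year
demo["month"] = parsed.dt.month
demo["weekday"] = parsed.dt.dayofweek + 1        # Monday=1 ... Sunday=7, like isoweekday()

print(demo)
```

Parsing once and reusing `parsed` avoids running a Python lambda over every row for each derived column.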

Distribution of the numeric features

# Distribution of temp (temperature), atemp (feels-like temperature), humidity, and windspeed
fig, axes = plt.subplots(2, 2)
fig.set_size_inches(12,10)

sns.distplot(Bike_data['temp'],ax=axes[0,0])
sns.distplot(Bike_data['atemp'],ax=axes[0,1])
sns.distplot(Bike_data['humidity'],ax=axes[1,0])
sns.distplot(Bike_data['windspeed'],ax=axes[1,1])

axes[0,0].set(title='Distribution of temp',)
axes[0,1].set(title='Distribution of atemp')
axes[1,0].set(title='Distribution of humidity')
axes[1,1].set(title='Distribution of windspeed')

plt.show()

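Analysis: the plot shows a large spike of records with a wind speed of exactly 0. These values are clearly abnormal, most likely missing readings recorded as 0.

# Summary statistics of windspeed for the records where it is non-zero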
Bike_data[Bike_data['windspeed']!=0]['windspeed'].describe()

Fill the zero wind speeds with a random forest model

# Impute wind speed with a random forest
from sklearn.ensemble import RandomForestRegressor

Bike_data["windspeed_rfr"]=Bike_data["windspeed"]

# Split the data into rows with zero and non-zero wind speed
# (.copy() so the later assignment does not write into a view of Bike_data)
dataWind0 = Bike_data[Bike_data["windspeed_rfr"]==0].copy()
dataWindNot0 = Bike_data[Bike_data["windspeed_rfr"]!=0]

# Choose the model
rfModel_wind = RandomForestRegressor(n_estimators=1000,random_state=42)

# Choose the features
windColumns = ["season","weather","humidity","month","temp","year","atemp"]

# Train the RandomForestRegressor on the rows with non-zero wind speed
rfModel_wind.fit(dataWindNot0[windColumns], dataWindNot0["windspeed_rfr"])

# Predict wind speed with the trained model
wind0Values = rfModel_wind.predict(X= dataWind0[windColumns])

# Fill the predictions into the rows with zero wind speed
dataWind0.loc[:,"windspeed_rfr"] = wind0Values

# Concatenate the two parts (DataFrame.append was removed in pandas 2.0)
Bike_data = pd.concat([dataWindNot0, dataWind0])
Bike_data.reset_index(inplace=True)
Bike_data.drop('index',inplace=True,axis=1)
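The walkthrough does not check how well the random forest predicts wind speed. One possible check, sketched below on synthetic data, is to hold out some rows whose wind speed is known and compare the model's error with a baseline that always predicts the mean. The helper `imputation_mae`, the smaller `n_estimators=50`, and the synthetic relationship between the columns are all assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def imputation_mae(known, feature_cols, target_col, seed=42):
    """Hold out part of the rows with a known target and compare the random
    forest's mean absolute error against predicting the training mean."""
    train_part, held_out = train_test_split(known, test_size=0.25, random_state=seed)
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(train_part[feature_cols], train_part[target_col])
    model_mae = np.abs(model.predict(held_out[feature_cols]) - held_out[target_col]).mean()
    baseline_mae = np.abs(train_part[target_col].mean() - held_out[target_col]).mean()
    return model_mae, baseline_mae

# Synthetic stand-in data: wind speed depends loosely on temp and humidity
rng = np.random.default_rng(0)
demo = pd.DataFrame({"temp": rng.uniform(0, 40, 400),
                     "humidity": rng.uniform(20, 100, 400)})
demo["windspeed"] = 30 - 0.2 * demo["humidity"] + 0.1 * demo["temp"] + rng.normal(0, 1, 400)

model_mae, baseline_mae = imputation_mae(demo, ["temp", "humidity"], "windspeed")
print(round(model_mae, 2), round(baseline_mae, 2))
```

On the real data, the same function could be called with `dataWindNot0`, `windColumns`, and `"windspeed_rfr"`. If the model's error is not clearly below the baseline, the imputed values add little information.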

After the imputation, plot the density of the four features again

fig, axes = plt.subplots(2, 2)
fig.set_size_inches(12,10)

sns.distplot(Bike_data['temp'],ax=axes[0,0])
sns.distplot(Bike_data['atemp'],ax=axes[0,1])
sns.distplot(Bike_data['humidity'],ax=axes[1,0])
sns.distplot(Bike_data['windspeed_rfr'],ax=axes[1,1])

axes[0,0].set(title='Distribution of temp',)
axes[0,1].set(title='Distribution of atemp')
axes[1,0].set(title='Distribution of humidity')
axes[1,1].set(title='Distribution of windspeed')

plt.show()

3. Analyzing the data

Relationship between the features and casual, registered, and total rentals

# Plot casual, registered, and total rentals against the other features
sns.pairplot(Bike_data ,x_vars=['holiday','workingday','weather','season',
                                'weekday','hour','windspeed_rfr','humidity','temp','atemp'] ,
                        y_vars=['casual','registered','count'] , plot_kws={'alpha': 0.1})
plt.show()

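Analysis:
1. Registered users ride more on working days and less on holidays; casual users show the opposite pattern.
2. Rentals are lower overall in the first quarter.
3. Rentals decrease as the weather category rises.
4. Hour of day has a clear effect: registered users show two peaks, while casual users follow a single bell-shaped curve.
5. Rentals decrease as wind speed increases.
6. Temperature and humidity affect casual users strongly and registered users only slightly.

Correlation matrix

# Build the correlation matrix (numeric columns only; datetime and date are strings)
corrDf = Bike_data.corr(numeric_only=True)

# ascending=False sorts in descending order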
corrDf['count'].sort_values(ascending =False)

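Analysis:

By strength of correlation with the rental count, the features rank as hour > temperature > humidity > year > month > season > weather category > wind speed > weekday > working day > holiday. Next, we look at overall bike usage in more detail.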
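A ranking by strength of influence has to ignore the sign of the correlation. A plain descending sort would place strongly negative features such as humidity at the bottom. The sketch below ranks by absolute value on a made-up frame (all values in `demo` are illustrative):

```python
import pandas as pd

# Made-up hourly records (illustrative values only)
demo = pd.DataFrame({
    "hour":     [0, 6, 8, 12, 17, 22],
    "humidity": [80, 75, 70, 50, 45, 65],
    "holiday":  [0, 0, 1, 0, 0, 0],
    "count":    [5, 40, 60, 150, 200, 180],
})

# Correlation of every feature with count, excluding count itself
corr_with_count = demo.corr(numeric_only=True)["count"].drop("count")

# Strongest first, regardless of sign
ranking = corr_with_count.abs().sort_values(ascending=False)
print(ranking)
```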

3.1 Effect of hour of day on rentals

# Effect of hour of day on rentals
# Mean casual, registered, and total rentals per hour on working and non-working days
workingday_df=Bike_data[Bike_data['workingday']==1]
workingday_df = workingday_df.groupby(['hour'], as_index=True).agg({'casual':'mean',
                                                                    'registered':'mean',
                                                                    'count':'mean'})

nworkingday_df=Bike_data[Bike_data['workingday']==0]
nworkingday_df = nworkingday_df.groupby(['hour'], as_index=True).agg({'casual':'mean',
                                                                      'registered':'mean', 
                                                                      'count':'mean'})
fig, axes = plt.subplots(1, 2,sharey = True)

workingday_df.plot(figsize=(15,5),title = 'The average number of rentals initiated per hour in the working day',ax=axes[0])
nworkingday_df.plot(figsize=(15,5),title = 'The average number of rentals initiated per hour in the nonworkdays',ax=axes[1])

plt.show()

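Analysis:
1. On working days, registered users peak twice, at the morning and evening commute, with a smaller peak at midday that may come from people going out for lunch.
2. Casual users follow a flatter curve that peaks at around 17:00.
3. Registered users rent far more bikes than casual users.
4. On non-working days, rentals follow a smooth bell-shaped curve over the day, with a peak at around 14:00 and a trough at around 04:00.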
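The two filtered group-bys above can also be expressed as a single pivot table with one column per day type. This is an alternative sketch on made-up numbers, not the author's code:

```python
import pandas as pd

# Made-up hourly records for two hours on working (1) and non-working (0) days
demo = pd.DataFrame({
    "hour":       [8, 8, 8, 8, 14, 14, 14, 14],
    "workingday": [1, 1, 0, 0, 1, 1, 0, 0],
    "registered": [300, 340, 40, 60, 120, 100, 150, 170],
    "casual":     [20, 30, 30, 50, 40, 60, 140, 160],
})

# Rows: hour. Columns: (user type, workingday). Cells: mean rentals.
hourly = demo.pivot_table(index="hour", columns="workingday",
                          values=["casual", "registered"], aggfunc="mean")
print(hourly)
```

A single table makes it easy to compare, for example, `hourly.loc[8, ("registered", 1)]` with `hourly.loc[8, ("registered", 0)]` directly.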

3.2 Effect of temperature on rentals

# Hourly data is too noisy to plot directly, so aggregate by day and take the daily median temperature
temp_df = Bike_data.groupby(['date','weekday'], as_index=False).agg({'year':'mean',
                                                                     'month':'mean',
                                                                     'temp':'median'})

# Drop any rows with missing values so the line plot has no gaps
temp_df.dropna ( axis = 0 , how ='any', inplace = True )

# Daily values still fluctuate a lot, so also take the monthly median
temp_month = temp_df.groupby(['year','month'], as_index=False).agg({'weekday':'min',
                                                                    'temp':'median'})

# Convert the date column of the daily data to datetime
temp_df['date']=pd.to_datetime(temp_df['date'])

# Build a date column for the monthly data (the minimum weekday number stands in for the day of the month)
temp_month.rename(columns={'weekday':'day'},inplace=True)
temp_month['date']=pd.to_datetime(temp_month[['year','month','day']])

# Set the figure size
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)

# Line plot of temperature over time
plt.plot(temp_df['date'] , temp_df['temp'] , linewidth=1.3 , label='Daily median')
ax.set_title('Change trend of average temperature per day in two years')
plt.plot(temp_month['date'] , temp_month['temp'] , marker='o', linewidth=1.3 ,
         label='Monthly median')
ax.legend()

plt.show()

# Mean casual, registered, and total rentals at each temperature
temp_rentals = Bike_data.groupby(['temp'], as_index=True).agg({'casual':'mean', 
                                                               'registered':'mean',
                                                               'count':'mean'})

temp_rentals .plot(title = 'The average number of rentals initiated per hour changes with the temperature')
plt.show()

Analysis:

Rentals rise overall as the temperature increases but start to fall once the temperature exceeds about 35 degrees. The minimum occurs at around 4 degrees.
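Grouping by every distinct temperature value gives some groups only a handful of records, which makes the curve jagged at the extremes. One way to smooth this, sketched below on made-up values, is to group into fixed-width bins with `pd.cut`. The 5-degree bin width is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up temperature and rental values (illustrative only)
demo = pd.DataFrame({
    "temp":  [3.5, 4.1, 12.0, 18.5, 24.6, 29.5, 33.0, 36.9, 38.0],
    "count": [20, 15, 90, 160, 230, 280, 300, 260, 240],
})

# 5-degree bins: [0, 5), [5, 10), ..., [35, 40)
bins = list(range(0, 45, 5))
demo["temp_bin"] = pd.cut(demo["temp"], bins=bins, right=False)

# Mean rentals and number of records per bin; observed=True skips empty bins
binned = demo.groupby("temp_bin", observed=True)["count"].agg(["mean", "size"])
print(binned)
```

Reporting the group size next to the mean shows which parts of the curve rest on few observations.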

3.3 Effect of humidity on rentals

humidity_df = Bike_data.groupby('date', as_index=False).agg({'humidity':'mean'})
humidity_df['date']=pd.to_datetime(humidity_df['date'])

# Use the date as a datetime index
humidity_df=humidity_df.set_index('date')

humidity_month = Bike_data.groupby(['year','month'], as_index=False).agg({'weekday':'min',
                                                                          'humidity':'mean'})
humidity_month.rename(columns={'weekday':'day'},inplace=True)
humidity_month['date']=pd.to_datetime(humidity_month[['year','month','day']])

fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)
plt.plot(humidity_df.index , humidity_df['humidity'] , linewidth=1.3,label='Daily average')
plt.plot(humidity_month['date'], humidity_month['humidity'] ,marker='o', 
         linewidth=1.3,label='Monthly average')
ax.legend()
ax.set_title('Change trend of average humidity per day in two years')
plt.show()

# Likewise, mean rentals at each humidity level
humidity_rentals = Bike_data.groupby(['humidity'], as_index=True).agg({'casual':'mean',
                                                                       'registered':'mean',
                                                                       'count':'mean'})

humidity_rentals .plot (title = 'Average number of rentals initiated per hour in different humidity')
plt.show()

Analysis:
Rentals climb quickly to a peak at a humidity of around 20 and then decline slowly.

3.4 Effect of year and month on rentals

# Hourly data is too noisy to plot directly, so aggregate by day.
# The test set has no rental counts, and summing its NaN values would give 0
# rather than NaN, so keep only the rows with a known count before grouping.
known_df = Bike_data[Bike_data['count'].notnull()]
count_df = known_df.groupby(['date','weekday'], as_index=False).agg({'year':'mean',
                                                                     'month':'mean',
                                                                     'casual':'sum',
                                                                     'registered':'sum',
                                                                     'count':'sum'})

# Drop any remaining rows with missing values so the line plot has no gaps
count_df.dropna ( axis = 0 , how ='any', inplace = True )

# Daily totals still fluctuate a lot, so also take the monthly mean of the daily totals
count_month = count_df.groupby(['year','month'], as_index=False).agg({'weekday':'min',
                                                                      'casual':'mean', 
                                                                      'registered':'mean',
                                                                      'count':'mean'})

# Convert the date column of the daily data to datetime
count_df['date']=pd.to_datetime(count_df['date'])

# Build a date column for the monthly data (the minimum weekday number stands in for the day of the month)
count_month.rename(columns={'weekday':'day'},inplace=True)
count_month['date']=pd.to_datetime(count_month[['year','month','day']])

# Set the figure size
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)

# Line plot of total rentals (count) over time
plt.plot(count_df['date'] , count_df['count'] , linewidth=1.3 , label='Daily total')
ax.set_title('Change trend of average number of rentals initiated  per day in two years')
plt.plot(count_month['date'] , count_month['count'] , marker='o', 
         linewidth=1.3 , label='Monthly average of daily totals')
ax.legend()
plt.show()

Group by year and season to see how rentals and temperature vary across seasons

day_df=Bike_data.groupby('date').agg({'year':'mean','season':'mean',
                                      'casual':'sum', 'registered':'sum'
                                      ,'count':'sum','temp':'mean',
                                      'atemp':'mean'})
season_df = day_df.groupby(['year','season'], as_index=True).agg({'casual':'mean', 
                                                                  'registered':'mean',
                                                                  'count':'mean'})
temp_df = day_df.groupby(['year','season'], as_index=True).agg({'temp':'mean', 
                                                                'atemp':'mean'})

season_df.plot()
temp_df.plot()
plt.show()

3.5 Effect of weather on rentals

count_weather = Bike_data.groupby('weather')
count_weather[['casual','registered','count']].count()

Mean rentals for each weather category

weather_df = Bike_data.groupby('weather', as_index=True).agg({'casual':'mean',
                                                              'registered':'mean'})


weather_df.plot.bar(stacked=True,title = 'Average number of rentals initiated per hour in different weather')
plt.xticks(rotation=360)
plt.show()

Records with weather category 4:

Bike_data[Bike_data['weather']==4]

3.6 Effect of wind speed on rentals

windspeed_df = Bike_data.groupby('date', as_index=False).agg({'windspeed_rfr':'mean'})
windspeed_df['date']=pd.to_datetime(windspeed_df['date'])
# Use the date as a datetime index
windspeed_df=windspeed_df.set_index('date')

windspeed_month = Bike_data.groupby(['year','month'], as_index=False).agg({'weekday':'min',
                                                                           'windspeed_rfr':'mean'})
windspeed_month.rename(columns={'weekday':'day'},inplace=True)
windspeed_month['date']=pd.to_datetime(windspeed_month[['year','month','day']])

fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(1,1,1)
plt.plot(windspeed_df.index , windspeed_df['windspeed_rfr'] , linewidth=1.3,label='Daily average')
plt.plot(windspeed_month['date'], windspeed_month['windspeed_rfr'] ,
         marker='o', linewidth=1.3,label='Monthly average')
ax.legend()
ax.set_title('Change trend of average number of windspeed  per day in two years')
plt.show()

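Wind speed fluctuates strongly in September 2011 and from December 2011 to March 2012. Next, look at how rentals change with wind speed. Very high wind speeds are rare, so a mean over those few records would be distorted; the plot therefore uses the maximum rentals at each wind speed instead.

3.7 Rentals at different wind speeds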

windspeed_rentals = Bike_data.groupby(['windspeed'], as_index=True).agg({'casual':'max', 
                                                                         'registered':'max',
                                                                         'count':'max'})

windspeed_rentals .plot(title = 'Max number of rentals initiated per hour in different windspeed')
plt.show()

Analysis: above a wind speed of about 30, rentals decrease as wind speed increases, but there is an anomaly at wind speeds of around 43-44.
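Whether a spike like this reflects a real pattern depends on how many records stand behind it. A sketch of reporting the group size next to the maximum and mean, using made-up values (the threshold of 3 records is an arbitrary choice):

```python
import pandas as pd

# Made-up records: many at low wind speeds, a single one at a high wind speed
demo = pd.DataFrame({
    "windspeed": [7.0, 7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 43.0],
    "count":     [120, 300, 80, 500, 90, 410, 150, 450],
})

# Maximum, mean, and number of records at each wind speed
summary = demo.groupby("windspeed")["count"].agg(["max", "mean", "size"])
print(summary)

# Wind speeds backed by fewer than 3 records
sparse = summary[summary["size"] < 3]
print(sparse)
```

When a group holds a single record, its maximum and mean are the same value, so one busy hour is enough to produce a spike.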

3.8 Inspecting the wind speed anomaly

# Records with wind speed above 40 and more than 400 rentals
df2=Bike_data[Bike_data['windspeed']>40]
df2=df2[df2['count']>400]
df2

3.9 Effect of date on rentals

day_df = Bike_data.groupby(['date'], as_index=False).agg({'casual':'sum','registered':'sum',
                                                          'count':'sum', 'workingday':'mean',
                                                          'weekday':'mean','holiday':'mean',
                                                          'year':'mean'})
day_df.head()

Share of casual and registered rentals per day

number_pei=day_df[['casual','registered']].mean()
number_pei

Pie chart of the two shares

plt.axes(aspect='equal')  
plt.pie(number_pei, labels=['casual','registered'], autopct='%1.1f%%', 
        pctdistance=0.6 , labeldistance=1.05 , radius=1 )  

plt.title('Casual or registered in the total lease')
plt.show()

3.10 Rentals on working and non-working days

# Group the daily data by working day
workingday_df=day_df.groupby(['workingday'], as_index=True).agg({'casual':'mean', 
                                                                 'registered':'mean'})

# Select the rows for non-working days (0) and working days (1)
workingday_df_0 = workingday_df.loc[0]
workingday_df_1 = workingday_df.loc[1]

fig = plt.figure(figsize=(8,6)) 
plt.subplots_adjust(hspace=0.5, wspace=0.2)     # spacing between subplots

# Bottom-left cell of the 2x2 grid
plt.subplot2grid((2,2),(1,0))
width = 0.3       # bar width

# Stacked bars of casual and registered rentals on non-working and working days
p1 = plt.bar(workingday_df.index,workingday_df['casual'], width)
p2 = plt.bar(workingday_df.index,workingday_df['registered'], 
             width,bottom=workingday_df['casual'])
plt.title('Average number of rentals initiated per day')
plt.xticks([0,1], ('nonworking day', 'working day'),rotation=20)
plt.legend((p1[0], p2[0]), ('casual', 'registered'))

# Pie chart of casual vs. registered rentals on non-working days
plt.subplot2grid((2,2),(0,0))
plt.pie(workingday_df_0, labels=['casual','registered'], autopct='%1.1f%%', 
        pctdistance=0.6 , labeldistance=1.35 , radius=1.3)
plt.axis('equal') 
plt.title('nonworking day')

# Pie chart of casual vs. registered rentals on working days
plt.subplot2grid((2,2),(0,1))
plt.pie(workingday_df_1, labels=['casual','registered'], autopct='%1.1f%%', 
        pctdistance=0.6 , labeldistance=1.35 , radius=1.3)
plt.title('working day')
plt.axis('equal')
plt.show()

3.11 Effect of weekday on rentals

weekday_df= day_df.groupby(['weekday'], as_index=True).agg({'casual':'mean', 'registered':'mean'})
weekday_df.plot.bar(stacked=True , title = 'Average number of rentals initiated per day by weekday')

plt.xticks(rotation=360)
plt.show()

Analysis:
1. On weekdays, registered users rent more and casual users rent less.
2. On weekends, rentals by registered users fall and rentals by casual users rise.

3.12 Rentals on holidays

holiday_count=day_df.groupby('year', as_index=True).agg({'holiday':'sum'})
holiday_count


holiday_df = day_df.groupby('holiday', as_index=True).agg({'casual':'mean', 'registered':'mean'})
holiday_df.plot.bar(stacked=True , title = 'Average number of rentals initiated per day by holiday or not')

plt.xticks(rotation=360)
plt.show()

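4 Building the machine learning models

Based on the observations above, the following 11 features are used: hour, temperature (temp), humidity, year, month, season, weather category (weather), wind speed (windspeed_rfr), weekday, working day (workingday), and holiday.

4.1 Converting multi-category features to binary features

CART decision trees make binary splits, so the multi-category features are one-hot encoded into several binary columns.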

dummies_month = pd.get_dummies(Bike_data['month'], prefix= 'month')
dummies_season=pd.get_dummies(Bike_data['season'],prefix='season')
dummies_weather=pd.get_dummies(Bike_data['weather'],prefix='weather')
dummies_year=pd.get_dummies(Bike_data['year'],prefix='year')
# Join the 4 new DataFrames to the original table
Bike_data=pd.concat([Bike_data,dummies_month,dummies_season,dummies_weather,dummies_year],axis=1)

dummies_month.head()
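`pd.get_dummies` can also encode several columns in one call through its `columns` parameter. Unlike the concat approach above, this drops the original categorical columns right away. A sketch on a made-up frame (values are illustrative):

```python
import pandas as pd

# Made-up records with two categorical columns and one numeric column
demo = pd.DataFrame({
    "season":  [1, 2, 3, 4],
    "weather": [1, 1, 2, 3],
    "temp":    [9.8, 18.0, 29.5, 14.8],
})

# One indicator column per category value, prefixed with the source column name
encoded = pd.get_dummies(demo, columns=["season", "weather"])
print(list(encoded.columns))
```

In the walkthrough, the original columns are still needed when splitting the data back into training and test sets, which is one reason to keep the two-step approach there.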


Bike_data.head()


dataTrain = Bike_data[pd.notnull(Bike_data['count'])]
dataTest= Bike_data[~pd.notnull(Bike_data['count'])].sort_values(by=['datetime'])
datetimecol = dataTest['datetime']
yLabels=dataTrain['count']
yLabels_log=np.log(yLabels)

dataTrain.head()


dataTest.head()


print(dataTrain.shape)
print(dataTest.shape)

4.2 Dropping unneeded columns, keeping only the binary feature columns

dropFeatures = ['casual' , 'count' , 'datetime' , 'date' , 'registered' ,
                'windspeed' , 'atemp' , 'month','season','weather', 'year' ]

dataTrain = dataTrain.drop(dropFeatures , axis=1)
dataTest = dataTest.drop(dropFeatures , axis=1)

View the data

dataTrain.head()


print(dataTrain.shape)


dataTest.head()


dataTest.shape

4.3 Choosing and training the model

# n_estimators is the number of trees in the forest
rfModel = RandomForestRegressor(n_estimators=1000 , random_state = 42)

rfModel.fit(dataTrain , yLabels_log)

preds = rfModel.predict( X = dataTrain)

preds


preds.shape
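Predicting on the same rows the model was trained on shows how well it fits but not how well it generalizes. A common way to score predictions on this task is the root mean squared logarithmic error (RMSLE), the metric this Kaggle competition uses. The sketch below defines it with NumPy; the function name `rmsle` and the sample arrays are illustrative assumptions.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original count scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Made-up actual counts and two sets of predictions
actual = np.array([16, 40, 32, 13, 1])
perfect = actual.copy()
doubled = actual * 2

print(rmsle(actual, perfect), rmsle(actual, doubled))
```

To estimate generalization, one option is to split `dataTrain` and `yLabels_log` with sklearn's `train_test_split`, fit on one part, and score `np.exp` of the predictions against the actual counts of the other part.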

4.4 Predicting on the test set

predsTest= rfModel.predict(X = dataTest)

# Combine the test-set predictions and their timestamps into a DataFrame
submission=pd.DataFrame({'datetime':datetimecol , 'count':[max(0,x) for x in np.exp(predsTest)]})

submission.head()


submission.shape

Save the results

submission.to_csv('bike_predictions.csv',index=False)

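4.5 Predicting count with a logistic regression model

Logistic regression is a classifier, so the log-transformed counts are truncated to integers here and treated as class labels.

from sklearn.linear_model import LogisticRegression

# Logistic regression model
logreg = LogisticRegression()
logreg.fit(dataTrain , yLabels_log.astype('int'))
Y_pred_logreg = logreg.predict(dataTrain)
acc_log = round(logreg.score(dataTrain , yLabels_log.astype('int'))*100,2)

# Predictions on the training set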
Y_pred_logreg.shape


Y_pred_logreg


Y_pred_logreg = logreg.predict(dataTest)
Y_pred_logreg.shape


Y_pred_logreg


submission2=pd.DataFrame({'datetime':datetimecol , 'count':[max(0,x) for x in np.exp(Y_pred_logreg)]})
submission2.head()

Save the results

# Use a separate file name so the random forest submission is not overwritten
submission2.to_csv('bike_predictions_logreg.csv',index=False)
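Because the logistic regression is trained on `yLabels_log.astype('int')`, it can only ever predict a handful of distinct values. The sketch below shows how coarse that is for an illustrative range of counts from 1 to 1000 (the range is an assumption, not the dataset's actual range):

```python
import numpy as np

# Every count from 1 to 1000 (illustrative range)
counts = np.arange(1, 1001)

# Truncate the log to an integer, as yLabels_log.astype('int') does
labels = np.log(counts).astype(int)

# Distinct class labels the classifier can learn
classes = np.unique(labels)
print(classes)

# The only count values that np.exp(label) can produce
print(np.exp(classes).round(1))
```

Counts that differ by a factor of almost e share one label, so the back-transformed predictions are much coarser than those of the random forest regressor. A regression model, for example sklearn's `LinearRegression` fitted on `yLabels_log` directly, would avoid the truncation; that is a suggestion here rather than part of the original walkthrough.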

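This article is a repost, and we respect the original author's copyright. If the content contains errors or infringes any rights, the original author is welcome to contact us to correct or remove it.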
