Time Series Processing for Machine Learning

Reposted

mob64ca1417eedd 2024-07-25 18:59:56


[Step by Step] Time Series: Handling Dates

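Time series are the most common data type in quantitative finance: they record the values a variable or feature takes along the time axis, such as the daily closing price of a stock from 2008 to 2018. A key step in quantitative analysis is mining historical data to understand how something has evolved and where it may be heading (forecasting). Time series analysis is by now a fairly mature field, covering general statistical analysis (stationarity, autocorrelation, spectral analysis), statistical modeling and inference, forecasting (including popular machine learning and deep learning approaches such as LSTM models), and filtering and control. When analyzing time series in Python, you constantly run into date and time formatting and conversion problems, especially in visualization and in statistics grouped by period. As an introductory guide to time series analysis, this article focuses on how to handle dates in Python and how to compute statistics by period, in the hope of serving as a starting point for further exploration.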

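First import the usual data analysis and plotting libraries (pandas, numpy, matplotlib) and configure matplotlib so that Chinese characters display correctly. pyecharts, seaborn, or bokeh can also be used for plotting. All code below was run in a Jupyter notebook (Python 3.7).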

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
# Display Chinese characters and minus signs correctly in plots
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False

1 Handling dates with datetime

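Python's standard libraries for working with time are datetime, time, and calendar. The datetime module provides date (stores a calendar date: year, month, day), time (stores a time of day: hour, minute, second, microsecond), datetime (combines date and time), and timedelta (the difference between two datetime values, stored as days, seconds, and microseconds).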
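As a minimal sketch of how these four classes fit together (the specific dates and times are made up for illustration):

```python
from datetime import date, time, datetime, timedelta

d = date(2019, 1, 10)            # calendar date only
t = time(9, 30, 15)              # time of day only
dt = datetime.combine(d, t)      # date and time together
gap = dt - datetime(2019, 1, 1)  # subtracting datetimes gives a timedelta

print(dt)                        # 2019-01-10 09:30:15
print(gap.days, gap.seconds)     # 9 34215
print(gap.total_seconds())       # 811815.0
```

Note that timedelta stores only days, seconds, and microseconds, so hours and minutes are folded into the seconds field.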

from datetime import datetime
now = datetime.now()
print(f'Current time: {now}')
print(f'Year {now.year}, month {now.month}, day {now.day}')
Output:
Current time: 2019-01-11 10:25:21.445490
Year 2019, month 1, day 11
now.strftime('%Y-%m-%d')
Output:
'2019-01-11'
# Difference between two datetimes
delta = datetime(2019, 1, 10) - datetime(2019, 1, 1, 12, 30)
delta
Output:
datetime.timedelta(days=8, seconds=41400)
from datetime import timedelta
start = datetime(2018, 1, 1)
# Which day is 50 days later?
start + timedelta(50)
Output:
datetime.datetime(2018, 2, 20, 0, 0)
# Converting between strings and datetimes
# For example, how many days apart are two date strings in a list?
datestr = ['12/20/2018', '12/11/2018']
new_date = [datetime.strptime(d, '%m/%d/%Y') for d in datestr]
new_date[0] - new_date[1]
Output:
datetime.timedelta(days=9)
# Format the datetimes as year (Y), month (m), day (d)
[date.strftime('%Y-%m-%d') for date in new_date]
Output:
['2018-12-20', '2018-12-11']

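datetime.strptime can only parse strings that match the single format you give it. If a list contains strings in more than one format, such as datestr=['12/20/2018','12/11/2018','2018-10-18'], datetime.strptime alone becomes awkward. In that case the third-party package dateutil helps: its parser recognizes most common date formats without a format string.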

from dateutil.parser import parse
datestr = ['12/20/2018', '20180210', '2019-01-10']
# Convert to datetime objects
new_d = [parse(d) for d in datestr]
# Re-emit every date in one consistent format
d1 = [d.strftime('%m/%d/%Y') for d in new_d]
d2 = [d.strftime('%Y%m%d') for d in new_d]
d3 = [d.strftime('%Y-%m-%d') for d in new_d]
d4 = [d.strftime('%y-%m-%d') for d in new_d]
print(f'datetime objects:\n{new_d}')
print(f'"month/day/year" format:\n {d1}')
print(f'"yearmonthday" format:\n{d2}')
print(f'"year-month-day" format:\n{d3}')
print(f'"two-digit year-month-day" format:\n{d4}')
Output:
datetime objects:
[datetime.datetime(2018, 12, 20, 0, 0),
 datetime.datetime(2018, 2, 10, 0, 0),
 datetime.datetime(2019, 1, 10, 0, 0)]
"month/day/year" format:
 ['12/20/2018', '02/10/2018', '01/10/2019']
"yearmonthday" format:
['20181220', '20180210', '20190110']
"year-month-day" format:
['2018-12-20', '2018-02-10', '2019-01-10']
"two-digit year-month-day" format:
['18-12-20', '18-02-10', '19-01-10']
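When the set of possible formats is known in advance, a standard-library alternative is to try each format in turn with strptime. The sketch below is one way to do that; the helper name parse_with_formats and the KNOWN_FORMATS tuple are hypothetical and not part of any library.

```python
from datetime import datetime

# Hypothetical list of formats we expect to encounter; extend as needed.
KNOWN_FORMATS = ('%m/%d/%Y', '%Y%m%d', '%Y-%m-%d')

def parse_with_formats(text, formats=KNOWN_FORMATS):
    """Return the first successful strptime parse, or raise ValueError."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError(f'no known format matches {text!r}')

parsed = [parse_with_formats(s)
          for s in ['12/20/2018', '20180210', '2019-01-10']]
print(parsed[1])  # 2018-02-10 00:00:00
```

Compared with dateutil, this is stricter: an ambiguous string such as '01/02/2019' is always read according to the formats you list, rather than guessed.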

2 Handling dates with NumPy

NumPy is mainly used for array operations (for example linear algebra), but it also handles dates and times well. Its date-time type is datetime64.

# Convert a string to a NumPy datetime
# Note the zero padding: January must be written as 01
nd = np.datetime64('2019-01-10')
nd
Output:
numpy.datetime64('2019-01-10')
# Convert to a string
np.datetime_as_string(nd)
Output:
'2019-01-10'
np.datetime64('1901')
Output:
numpy.datetime64('1901')
# Convert to a standard-library object (day precision yields a datetime.date)
nd.astype(datetime)
Output:
datetime.date(2019, 1, 10)
# Generate a date sequence
# The step is one day; the start is included, the end is not
np.arange('2019-01-05', '2019-01-10', dtype='datetime64[D]')
Output:
array(['2019-01-05', '2019-01-06', '2019-01-07',
       '2019-01-08', '2019-01-09'], dtype='datetime64[D]')
# Monthly steps: the 12 months of 2018
np.arange('2018-01-01', '2019-01-01', dtype='datetime64[M]')
Output:
array(['2018-01', '2018-02', '2018-03', '2018-04',
       '2018-05', '2018-06', '2018-07', '2018-08',
       '2018-09', '2018-10', '2018-11', '2018-12'],
      dtype='datetime64[M]')
# Yearly steps
np.arange('2015-01-01', '2019-01-20', dtype='datetime64[Y]')
Output:
array(['2015', '2016', '2017', '2018'],
      dtype='datetime64[Y]')
# Weekly steps
np.arange('2018-12-01', '2018-12-20', dtype='datetime64[W]')
Output:
array(['2018-11-29', '2018-12-06', '2018-12-13'],
      dtype='datetime64[W]')
# Fix the random seed so the results are reproducible (the number itself is arbitrary)
np.random.seed(1)
# Time units: h = hour, m = minute, s = second, ms = millisecond
# Generate minute-level timestamps
x = np.arange('2019-01-10T00:00:00',
              '2019-01-10T23:00:00', dtype='datetime64[m]')
# Generate a standard normal random series of the same length
y = np.random.standard_normal(len(x))
# Set the figure size
fig = plt.figure(figsize=(12, 6))
# Convert x from np.datetime64 to datetime.datetime for plotting
plt.plot(x.astype(datetime), y)
fig.autofmt_xdate()
plt.title('Simulated standard normal random numbers, one per minute over 23 hours')
# Hide the right and top spines
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
plt.show()
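Beyond generating sequences, datetime64 values support arithmetic directly. The following is a small sketch with made-up dates, using np.timedelta64 for offsets and np.busday_count for counting weekdays (Monday to Friday by default):

```python
import numpy as np

start = np.datetime64('2019-01-01')
end = np.datetime64('2019-01-10')

span = end - start                       # a timedelta64 measured in days
later = start + np.timedelta64(50, 'D')  # 50 days after start
weekdays = np.busday_count(start, end)   # weekdays in [start, end)
month_of = end.astype('datetime64[M]')   # truncate to month precision

print(span, later, weekdays, month_of)
```

np.busday_count knows nothing about exchange holidays unless you pass them in, so for trading calendars this only gives an upper bound.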


3 Handling dates with pandas

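pandas is the workhorse for time series. Its date handling lets you filter, display, and aggregate data by date. The main pandas time types are Timestamp (a point in time), Period (a span of time), and Timedelta (a duration). Commonly used functions and methods are pd.to_datetime(), pd.date_range(), pd.period_range(), and the to_period() method. The resample method converts between sampling frequencies, for example between high-frequency and low-frequency data.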

01 Time formats and conversion
Defining time objects and converting between them. Commonly used: pd.Timestamp(), pd.Period(), pd.to_datetime(), and the to_timestamp() and to_period() methods.

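# Define timestamps
t1 = pd.Timestamp('2019-01-10')
t2 = pd.Timestamp('2018-12-10')
print(f't1= {t1}')
print(f't2= {t2}')
print(f'Days between t1 and t2: {(t1-t2).days}')
Output:
t1= 2019-01-10 00:00:00
t2= 2018-12-10 00:00:00
Days between t1 and t2: 31
# Get the current time (pd.datetime was removed from recent pandas; use pd.Timestamp.now())
now = pd.Timestamp.now()
print(now)
print(now.strftime('%Y-%m-%d'))
Output:
2019-01-11 11:56:49.014612
2019-01-11
# Durations
pd.Timedelta(days=5, minutes=50, seconds=20,
   milliseconds=10, microseconds=10, nanoseconds=10)
Output:
Timedelta('5 days 00:50:20.010010010')
# The date 100 days after the current time
dt = now + pd.Timedelta(days=100)
# Show only year, month, and day
dt.strftime('%Y-%m-%d')
Output:
'2019-04-21'
# Define periods; the default frequency is A-DEC: annual, with December as the last month
p1 = pd.Period('2019')
p2 = pd.Period('2018')
print(f'p1={p1}')
print(f'p2={p2}')
# In recent pandas, Period - Period returns an offset object; .n is its integer count
print(f'p1 and p2 are {(p1-p2).n} year(s) apart')
# Adding or subtracting an integer shifts by that many years
print(f'Ten years before p1 is {p1-10}')
Output:
p1=2019
p2=2018
p1 and p2 are 1 year(s) apart
Ten years before p1 is 2009
# Change the period frequency with asfreq
# Take the first month (p1 was set to 2019 above)
p1.asfreq('M', 'start')
Output:
Period('2019-01', 'M')
# Take the last month
p1.asfreq('M', 'end')
Output:
Period('2019-12', 'M')
# Fiscal quarters
p = pd.Period('2019Q3', freq='Q-DEC')
# First day of the quarter
print(p.asfreq('D', 'start'))
# Last day of the quarter
print(p.asfreq('D', 'end'))
Output:
2019-07-01
2019-09-30
# Converting between timestamps and periods
# (recent pandas returns the last nanosecond of the period for how='end';
#  older versions printed 2019-12-31 00:00:00)
print(p1.to_timestamp(how='end'))
print(p1.to_timestamp(how='start'))
Output:
2019-12-31 23:59:59.999999999
2019-01-01 00:00:00
# t1 was set to '2019-01-10' above
# Convert to a monthly period
print(t1.to_period('M'))
# Convert to a daily period, then a weekly period
print(t1.to_period('D'))
print(t1.to_period('W'))
Output:
2019-01
2019-01-10
2019-01-07/2019-01-13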
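pd.to_datetime() is listed among the commonly used functions but not demonstrated above. A minimal sketch on a made-up column of strings follows; errors='coerce' turns anything that does not match the format, or is not a real date, into NaT instead of raising.

```python
import pandas as pd

# Hypothetical raw column: one valid date, one impossible date, one junk value
raw = pd.Series(['2019-01-10', '2019-02-30', 'not a date'])

stamps = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
months = stamps.dt.to_period('M')   # vectorized Timestamp -> Period conversion

print(stamps)
print(months)
```

Checking stamps.isna() afterwards is a quick way to find the rows that failed to parse.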

02 Generating date sequences
Commonly used functions: pd.date_range(), which returns a DatetimeIndex, and pd.period_range(), which returns a PeriodIndex.

# Generate date sequences with date_range
# For full details, see help(pd.date_range)
# Supply three of the four parameters: start, end, freq, periods
# freq: 'M' month end, 'D' day, 'W' week, 'Y' year end
# (pandas 2.2+ renames the period-end aliases: 'M' -> 'ME', 'Y' -> 'YE';
#  the outputs below come from an older version)
# Monthly sequence
dm = pd.date_range('2018/01/01', freq='M', periods=12)
print(f'Monthly sequence:\n{dm}')
# Yearly sequence; the default anchors on December, i.e. freq='Y-DEC'
# Endpoints are included only if they fall on the frequency:
# 2019-01-10 is not a year end, so the last value is 2018-12-31
dy = pd.date_range('2008-01-01', '2019-01-10', freq='Y')
print(f'Yearly sequence:\n{dy}')
# Daily sequence
dd = pd.date_range('2018-01-01', freq='D', periods=10)
print(f'Daily sequence:\n{dd}')
# Weekly sequence; by default weeks end on Sunday (W-SUN), so they start on Monday
# For weeks that end on Saturday (start on Sunday), use freq='W-SAT'
dw = pd.date_range('2018-01-01', freq='W', periods=10)
print(f'Weekly sequence:\n{dw}')
Output:
Monthly sequence:
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
               '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31'],
              dtype='datetime64[ns]', freq='M')
Yearly sequence:
DatetimeIndex(['2008-12-31', '2009-12-31', '2010-12-31', '2011-12-31',
               '2012-12-31', '2013-12-31', '2014-12-31', '2015-12-31',
               '2016-12-31', '2017-12-31', '2018-12-31'],
              dtype='datetime64[ns]', freq='A-DEC')
Daily sequence:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10'],
              dtype='datetime64[ns]', freq='D')
Weekly sequence:
DatetimeIndex(['2018-01-07', '2018-01-14', '2018-01-21', '2018-01-28',
               '2018-02-04', '2018-02-11', '2018-02-18', '2018-02-25',
               '2018-03-04', '2018-03-11'],
              dtype='datetime64[ns]', freq='W-SUN')
# Generate period sequences with period_range
# Supply three of the four parameters: start, end, freq, periods
# freq: 'M' month, 'D' day, 'W' week, 'Y' year
# Monthly periods
dpm = pd.period_range('2019/01/01', freq='M', periods=12)
print(f'Monthly periods:\n{dpm}')
# Yearly periods; the default anchors on December, i.e. freq='Y-DEC'
dpy = pd.period_range('2008-01-01', '2019-01-10', freq='Y')
print(f'Yearly periods:\n{dpy}')
# Daily periods
dpd = pd.period_range('2018-01-01', freq='D', periods=10)
print(f'Daily periods:\n{dpd}')
# Weekly periods; W-SUN means each week ends on Sunday (starts on Monday)
# For weeks that end on Saturday (start on Sunday), use freq='W-SAT'
dpw = pd.period_range('2018-01-01', freq='W-SUN', periods=10)
print(f'Weekly periods:\n{dpw}')
Output:
Monthly periods:
PeriodIndex(['2019-01', '2019-02', '2019-03', '2019-04', '2019-05', '2019-06',
             '2019-07', '2019-08', '2019-09', '2019-10', '2019-11', '2019-12'],
            dtype='period[M]', freq='M')
Yearly periods:
PeriodIndex(['2008', '2009', '2010', '2011', '2012', '2013',
             '2014', '2015', '2016', '2017', '2018', '2019'],
            dtype='period[A-DEC]', freq='A-DEC')
Daily periods:
PeriodIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
             '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
             '2018-01-09', '2018-01-10'],
            dtype='period[D]', freq='D')
Weekly periods:
PeriodIndex(['2018-01-01/2018-01-07', '2018-01-08/2018-01-14',
             '2018-01-15/2018-01-21', '2018-01-22/2018-01-28',
             '2018-01-29/2018-02-04', '2018-02-05/2018-02-11',
             '2018-02-12/2018-02-18', '2018-02-19/2018-02-25',
             '2018-02-26/2018-03-04', '2018-03-05/2018-03-11'],
            dtype='period[W-SUN]', freq='W-SUN')
# Plot with time on the x-axis; a pandas DataFrame uses its index as the x-axis
np.random.seed(2)
# Generate the date sequence
x = pd.date_range('2018/01/01', '2019/12/31', freq='d')
#x = pd.period_range('2018/01/01', '2019/12/31', freq='d')
# Standard normal random series
y = np.random.standard_normal(len(x))
# Combine the two into a pandas DataFrame
df = pd.DataFrame(y, columns=['standard normal'], index=x)
df.plot(figsize=(12, 6))
plt.title('Simulated standard normal random numbers')
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
plt.show()
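To make the "three of four parameters" rule and the endpoint behaviour concrete, here is a small sketch with made-up dates. It uses the business-day alias 'B' (Monday to Friday) and shows that date_range includes both endpoints when they fall on the frequency.

```python
import pandas as pd

# start + end + freq: business days only, both endpoints included
bd = pd.date_range('2019-01-01', '2019-01-10', freq='B')

# end + periods + freq: count backwards from the end date
tail = pd.date_range(end='2019-01-10', periods=3, freq='D')

print(len(bd), bd[0].date(), bd[-1].date())
print([str(ts.date()) for ts in tail])
```

Like np.busday_count, 'B' skips weekends only; exchange holidays need a custom calendar.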


03 Converting sampling frequency

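Resampling goes in two directions: from high frequency to low frequency (downsampling) and from low frequency to high frequency (upsampling). A typical use case: market data is high-frequency, while fundamentals are low-frequency (monthly, quarterly, annual). Quantitative analysis often combines the two in a regression, which requires converting one to the frequency of the other. The main tool is df.resample(), where df is a pandas DataFrame. Its first argument is the target frequency, for example 'M', '5min', or an offset object such as Second(15). The aggregation is chosen by calling a method on the result, for example .mean(), .ohlc(), .max(), .first(), .last(), .median(), or .min(); older pandas versions took it as a how= argument that defaulted to 'mean'. Resampling runs along the row index (axis=0) by default.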
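A minimal downsampling sketch on synthetic data (ten made-up one-minute prices, not real market data) shows how the aggregation method is chosen after resample():

```python
import numpy as np
import pandas as pd

# Ten synthetic one-minute "prices": 0.0, 1.0, ..., 9.0 starting at 09:30
idx = pd.date_range('2019-01-10 09:30', periods=10, freq='min')
prices = pd.Series(np.arange(10, dtype=float), index=idx)

five_min_mean = prices.resample('5min').mean()  # average of each 5-minute bin
five_min_ohlc = prices.resample('5min').ohlc()  # open/high/low/close per bin

print(five_min_mean)
print(five_min_ohlc)
```

The two bins are [09:30, 09:35) and [09:35, 09:40), so the means are 2.0 and 7.0, and the first bin's open/high/low/close are 0, 4, 0, 4.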

High frequency to low frequency:

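```python
# Load intraday data for the Shanghai Composite Index on 2019-01-10
# Data source: 同花顺 (Tonghuashun)
df = pd.read_excel('Table.xlsx')
df.head()
# Use the time column ('时间') as the index; the column itself is kept for later use
df = df.set_index(df['时间'])
# Plot the traded price column ('成交'); pandas uses the index as the x-axis
df['成交'].plot(figsize=(16, 6), label='Traded price')
plt.title('Shanghai Composite Index intraday chart, 2019-01-10', fontsize=15)
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
plt.show()
```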


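# The time column holds only time-of-day values (object dtype),
# so attach the calendar date before resampling
d = datetime(2019, 1, 10)
dt = pd.to_datetime([datetime.combine(d, t) for t in df['时间'].values])
# Build a new DataFrame indexed by full timestamps
ts = pd.DataFrame(df['成交'].values, columns=['成交'], index=dt)
ts.head()
# 5-minute bars, taking the last value in each bin; the label defaults to the left edge,
# so the bin (9:20, 9:25] is shown as 9:20
ts.resample('5min', closed='right').last().head()
# 5-minute bars, taking the last value, labelled by the right edge
ts.resample('5min', closed='right', label='right').last().head()
# Convert to hourly bars; the defaults are closed='left', label='left'
# Use mean(), first(), or last() as needed ('h' replaces 'H' in pandas 2.2+)
ts.resample('H').mean()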
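Because the effect of closed and label is easy to get wrong, here is a tiny sketch on six synthetic one-minute observations (values 0 to 5, not real market data) that compares the defaults with closed='right', label='right':

```python
import pandas as pd

# Six synthetic observations at 09:30, 09:31, ..., 09:35 with values 0..5
idx = pd.date_range('2019-01-10 09:30', periods=6, freq='min')
s = pd.Series(range(6), index=idx)

# Default: bins are [09:30, 09:35) and [09:35, 09:40), labelled by the left edge
left = s.resample('5min').last()

# Right-closed: bins are (09:25, 09:30] and (09:30, 09:35], labelled by the right edge
right = s.resample('5min', closed='right', label='right').last()

print(left)
print(right)
```

Both results carry the labels 09:30 and 09:35, but they describe different bins: with the defaults the 09:30 bar ends with value 4, while with right-closed bins the 09:30 bar contains only the 09:30 observation (value 0).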

Low frequency to high frequency:

# freq='W' means weekly
df = pd.DataFrame(np.random.randn(5, 4),
            index=pd.date_range('1/4/2019', periods=5, freq='W'),
            columns=['GZ', 'BJ', 'SH', 'SZ'])
df
# Convert the weekly sample to a daily series, filling gaps with the previous value
# Use bfill() to fill with the next value instead
df_daily = df.resample('D').ffill()
df_daily.head()
# Resampling on a PeriodIndex
df1 = pd.DataFrame(np.random.randn(2, 4),
            index=pd.period_range('1-2017', '12-2018', freq='A'),
            columns=['GZ', 'BJ', 'SH', 'SZ'])
df1.head()
# Q-DEC: quarterly, with the year ending in December
df1.resample('Q-DEC').ffill()
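Forward fill is not the only way to populate the new rows when upsampling. As a sketch on two made-up weekly values, the block below compares ffill() with linear interpolation via the resampler's interpolate() method:

```python
import pandas as pd

# Two synthetic weekly observations, one week apart
weekly = pd.Series([1.0, 8.0],
                   index=pd.to_datetime(['2019-01-06', '2019-01-13']))

daily_ffill = weekly.resample('D').ffill()         # repeat the last known value
daily_interp = weekly.resample('D').interpolate()  # straight line between points

print(daily_ffill.tolist())
print(daily_interp.tolist())
```

Which one is appropriate depends on the data: forward fill never uses future information, whereas interpolation does, which matters if the upsampled series feeds a backtest.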

04 Grouping and aggregating by date

# pd is pandas and np is numpy; import them before use
date = pd.date_range('1/1/2018', periods=500, freq='D')
ts = pd.Series(np.random.standard_normal(500), index=date)
ts.head()
Output:
2018-01-01    0.681604
2018-01-02    1.006493
2018-01-03   -0.942035
2018-01-04   -0.733425
2018-01-05   -1.035250
Freq: D, dtype: float64
# Convert the index to daily periods (display only, no aggregation)
# Use 'M' for monthly, 'Q' for quarterly, 'A' for annual periods
tsp = ts.to_period('D')
tsp.head()
Output:
2018-01-01    0.681604
2018-01-02    1.006493
2018-01-03   -0.942035
2018-01-04   -0.733425
2018-01-05   -1.035250
Freq: D, dtype: float64
# Show the index at a different period frequency
# Q = quarterly, M = monthly, A = annual
tsp.index.asfreq('Q')
Output:
PeriodIndex(['2018Q1', '2018Q1', '2018Q1', '2018Q1', '2018Q1', '2018Q1',
             '2018Q1', '2018Q1', '2018Q1', '2018Q1',
             ...
             '2019Q2', '2019Q2', '2019Q2', '2019Q2', '2019Q2', '2019Q2',
             '2019Q2', '2019Q2', '2019Q2', '2019Q2'],
            dtype='period[Q-DEC]', length=500, freq='Q-DEC')
# Map the index to business days (weekend dates roll forward to the next Monday)
tsp.index.asfreq('B')
Output:
PeriodIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
             '2018-01-05', '2018-01-08', '2018-01-08', '2018-01-08',
             '2018-01-09', '2018-01-10',
             ...
             '2019-05-06', '2019-05-07', '2019-05-08', '2019-05-09',
             '2019-05-10', '2019-05-13', '2019-05-13', '2019-05-13',
             '2019-05-14', '2019-05-15'],
            dtype='period[B]', length=500, freq='B')
# Aggregate by week, summing the values
# Month: M, year: A, quarter: Q
# Aggregations: sum(), mean(), first(), last()
print(ts.resample('W').sum().head())
Output:
2018-01-07   -0.532703
2018-01-14   -3.905250
2018-01-21   -0.037820
2018-01-28   -4.010447
2018-02-04   -2.165019
Freq: W-SUN, dtype: float64
# 'AS' labels each year by its first day, 'A' by its last day
# (pandas 2.2+ spells these 'YS' and 'YE')
print(ts.resample('AS').sum())
Output:
2018-01-01    0.434155
2019-01-01    0.171082
Freq: AS-JAN, dtype: float64
# Aggregate by year and display as annual periods
print(ts.resample('AS').sum().to_period('A'))
Output:
2018    0.434155
2019    0.171082
Freq: A-DEC, dtype: float64
# Aggregate by quarter and display as quarterly periods
print(ts.resample('Q').sum().to_period('Q'))
Output:
2018Q1   -23.716613
2018Q2    -3.304391
2018Q3    15.039522
2018Q4    12.415637
2019Q1     1.153160
2019Q2    -0.982078
Freq: Q-DEC, dtype: float64
Resampling with groupby
# Group by year and take the mean
ts.groupby(lambda x: x.year).mean()
Output:
2018    0.001189
2019    0.001267
dtype: float64
# Group by day of the week (0 = Monday, 6 = Sunday) and take the mean
ts.groupby(lambda x: x.weekday).mean()
Output:
0   -0.021922
1   -0.005215
2   -0.002857
3   -0.037211
4   -0.133802
5    0.011450
6    0.198504
dtype: float64
