The code below is meant to be run in order in IPython. If you are working in PyCharm instead, call everything through the package: for example, instead of
Series([1, 2, 3, 4, 5], index=labels)
write
pd.Series([1, 2, 3, 4, 5], index=labels)
where pd is the alias given to pandas on import.
Data analysis / idea generation
Open IPython
from IPython.display import IFrame
IFrame(src="http://pandas.pydata.org", width=800, height=350)
----------
<IPython.lib.display.IFrame at 0x222d1e0dc50>
import datetime
import pandas as pd
import pandas_datareader
from pandas import Series, DataFrame
print(pd.__version__)
----------
0.25.1
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rc('figure', figsize=(8, 7))
mpl.__version__
----------
3.0.2
Creating / loading time series data
- Python structures
- Yahoo Finance
- CSV files
Python structures
labels = ['a', 'b', 'c', 'd', 'e']
s = Series([1, 2, 3, 4, 5], index=labels)
s
----------
a 1
b 2
c 3
d 4
e 5
dtype: int64
'b' in s
----------
True
s['b']
----------
2
mapping = s.to_dict()
mapping
----------
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
Series(mapping)
----------
a 1
b 2
c 3
d 4
e 5
dtype: int64
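As a quick aside, constructing a Series from a dict with an explicit index aligns on the labels, and any label missing from the dict comes back as NaN. A minimal sketch using the mapping dict from above (the label 'f' is just a hypothetical example):
# Labels not found in the dict become NaN, and the dtype is upcast to float64.
pd.Series(mapping, index=['b', 'c', 'f'])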
Yahoo Finance
import pandas_datareader as pd_data
import datetime
aapl = pd_data.data.get_data_yahoo('AAPL',
                                   start=datetime.datetime(2006, 10, 1),
                                   end=datetime.datetime(2012, 1, 1))
aapl.head()
----------
High Low Open Close Volume Adj Close
Date
2006-10-02 10.838572 10.614285 10.728572 10.694285 178159800 9.283074
2006-10-03 10.707143 10.455714 10.635715 10.582857 197677200 9.186352
2006-10-04 10.780000 10.451428 10.585714 10.768572 207270700 9.347562
2006-10-05 10.880000 10.590000 10.647142 10.690000 170970800 9.279355
2006-10-06 10.720000 10.544286 10.631429 10.602858 116739700 9.203714
CSV files
You need this file under the current working directory:
data/aapl_ohlc.csv
Date,Open,High,Low,Close,Volume,Adj Close
2006-10-02,75.1,75.87,74.3,74.86,25451400,73.29
2006-10-03,74.45,74.95,73.19,74.08,28239600,72.52
2006-10-04,74.1,75.46,73.16,75.38,29610100,73.8
2006-10-05,74.53,76.16,74.13,74.83,24424400,73.26
2006-10-06,74.42,75.04,73.81,74.22,16677100,72.66
2006-10-09,73.8,75.08,73.53,74.63,15650800,73.06
2006-10-10,74.54,74.58,73.08,73.81,18985300,72.26
2006-10-11,73.42,73.98,72.6,73.23,20423400,71.69
2006-10-12,73.61,75.39,73.6,75.26,21173400,73.68
aapl.to_csv('data/aapl_ohlc.csv')
!head data/aapl_ohlc.csv
df = pd.read_csv('data/aapl_ohlc.csv', index_col='Date', parse_dates=True)
df.head()
----------
High Low Open Close Volume Adj Close
Date
2006-10-02 10.838572 10.614285 10.728572 10.694285 178159800 9.283074
2006-10-03 10.707143 10.455714 10.635715 10.582857 197677200 9.186352
2006-10-04 10.780000 10.451428 10.585714 10.768572 207270700 9.347562
2006-10-05 10.880000 10.590000 10.647142 10.690000 170970800 9.279355
2006-10-06 10.720000 10.544286 10.631429 10.602858 116739700 9.203714
df.index
----------
DatetimeIndex(['2006-10-02', '2006-10-03', '2006-10-04', '2006-10-05',
'2006-10-06', '2006-10-09', '2006-10-10', '2006-10-11',
'2006-10-12', '2006-10-13',
...
'2011-12-16', '2011-12-19', '2011-12-20', '2011-12-21',
'2011-12-22', '2011-12-23', '2011-12-27', '2011-12-28',
'2011-12-29', '2011-12-30'],
dtype='datetime64[ns]', name='Date', length=1323, freq=None)
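Because the index is a DatetimeIndex, pandas also supports partial-string date indexing. A small sketch (the dates are just examples within the downloaded range):
# Select by year, by month, or by an inclusive date-range slice.
df.loc['2007'].head()
df.loc['2007-01'].head()
df.loc['2007-01-03':'2007-01-10']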
Replacing pandas.io.data
The pandas.io.data module used in the original article has been deprecated; the functionality now lives in a separate package that must be installed. In this article it is used as: import pandas_datareader as pd_data
Documentation: https://pandas-datareader.readthedocs.io/en/latest/
Package: https://pypi.org/project/pandas-datareader/
- v0.8.0 is the last release to support Python 2.7, so be sure to install v0.8.1 or later.
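To check which release you actually have installed (a quick sketch; install with pip install pandas-datareader if it is missing):
# Verify the installed pandas-datareader version is recent enough (>= 0.8.1, see the note above).
import pandas_datareader as pd_data
print(pd_data.__version__)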
Series and DataFrame: first steps
- Indexing
- Slicing
- Creating new columns
ts = df['Close'][-10:]
ts
----------
Date
2011-12-16 54.431427
2011-12-19 54.601429
2011-12-20 56.564285
2011-12-21 56.635715
2011-12-22 56.935715
2011-12-23 57.618572
2011-12-27 58.075714
2011-12-28 57.520000
2011-12-29 57.874287
2011-12-30 57.857143
Name: Close, dtype: float64
A DataFrame is a collection of Series objects; slicing a column out of a DataFrame returns a Series.
type(ts)
----------
pandas.core.series.Series
date = ts.index[5]
date
----------
Timestamp('2011-12-23 00:00:00')
ts[date]
----------
57.61857223510742
ts[5]
----------
57.61857223510742
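Positional access like ts[5] works, but the explicit accessors make the intent unambiguous; both of the following return the same value:
# Label-based access with the Timestamp from above, and position-based access.
ts.loc[date]
ts.iloc[5]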
We can also select multiple columns.
df[['Open', 'Close']].head()
----------
Open Close
Date
2006-10-02 10.728572 10.694285
2006-10-03 10.635715 10.582857
2006-10-04 10.585714 10.768572
2006-10-05 10.647142 10.690000
2006-10-06 10.631429 10.602858
New columns can also be added on the fly.
df['diff'] = df.Open - df.Close
df.head()
----------
High Low Open ... Volume Adj Close diff
Date ...
2006-10-02 10.838572 10.614285 10.728572 ... 178159800 9.283074 0.034286
2006-10-03 10.707143 10.455714 10.635715 ... 197677200 9.186352 0.052857
2006-10-04 10.780000 10.451428 10.585714 ... 207270700 9.347562 -0.182858
2006-10-05 10.880000 10.590000 10.647142 ... 170970800 9.279355 -0.042857
2006-10-06 10.720000 10.544286 10.631429 ... 116739700 9.203714 0.028571
[5 rows x 7 columns]
They can be deleted just as easily.
del df['diff']
df.head()
----------
High Low Open Close Volume Adj Close
Date
2006-10-02 10.838572 10.614285 10.728572 10.694285 178159800 9.283074
2006-10-03 10.707143 10.455714 10.635715 10.582857 197677200 9.186352
2006-10-04 10.780000 10.451428 10.585714 10.768572 207270700 9.347562
2006-10-05 10.880000 10.590000 10.647142 10.690000 170970800 9.279355
2006-10-06 10.720000 10.544286 10.631429 10.602858 116739700 9.203714
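As an aside, del mutates the DataFrame in place; drop is the non-destructive alternative. A small sketch that recreates the column and then removes it again, leaving df as it was:
# Non-destructive alternative to `del`: drop returns a new DataFrame without the column.
df['diff'] = df.Open - df.Close
df = df.drop(columns=['diff'])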
Common financial calculations
- Moving averages
- Returns
close_px = df['Adj Close']
mavg = close_px.rolling(40).mean()
mavg[-10:]
----------
Date
2011-12-16 48.202029
2011-12-19 48.168981
2011-12-20 48.138538
2011-12-21 48.134446
2011-12-22 48.128090
2011-12-23 48.123874
2011-12-27 48.128772
2011-12-28 48.122138
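Besides the simple rolling mean, pandas also offers an exponentially weighted variant, which gives more weight to recent prices. A small sketch using a span comparable to the 40-day window:
# Exponentially weighted moving average over roughly the same horizon.
ewma = close_px.ewm(span=40).mean()
ewma[-10:]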
Returns are defined as:
rets = close_px / close_px.shift(1) - 1
rets.head()
----------
Date
2006-10-02 NaN
2006-10-03 -0.010419
2006-10-04 0.017549
2006-10-05 -0.007297
2006-10-06 -0.008152
Name: Adj Close, dtype: float64
Or use .pct_change(), which gives exactly the same result.
close_px.pct_change().head()
----------
Date
2006-10-02 NaN
2006-10-03 -0.010419
2006-10-04 0.017549
2006-10-05 -0.007297
2006-10-06 -0.008152
Name: Adj Close, dtype: float64
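From the daily returns it is one more step to the cumulative return over the whole period. A minimal sketch (the leading NaN is replaced with 0 so the compounding starts cleanly):
# Compound the daily returns into a cumulative return series.
cum_rets = (1 + rets.fillna(0)).cumprod() - 1
cum_rets.tail()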
Plotting basics
Series and DataFrames have an associated .plot() method (which uses Matplotlib behind the scenes).
import matplotlib.pyplot as plt
close_px.plot(label='AAPL')
mavg.plot(label='mavg')
plt.legend()
----------
<matplotlib.legend.Legend at 0x182f3a8cac8>
plt.show()
(Figure: AAPL adjusted close with its 40-day moving average)
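If you want to keep the figure on disk rather than only showing it interactively, matplotlib can write it out before plt.show(); a small sketch (the filename is just an example):
# Rebuild the plot and save it to a file; savefig is called before plt.show() here.
close_px.plot(label='AAPL')
mavg.plot(label='mavg')
plt.legend()
plt.savefig('aapl_mavg.png', dpi=150, bbox_inches='tight')
plt.show()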
Optional: if there is time left…
import pandas as pd
import pandas_datareader as pdd
import datetime
df = pdd.data.get_data_yahoo(['AAPL', 'GE', 'GOOG', 'IBM', 'KO', 'MSFT', 'PEP'],
                             start=datetime.datetime(2010, 1, 1),
                             end=datetime.datetime(2013, 1, 1))['Adj Close']
df.head()
----------
Symbols AAPL GE GOOG ... KO MSFT PEP
Date ...
2009-12-31 26.131752 10.526512 308.832428 ... 19.278732 23.925440 44.622261
2010-01-04 26.538483 10.749147 312.204773 ... 19.292267 24.294369 44.945187
2010-01-05 26.584366 10.804806 310.829926 ... 19.058893 24.302216 45.488274
2010-01-06 26.161509 10.749147 302.994293 ... 19.052124 24.153070 45.033253
2010-01-07 26.113146 11.305734 295.940735 ... 19.004770 23.901886 44.747028
[5 rows x 7 columns]
import matplotlib.pyplot as plt
rets = df.pct_change()
plt.scatter(rets.PEP, rets.KO)
plt.xlabel('Returns PEP')
plt.ylabel('Returns KO')
----------
Text(0, 0.5, 'Returns KO')
plt.show()
(Figure: scatter plot of daily PEP returns vs. KO returns)
The pandas scatter matrix is actually a bit nicer. The call below returns an array of Axes objects; to display the figure, just call plt.show() as before.
from pandas.plotting import scatter_matrix
scatter_matrix(rets, diagonal='kde', figsize=(10, 10))
----------
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001E6FFEE55C0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001E6FFF3C7F0>,
...
<matplotlib.axes._subplots.AxesSubplot object at 0x000001E6816D85F8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001E681701B70>]],
dtype=object)
plt.show()
(Figure: scatter matrix of daily returns with KDE plots on the diagonal)
corr = rets.corr()
corr
----------
Symbols AAPL GE GOOG IBM KO MSFT PEP
Symbols
AAPL 1.000000 0.457677 0.533692 0.502091 0.389567 0.473377 0.308979
GE 0.457677 1.000000 0.463039 0.609821 0.594637 0.600791 0.534035
GOOG 0.533692 0.463039 1.000000 0.471161 0.401712 0.473264 0.330564
IBM 0.502091 0.609821 0.471161 1.000000 0.600568 0.646600 0.490903
KO 0.389567 0.594637 0.401712 0.600568 1.000000 0.530160 0.635659
MSFT 0.473377 0.600791 0.473264 0.646600 0.530160 1.000000 0.477452
PEP 0.308979 0.534035 0.330564 0.490903 0.635659 0.477452 1.000000
plt.imshow(corr, cmap='hot', interpolation='none')
plt.colorbar()
plt.xticks(range(len(corr)), corr.columns)
plt.yticks(range(len(corr)), corr.columns);
plt.show()
(Figure: heatmap of the return correlation matrix)
One thing we are often interested in is the relationship between expected return and the risk we take on; there is usually a trade-off between the two. Here plt.annotate is used to label the points on the scatter plot.
plt.scatter(rets.mean(), rets.std())
plt.xlabel('Expected returns')
plt.ylabel('Risk')
for label, x, y in zip(rets.columns, rets.mean(), rets.std()):
    plt.annotate(
        label,
        xy=(x, y), xytext=(20, -20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
plt.show()
----------
(Figure: risk vs. expected return scatter plot with annotated tickers)
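The daily means and standard deviations can also be annualized with the common 252-trading-day convention (an assumption on top of what the original shows); a sketch:
import numpy as np

# Annualize the per-symbol daily statistics: mean * 252, std * sqrt(252).
ann = pd.DataFrame({'ann_return': rets.mean() * 252,
                    'ann_vol': rets.std() * np.sqrt(252)})
ann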
Data alignment and handling missing values
- Join, merge, concatenate
- Filling NaNs
Let's fetch a few more securities from Yahoo Finance and build a DataFrame of adjusted closing prices.
series_list = []
securities = ['AAPL', 'GOOG', 'IBM', 'MSFT']
for security in securities:
    s = pd_data.data.get_data_yahoo(security, start=datetime.datetime(2011, 10, 1), end=datetime.datetime(2013, 1, 1))['Adj Close']
    s.name = security  # Rename the series to match the ticker
    series_list.append(s)
pd.concat can join multiple Series into a single DataFrame.
df = pd.concat(series_list, axis=1)
df.head()
----------
AAPL GOOG IBM MSFT
Date
2011-09-30 47.285904 256.558350 130.822800 20.321293
2011-10-03 46.452591 246.834808 129.640747 20.027370
2011-10-04 46.192177 250.012894 130.725555 20.688694
2011-10-05 46.905209 251.407669 132.304108 21.137737
2011-10-06 46.796078 256.393982 135.924973 21.505136
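The same table could also be built with DataFrame.join, which aligns on the Date index just like concat; a minimal sketch reusing series_list from above:
# Start from the first series and join the remaining ones on the shared Date index.
df_joined = series_list[0].to_frame().join(series_list[1:])
df_joined.head()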
import numpy as np
df.loc[df.index[0], 'AAPL'] = np.nan
df.loc[df.index[1], ['GOOG', 'IBM']] = np.nan
df.loc[df.index[[1, 2, 3]], 'MSFT'] = np.nan
df.head()
----------
AAPL GOOG IBM MSFT
Date
2011-09-30 NaN 256.558350 130.822800 20.321293
2011-10-03 46.452591 NaN NaN NaN
2011-10-04 46.192177 250.012894 130.725555 NaN
2011-10-05 46.905209 251.407669 132.304108 NaN
2011-10-06 46.796078 256.393982 135.924973 21.505136
pandas provides strong support for computations involving missing values.
(df.AAPL + df.GOOG).head()
----------
Date
2011-09-30 NaN
2011-10-03 NaN
2011-10-04 296.205070
2011-10-05 298.312878
2011-10-06 303.190060
dtype: float64
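If a missing price should be treated as zero in such an operation, the arithmetic methods accept a fill_value:
# With fill_value, a missing value on one side is replaced by 0;
# the result is NaN only where both sides are missing.
df.AAPL.add(df.GOOG, fill_value=0).head()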
A common way to impute missing values in a time series is forward fill.
df.ffill().head()
----------
AAPL GOOG IBM MSFT
Date
2011-09-30 NaN 256.558350 130.822800 20.321293
2011-10-03 46.452591 256.558350 130.822800 20.321293
2011-10-04 46.192177 250.012894 130.725555 20.321293
2011-10-05 46.905209 251.407669 132.304108 20.321293
2011-10-06 46.796078 256.393982 135.924973 21.505136
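Forward fill is only one of several strategies; pandas also offers backward fill, interpolation, and simply dropping incomplete rows:
# A few alternative ways to handle the same missing values.
df.bfill().head()          # propagate the next valid observation backwards
df.interpolate().head()    # linear interpolation between valid observations
df.dropna().head()         # drop any row that still contains a NaN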