Real-time Bitcoin price prediction with an LSTM, based on TensorFlow-gpu 2.0
Uses the Kaggle dataset: https://www.kaggle.com/mczielinski/bitcoin-historical-data#coinbaseUSD_1-min_data_2014-12-01_to_2019-01-09.csv
After downloading and unpacking the dataset, use the file coinbaseUSD_1-min_data_2014-12-01_to_2019-01-09.csv.
The code was originally run on the CPU, which pushed CPU usage to 100% and memory to 15 GB. This improved version runs on a GPU (GTX 1080 Ti) instead, bringing usage down to roughly 32% CPU, 12% GPU, and 6 GB of memory.
Bitcoin price data is a time series, so Bitcoin price prediction is mostly done with LSTM models.
Long Short-Term Memory (LSTM) is a deep learning model particularly well suited to time-series data (or any data with a temporal/spatial/structural order, such as movies or sentences), which makes it a natural fit for predicting cryptocurrency price movements.
Open Jupyter Notebook in the corresponding conda environment, create a new .ipynb file, and enter the code below.
Import the required libraries
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from matplotlib import pyplot as plt
%matplotlib inline
Load the data
raw_data = pd.read_csv("coinbaseUSD_1-min_data_2014-12-01_to_2019-01-09.csv")
Inspect the raw data
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2099760 entries, 0 to 2099759
Data columns (total 8 columns):
# Column Dtype
--- ------ -----
0 Timestamp int64
1 Open float64
2 High float64
3 Low float64
4 Close float64
5 Volume_(BTC) float64
6 Volume_(Currency) float64
7 Weighted_Price float64
dtypes: float64(7), int64(1)
memory usage: 128.2 MB
The dataset contains 2,099,760 rows, with the columns Timestamp, Open, High, Low, Close, Volume_(BTC), Volume_(Currency), and Weighted_Price. Apart from Timestamp, every column is of type float64.
Now look at the first 10 rows
raw_data.head(10)
|   | Timestamp | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price |
|---|-----------|------|------|-----|-------|--------------|-------------------|----------------|
| 0 | 1417411980 | 300.0 | 300.0 | 300.0 | 300.0 | 0.01 | 3.0 | 300.0 |
| 1 | 1417412040 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 1417412100 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 1417412160 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 1417412220 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 1417412280 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | 1417412340 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | 1417412400 | 300.0 | 300.0 | 300.0 | 300.0 | 0.01 | 3.0 | 300.0 |
| 8 | 1417412460 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 1417412520 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
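The Timestamp values above are Unix epoch seconds. The column is dropped before modeling later on, but if you want human-readable dates (for plotting, say), pandas can convert them. A small illustration using the first two timestamps from the table:

```python
import pandas as pd

# Convert Unix epoch seconds (as in the Timestamp column) to datetimes
ts = pd.Series([1417411980, 1417412040])
dates = pd.to_datetime(ts, unit='s')
print(dates.iloc[0])  # 2014-12-01 05:33:00
```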
Drop every row that contains NaN values and assign the cleaned data to data.
# Drop any row containing NaN values; .copy() gives an independent
# DataFrame, so later column assignments do not raise SettingWithCopyWarning
data = raw_data.dropna(axis=0).copy()
data.head(10)
|   | Timestamp | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price |
|---|-----------|------|------|-----|-------|--------------|-------------------|----------------|
| 0 | 1417411980 | 300.00 | 300.0 | 300.00 | 300.0 | 0.010000 | 3.00000 | 300.000000 |
| 7 | 1417412400 | 300.00 | 300.0 | 300.00 | 300.0 | 0.010000 | 3.00000 | 300.000000 |
| 51 | 1417415040 | 370.00 | 370.0 | 370.00 | 370.0 | 0.010000 | 3.70000 | 370.000000 |
| 77 | 1417416600 | 370.00 | 370.0 | 370.00 | 370.0 | 0.026556 | 9.82555 | 370.000000 |
| 1436 | 1417498140 | 377.00 | 377.0 | 377.00 | 377.0 | 0.010000 | 3.77000 | 377.000000 |
| 1766 | 1417517940 | 377.75 | 378.0 | 377.75 | 378.0 | 4.000000 | 1511.93750 | 377.984375 |
| 1771 | 1417518240 | 378.00 | 378.0 | 378.00 | 378.0 | 4.900000 | 1852.20000 | 378.000000 |
| 1772 | 1417518300 | 378.00 | 378.0 | 378.00 | 378.0 | 5.200000 | 1965.60000 | 378.000000 |
| 2230 | 1417545780 | 378.00 | 378.0 | 378.00 | 378.0 | 0.100000 | 37.80000 | 378.000000 |
| 2245 | 1417546680 | 378.00 | 378.0 | 378.00 | 378.0 | 0.793600 | 299.98080 | 378.000000 |
First, check whether the data still contains any NaN values:
data.isnull().sum()
Timestamp 0
Open 0
High 0
Low 0
Close 0
Volume_(BTC) 0
Volume_(Currency) 0
Weighted_Price 0
dtype: int64
As the output shows, no NaN values remain.
Next, check whether any column contains zero values; zeros in the price or volume columns would need to be handled before modeling.
(data == 0).astype(int).any()
Timestamp False
Open False
High False
Low False
Close False
Volume_(BTC) False
Volume_(Currency) False
Weighted_Price False
dtype: bool
Zeros are handled by replacing them with the previous row's value (forward fill).
# Replace zeros with NaN, then forward-fill each column from the
# previous valid value. Working on a .copy() avoids the
# SettingWithCopyWarning that per-column inplace fills would trigger.
data = data.copy()
for col in ['Weighted_Price', 'Open', 'High', 'Low', 'Close',
            'Volume_(BTC)', 'Volume_(Currency)']:
    data[col] = data[col].replace(0, np.nan).ffill()
(data == 0).astype(int).any()
Timestamp False
Open False
High False
Low False
Close False
Volume_(BTC) False
Volume_(Currency) False
Weighted_Price False
dtype: bool
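The replace-zeros-then-forward-fill step can be seen on a tiny toy series (illustrative values, not from the dataset):

```python
import numpy as np
import pandas as pd

# A toy price column with spurious zero entries
s = pd.Series([300.0, 0.0, 0.0, 305.0, 0.0])

# Turn zeros into NaN, then carry the last valid price forward
filled = s.replace(0, np.nan).ffill()
print(filled.tolist())  # [300.0, 300.0, 300.0, 305.0, 305.0]
```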
Now look at the distribution and trend of the data; at this point the curve is already very continuous.
plt.plot(data['Weighted_Price'], label='Price')
plt.ylabel('Price')
plt.legend()
plt.show()
Splitting into training and test sets
Normalize the data to the range 0 to 1
data_set = data.drop('Timestamp', axis=1).values
data_set = data_set.astype('float32')
mms = MinMaxScaler(feature_range=(0, 1))
data_set = mms.fit_transform(data_set)
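One caveat worth noting: the scaler is fitted on all seven columns at once, so a single predicted column (the scaled Weighted_Price) cannot be passed to mms.inverse_transform directly. The scaler's per-column data_min_ and data_max_ attributes let you invert one column by hand; a sketch on toy two-column data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the multi-column data_set
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]], dtype='float32')
mms = MinMaxScaler(feature_range=(0, 1))
scaled = mms.fit_transform(toy)

# Invert the scaling for one column using its stored min/max
col = 1
restored = scaled[:, col] * (mms.data_max_[col] - mms.data_min_[col]) + mms.data_min_[col]
print(restored)  # [10. 20. 30.]
```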
Split the data into training and test sets at an 8:2 ratio
ratio = 0.8
train_size = int(len(data_set) * ratio)
test_size = len(data_set) - train_size
train, test = data_set[0:train_size,:], data_set[train_size:len(data_set),:]
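As a quick sanity check on the split arithmetic, with an illustrative row count (the real post-cleaning count differs):

```python
# 80/20 split arithmetic on an illustrative row count
n = 1_990_000
train_size = int(n * 0.8)
test_size = n - train_size
print(train_size, test_size)  # 1592000 398000
```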
Create the training and test sets, using a window of one time step (one row of the minute-level data) to build the samples.
def create_dataset(data):
    window = 1
    label_index = 6  # Weighted_Price column
    x, y = [], []
    for i in range(len(data) - window):
        x.append(data[i:(i + window), :])
        y.append(data[i + window, label_index])
    return np.array(x), np.array(y)
train_x, train_y = create_dataset(train)
test_x, test_y = create_dataset(test)
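To make the resulting shapes concrete, here is the same windowing logic run on a tiny toy array of 5 samples and 7 features (illustrative numbers, not the real data):

```python
import numpy as np

def create_dataset(data, window=1, label_index=6):
    # Each sample x[i] is a window of rows; the label y[i] is
    # column label_index of the row right after the window
    x, y = [], []
    for i in range(len(data) - window):
        x.append(data[i:(i + window), :])
        y.append(data[i + window, label_index])
    return np.array(x), np.array(y)

toy = np.arange(35, dtype='float32').reshape(5, 7)  # 5 rows, 7 columns
x, y = create_dataset(toy)
print(x.shape, y.shape)  # (4, 1, 7) (4,)
print(y)                 # [13. 20. 27. 34.]
```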
Build the model; the loss is mean absolute error (MAE).
def create_model():
    model = Sequential()
    model.add(LSTM(50, input_shape=(train_x.shape[1], train_x.shape[2])))
    model.add(Dense(1))
    model.compile(loss='mae', optimizer='adam')
    model.summary()
    return model
model = create_model()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm (LSTM) (None, 50) 11600
_________________________________________________________________
dense (Dense) (None, 1) 51
=================================================================
Total params: 11,651
Trainable params: 11,651
Non-trainable params: 0
_________________________________________________________________
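The parameter counts in the summary can be verified by hand: an LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias vector:

```python
# LSTM: 4 gates x (input kernel + recurrent kernel + bias)
units, features = 50, 7
lstm_params = 4 * (features * units + units * units + units)
# Dense: one weight per LSTM unit plus a bias
dense_params = units * 1 + 1
print(lstm_params, dense_params)  # 11600 51
```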
To save time, we train for only 20 epochs. With tensorflow-gpu 2.x (which bundles Keras), training runs on the GPU and is much faster.
history = model.fit(train_x, train_y, epochs=20, batch_size=64, validation_data=(test_x, test_y), verbose=1, shuffle=False)
Epoch 1/20
24884/24884 [==============================] - 76s 3ms/step - loss: 0.0017 - val_loss: 0.0303
Epoch 2/20
24884/24884 [==============================] - 76s 3ms/step - loss: 0.0014 - val_loss: 0.0188
Epoch 3/20
24884/24884 [==============================] - 76s 3ms/step - loss: 0.0011 - val_loss: 0.0135
Epoch 4/20
24884/24884 [==============================] - 76s 3ms/step - loss: 0.0012 - val_loss: 0.0145
Epoch 5/20
24884/24884 [==============================] - 76s 3ms/step - loss: 0.0011 - val_loss: 0.0128
Epoch 6/20
24884/24884 [==============================] - 77s 3ms/step - loss: 0.0011 - val_loss: 0.0136
Epoch 7/20
24884/24884 [==============================] - 77s 3ms/step - loss: 0.0011 - val_loss: 0.0135
Epoch 8/20
24884/24884 [==============================] - 76s 3ms/step - loss: 9.6527e-04 - val_loss: 0.0102
Epoch 9/20
24884/24884 [==============================] - 76s 3ms/step - loss: 8.4701e-04 - val_loss: 0.0083
Epoch 10/20
24884/24884 [==============================] - 76s 3ms/step - loss: 7.4637e-04 - val_loss: 0.0066
Epoch 11/20
24884/24884 [==============================] - 76s 3ms/step - loss: 6.7190e-04 - val_loss: 0.0059
Epoch 12/20
24884/24884 [==============================] - 76s 3ms/step - loss: 5.7592e-04 - val_loss: 0.0050
Epoch 13/20
24884/24884 [==============================] - 76s 3ms/step - loss: 5.3660e-04 - val_loss: 0.0053
Epoch 14/20
24884/24884 [==============================] - 76s 3ms/step - loss: 5.3742e-04 - val_loss: 0.0050
Epoch 15/20
24884/24884 [==============================] - 76s 3ms/step - loss: 5.2245e-04 - val_loss: 0.0053
Epoch 16/20
24884/24884 [==============================] - 76s 3ms/step - loss: 4.8314e-04 - val_loss: 0.0046
Epoch 17/20
24884/24884 [==============================] - 76s 3ms/step - loss: 4.8415e-04 - val_loss: 0.0054
Epoch 18/20
24884/24884 [==============================] - 76s 3ms/step - loss: 4.7891e-04 - val_loss: 0.0053
Epoch 19/20
24884/24884 [==============================] - 77s 3ms/step - loss: 4.6439e-04 - val_loss: 0.0048
Epoch 20/20
24884/24884 [==============================] - 76s 3ms/step - loss: 4.4422e-04 - val_loss: 0.0044
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show()
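Since val_loss is still noisy across epochs, one possible refinement (not in the original notebook) is to let Keras stop once the validation loss stops improving and restore the best weights, using the standard EarlyStopping callback:

```python
import tensorflow as tf

# Stop training once val_loss has not improved for 3 epochs,
# and roll back to the best weights seen so far
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=3, restore_best_weights=True)

# Hypothetical usage with the arrays defined earlier:
# history = model.fit(train_x, train_y, epochs=100, batch_size=64,
#                     validation_data=(test_x, test_y),
#                     callbacks=[early_stop], shuffle=False)
```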
Prediction
predict = model.predict(test_x)
plt.plot(predict, label='predict')
plt.plot(test_y, label='ground truth')
plt.legend()
plt.show()
This is only intended as a learning example for data analysis. The code is available in my Gitee repository: https://gitee.com/rengarwang/LSTM-forecast-price