Real-time Bitcoin price prediction with an LSTM, based on TensorFlow-GPU 2.0

The dataset comes from Kaggle: https://www.kaggle.com/mczielinski/bitcoin-historical-data#coinbaseUSD_1-min_data_2014-12-01_to_2019-01-09.csv

After downloading and unpacking the dataset, use the file coinbaseUSD_1-min_data_2014-12-01_to_2019-01-09.csv.

I first ran this code on the CPU: CPU usage sat at 100% and memory climbed to 15 GB. This version has been reworked to run on the GPU (a GTX 1080 Ti), which brings usage down to roughly 32% CPU, 12% GPU, and 6 GB of memory.

Bitcoin price data is a time series, so most Bitcoin price-prediction work uses an LSTM model.
Long Short-Term Memory (LSTM) is a deep-learning architecture particularly well suited to time-series data (or any data with a temporal/spatial/structural order, such as video or sentences), which makes it a natural choice for modeling cryptocurrency price movements.

In the corresponding conda environment, open Jupyter Notebook, create a new .ipynb file, and enter the code below.

Import the required libraries

import pandas as pd
import numpy as np
import tensorflow as tf

from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

from matplotlib import pyplot as plt
%matplotlib inline

Load the data

raw_data = pd.read_csv("coinbaseUSD_1-min_data_2014-12-01_to_2019-01-09.csv")

Inspect the raw data

raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2099760 entries, 0 to 2099759
Data columns (total 8 columns):
 #   Column             Dtype  
---  ------             -----  
 0   Timestamp          int64  
 1   Open               float64
 2   High               float64
 3   Low                float64
 4   Close              float64
 5   Volume_(BTC)       float64
 6   Volume_(Currency)  float64
 7   Weighted_Price     float64
dtypes: float64(7), int64(1)
memory usage: 128.2 MB

The dataset contains 2,099,760 rows with the columns Timestamp, Open, High, Low, Close, Volume_(BTC), Volume_(Currency), and Weighted_Price. Apart from Timestamp (int64), every column is of type float64.
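The Timestamp column holds Unix epoch seconds at one-minute spacing. For plotting or resampling it can help to convert it to datetimes; a minimal sketch using a few rows copied from the head of the file:

```python
import pandas as pd

# Toy frame mimicking the first rows of the Kaggle file:
# Unix epoch seconds, one row per minute.
df = pd.DataFrame({"Timestamp": [1417411980, 1417412040, 1417412100],
                   "Weighted_Price": [300.0, 300.0, 300.0]})

# Convert epoch seconds to UTC datetimes for a readable time axis.
df["Datetime"] = pd.to_datetime(df["Timestamp"], unit="s")
print(df["Datetime"].iloc[0])  # 2014-12-01 05:33:00
```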

Now look at the first 10 rows:

raw_data.head(10)



    Timestamp   Open   High    Low  Close  Volume_(BTC)  Volume_(Currency)  Weighted_Price
0  1417411980  300.0  300.0  300.0  300.0          0.01                3.0           300.0
1  1417412040    NaN    NaN    NaN    NaN           NaN                NaN             NaN
2  1417412100    NaN    NaN    NaN    NaN           NaN                NaN             NaN
3  1417412160    NaN    NaN    NaN    NaN           NaN                NaN             NaN
4  1417412220    NaN    NaN    NaN    NaN           NaN                NaN             NaN
5  1417412280    NaN    NaN    NaN    NaN           NaN                NaN             NaN
6  1417412340    NaN    NaN    NaN    NaN           NaN                NaN             NaN
7  1417412400  300.0  300.0  300.0  300.0          0.01                3.0           300.0
8  1417412460    NaN    NaN    NaN    NaN           NaN                NaN             NaN
9  1417412520    NaN    NaN    NaN    NaN           NaN                NaN             NaN

Drop every row that contains a NaN value and assign the result to data:

# Drop any row that contains a NaN value
data = raw_data.dropna(axis=0)
data.head(10)



       Timestamp    Open   High     Low  Close  Volume_(BTC)  Volume_(Currency)  Weighted_Price
0     1417411980  300.00  300.0  300.00  300.0      0.010000            3.00000      300.000000
7     1417412400  300.00  300.0  300.00  300.0      0.010000            3.00000      300.000000
51    1417415040  370.00  370.0  370.00  370.0      0.010000            3.70000      370.000000
77    1417416600  370.00  370.0  370.00  370.0      0.026556            9.82555      370.000000
1436  1417498140  377.00  377.0  377.00  377.0      0.010000            3.77000      377.000000
1766  1417517940  377.75  378.0  377.75  378.0      4.000000         1511.93750      377.984375
1771  1417518240  378.00  378.0  378.00  378.0      4.900000         1852.20000      378.000000
1772  1417518300  378.00  378.0  378.00  378.0      5.200000         1965.60000      378.000000
2230  1417545780  378.00  378.0  378.00  378.0      0.100000           37.80000      378.000000
2245  1417546680  378.00  378.0  378.00  378.0      0.793600          299.98080      378.000000

First, confirm that the data no longer contains any NaN values:

data.isnull().sum()
Timestamp            0
Open                 0
High                 0
Low                  0
Close                0
Volume_(BTC)         0
Volume_(Currency)    0
Weighted_Price       0
dtype: int64

As expected, no NaN values remain.

Next, check for zero values. A 0 in a price or volume column would be a bad tick that needs handling; here the check comes back False for every column:

(data == 0).astype(int).any()
Timestamp            False
Open                 False
High                 False
Low                  False
Close                False
Volume_(BTC)         False
Volume_(Currency)    False
Weighted_Price       False
dtype: bool

Should any zeros appear, they are handled by replacing them with NaN and forward-filling from the previous row:

# Work on an explicit copy so pandas does not warn about writing to a
# slice of raw_data (per-column inplace calls on the dropna() result
# trigger SettingWithCopyWarning).
data = data.copy()
cols = ['Open', 'High', 'Low', 'Close', 'Volume_(BTC)',
        'Volume_(Currency)', 'Weighted_Price']
# Turn zeros into NaN, then carry the last valid value forward.
data[cols] = data[cols].replace(0, np.nan).ffill()
(data == 0).astype(int).any()
Timestamp            False
Open                 False
High                 False
Low                  False
Close                False
Volume_(BTC)         False
Volume_(Currency)    False
Weighted_Price       False
dtype: bool
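The replace-then-forward-fill idea is easy to see on a toy series (the values below are made up):

```python
import pandas as pd
import numpy as np

# A hypothetical price column where a 0 marks a bad tick.
s = pd.Series([300.0, 0.0, 0.0, 370.0, 0.0])

# Zeros become NaN, then the last valid price is carried forward.
filled = s.replace(0, np.nan).ffill()
print(filled.tolist())  # [300.0, 300.0, 300.0, 370.0, 370.0]
```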

Now look at the distribution and trend of the data; the price curve is continuous at this point:

plt.plot(data['Weighted_Price'], label='Price')
plt.ylabel('Price')
plt.legend()
plt.show()

[Figure: Weighted_Price over time]

Splitting into training and test sets

Normalize the data to the range 0-1:

data_set = data.drop('Timestamp', axis=1).values
data_set = data_set.astype('float32')
mms = MinMaxScaler(feature_range=(0, 1))
data_set = mms.fit_transform(data_set)
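MinMaxScaler rescales each column to [0, 1] independently and remembers the per-column min and max, so the scaling can be undone later with inverse_transform. A small sketch on made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two made-up columns with different ranges.
x = np.array([[100.0, 1.0],
              [200.0, 3.0],
              [300.0, 5.0]], dtype='float32')

mms = MinMaxScaler(feature_range=(0, 1))
scaled = mms.fit_transform(x)
print(scaled[:, 0])  # [0.  0.5 1. ]

# inverse_transform recovers the original units.
restored = mms.inverse_transform(scaled)
print(restored[0])   # [100.   1.]
```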

Split the data into training and test sets at an 8:2 ratio:

ratio = 0.8
train_size = int(len(data_set) * ratio)
test_size = len(data_set) - train_size
train, test = data_set[0:train_size,:], data_set[train_size:len(data_set),:]

Create the training and test sets using a window of one time step (one row of the 1-minute data) to predict the next step's Weighted_Price.

def create_dataset(data):
    window = 1        # look back one time step
    label_index = 6   # column index of Weighted_Price
    x, y = [], []
    for i in range(len(data) - window):
        x.append(data[i:(i + window), :])          # `window` consecutive rows of features
        y.append(data[i + window, label_index])    # next row's Weighted_Price
    return np.array(x), np.array(y)
train_x, train_y = create_dataset(train)
test_x, test_y = create_dataset(test)
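The shapes this produces are easiest to see on toy data. Below is a parameterized copy of the function above, applied to made-up numbers; each x[i] is one row of 7 features and each y[i] is the next row's label column:

```python
import numpy as np

def create_dataset(data, window=1, label_index=6):
    # x[i]: `window` consecutive rows; y[i]: label column of the next row.
    x, y = [], []
    for i in range(len(data) - window):
        x.append(data[i:(i + window), :])
        y.append(data[i + window, label_index])
    return np.array(x), np.array(y)

# 5 made-up rows of 7 features each.
toy = np.arange(35, dtype='float32').reshape(5, 7)
x, y = create_dataset(toy)
print(x.shape, y.shape)  # (4, 1, 7) (4,)
print(y)                 # [13. 20. 27. 34.] - column 6 of rows 1..4
```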

The loss is the mean absolute error (MAE):

def create_model():
    model = Sequential()
    model.add(LSTM(50, input_shape=(train_x.shape[1], train_x.shape[2])))
    model.add(Dense(1))
    model.compile(loss='mae', optimizer='adam')
    model.summary()
    return model

model = create_model()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 50)                11600     
_________________________________________________________________
dense (Dense)                (None, 1)                 51        
=================================================================
Total params: 11,651
Trainable params: 11,651
Non-trainable params: 0
_________________________________________________________________
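The parameter counts in the summary can be checked by hand: an LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias; here the input has 7 features and the layer has 50 units:

```python
units, features = 50, 7

# Each of the 4 LSTM gates has an input kernel (units x features),
# a recurrent kernel (units x units), and a bias vector (units).
lstm_params = 4 * (units * features + units * units + units)

# Dense(1): one weight per unit plus a single bias.
dense_params = units * 1 + 1

print(lstm_params, dense_params)  # 11600 51
```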

To save time, train for only 20 epochs. With tensorflow-gpu 2.x (which bundles Keras), training runs on the GPU and is much faster:

history = model.fit(train_x, train_y, epochs=20, batch_size=64, validation_data=(test_x, test_y), verbose=1, shuffle=False)
Epoch 1/20
24884/24884 [==============================] - 76s 3ms/step - loss: 0.0017 - val_loss: 0.0303
Epoch 2/20
24884/24884 [==============================] - 76s 3ms/step - loss: 0.0014 - val_loss: 0.0188
Epoch 3/20
24884/24884 [==============================] - 76s 3ms/step - loss: 0.0011 - val_loss: 0.0135
Epoch 4/20
24884/24884 [==============================] - 76s 3ms/step - loss: 0.0012 - val_loss: 0.0145
Epoch 5/20
24884/24884 [==============================] - 76s 3ms/step - loss: 0.0011 - val_loss: 0.0128
Epoch 6/20
24884/24884 [==============================] - 77s 3ms/step - loss: 0.0011 - val_loss: 0.0136
Epoch 7/20
24884/24884 [==============================] - 77s 3ms/step - loss: 0.0011 - val_loss: 0.0135
Epoch 8/20
24884/24884 [==============================] - 76s 3ms/step - loss: 9.6527e-04 - val_loss: 0.0102
Epoch 9/20
24884/24884 [==============================] - 76s 3ms/step - loss: 8.4701e-04 - val_loss: 0.0083
Epoch 10/20
24884/24884 [==============================] - 76s 3ms/step - loss: 7.4637e-04 - val_loss: 0.0066
Epoch 11/20
24884/24884 [==============================] - 76s 3ms/step - loss: 6.7190e-04 - val_loss: 0.0059
Epoch 12/20
24884/24884 [==============================] - 76s 3ms/step - loss: 5.7592e-04 - val_loss: 0.0050
Epoch 13/20
24884/24884 [==============================] - 76s 3ms/step - loss: 5.3660e-04 - val_loss: 0.0053
Epoch 14/20
24884/24884 [==============================] - 76s 3ms/step - loss: 5.3742e-04 - val_loss: 0.0050
Epoch 15/20
24884/24884 [==============================] - 76s 3ms/step - loss: 5.2245e-04 - val_loss: 0.0053
Epoch 16/20
24884/24884 [==============================] - 76s 3ms/step - loss: 4.8314e-04 - val_loss: 0.0046
Epoch 17/20
24884/24884 [==============================] - 76s 3ms/step - loss: 4.8415e-04 - val_loss: 0.0054
Epoch 18/20
24884/24884 [==============================] - 76s 3ms/step - loss: 4.7891e-04 - val_loss: 0.0053
Epoch 19/20
24884/24884 [==============================] - 77s 3ms/step - loss: 4.6439e-04 - val_loss: 0.0048
Epoch 20/20
24884/24884 [==============================] - 76s 3ms/step - loss: 4.4422e-04 - val_loss: 0.0044
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show()

[Figure: training vs. validation loss]


Prediction

predict = model.predict(test_x)
plt.plot(predict, label='predict')
plt.plot(test_y, label='ground truth')
plt.legend()
plt.show()
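Note that both predict and test_y are still in the scaled [0, 1] space. To read the curves in dollars, the Weighted_Price column (index 6) can be un-scaled by hand from the fitted scaler's per-column statistics. A sketch with a hypothetical stand-in scaler (in the notebook you would reuse the mms fitted above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the fitted scaler: 7 columns, with the
# Weighted_Price column (index 6) ranging from 300 to 400.
train_like = np.array([[0.0] * 6 + [300.0],
                       [10.0] * 6 + [400.0]], dtype='float32')
mms = MinMaxScaler(feature_range=(0, 1)).fit(train_like)

# Made-up scaled predictions for the Weighted_Price column.
scaled_pred = np.array([0.0, 0.5, 1.0])

# Undo min-max scaling for one column: x = scaled * (max - min) + min.
col = 6
prices = scaled_pred * (mms.data_max_[col] - mms.data_min_[col]) + mms.data_min_[col]
print(prices)  # [300. 350. 400.]
```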

[Figure: predicted vs. actual test-set price]

This is only meant as a learning example for data analysis. The code is in my Gitee repository: https://gitee.com/rengarwang/LSTM-forecast-price