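Text Classification with a Bidirectional LSTM and Attention
How it works
TextAttBiRNN improves on the bidirectional LSTM text classification model, mainly by introducing an attention mechanism. Over the representation vectors encoded by the bidirectional LSTM, attention lets the model focus on the information most relevant to the decision. The attention mechanism was first proposed in the paper Neural Machine Translation by Jointly Learning to Align and Translate; the implementation here follows the paper Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems.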
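References on the attention mechanism
- Attention Models in Deep Learning (深度学习中的注意力模型)
- The Attention Mechanism in Deep Learning (深度学习注意力机制)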
Network structure
The attention weight vector α is a probability vector: its entries are non-negative and sum to 1.
In the paper Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems, attention is simplified to a feed-forward form.
The function a is learnable and is implemented as a feed-forward network. In this formulation, attention can be seen as producing a fixed-length embedding c of the input sequence by computing an adaptive weighted average of the state sequence h.
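For reference, the feed-forward attention of that paper can be written as the three equations below. Here h_t is the hidden state at timestep t, e_t is its score, α_t is its normalized weight, and T denotes the sequence length.

```latex
\begin{align}
e_t &= a(h_t) \\
\alpha_t &= \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)} \\
c &= \sum_{t=1}^{T} \alpha_t h_t
\end{align}
```

The second equation is a softmax over the timesteps, which is why the weights α form a probability vector.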
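Implementation in this article
The TextAttBiRNN network is structured as follows:
Input layer
As before, the input layer takes sequences of fixed length maxlen=400. Each word id is looked up in an embedding matrix with embedding_dim=50, so the output is a 400×50 representation matrix.
Bi-LSTM
The Bi-LSTM layer acts as a feature encoder. It extracts contextual features for every word in both directions (128 units each), concatenates the two directions, and still outputs one feature vector per word, so the output is a 400×256 feature matrix.
Attention layer
The attention layer computes a weighted sum over the words, with weights that are learned during training. The input x is 400×256 and the weight vector W is initialized as 256×1. Multiplying x by W, adding the bias, applying tanh (an extra step in the code) and normalizing with a softmax gives a 400×1 vector a; this corresponds to the first two steps of the feed-forward attention formulation, scoring and normalization. Each entry of a is the weight of one word: a larger weight means more attention and a larger contribution from that word. Finally the words are averaged with these weights: a^T multiplied by x gives a 1×256 vector, the overall feature vector output by the layer.
Output layer
As in the earlier experiments, a fully connected layer with softmax activation produces the output.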
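The computation described for the attention layer can be sketched for a single sample in NumPy. This is only an illustration under assumed random inputs, not the Keras layer itself; the function name feed_forward_attention and the random data are made up for the example.

```python
import numpy as np

def feed_forward_attention(x, W, b):
    """x: (steps, features), W: (features,), b: (steps,). Returns (context, weights)."""
    e = np.tanh(x @ W + b)   # one score per timestep, shape (steps,)
    a = np.exp(e)
    a = a / a.sum()          # normalize so the weights form a probability vector
    c = a @ x                # weighted average of the timesteps, shape (features,)
    return c, a

# Hypothetical inputs with the shapes used in this article: 400 steps, 256 features
rng = np.random.default_rng(0)
steps, features = 400, 256
x = rng.normal(size=(steps, features))
W = rng.normal(scale=0.1, size=(features,))
b = np.zeros(steps)

c, a = feed_forward_attention(x, W, b)
print(c.shape, a.shape)
```

The 400 weights in a sum to 1, and the 400×256 input collapses to a single 256-dimensional vector c.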
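Defining the network structure
Keras does not ship this feed-forward attention as a ready-made layer, so a custom Attention layer is defined by subclassing Keras's Layer.
The official documentation for tf.keras.layers.Layer recommends that every subclass of tf.keras.layers.Layer implement three methods: __init__(), build() and call().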
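- __init__(): stores the configuration as member variables, i.e. initializes the Layer.
- build(): defines the weights. It is called once, the first time call() runs, at which point the shape of the input is known. It is also a kind of initialization, but __init__() can only set up things that do not depend on the input (such as the output size), whereas the input shape has to be obtained dynamically in build(). This is why build() is needed in addition to __init__().
- call(): defines what the layer computes and runs whenever the layer is called. If your layer does not need to support masking, you only need to care about the first argument of call: the input tensor.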
There is also the compute_output_shape method:
- compute_output_shape(input_shape): if your layer changes the shape of its input, specify here how the shape changes, so that Keras can infer shapes automatically.
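The division of work between these methods can be mimicked without Keras. The class below is a framework-free toy (ToyLayer is a made-up name, not a Keras API) that only imitates the build-on-first-call pattern, to show why weights are created in build() rather than in __init__().

```python
import numpy as np

class ToyLayer:
    """Framework-free stand-in that imitates the Keras build-on-first-call pattern."""

    def __init__(self, units):
        # Only input-independent configuration is stored here.
        self.units = units
        self.built = False

    def build(self, input_shape):
        # The input feature size is known only now, so the weights are created here.
        self.W = np.zeros((input_shape[-1], self.units))
        self.b = np.zeros(self.units)
        self.built = True

    def call(self, x):
        # The actual computation of the layer.
        return x @ self.W + self.b

    def __call__(self, x):
        if not self.built:   # build() runs once, on first use
            self.build(x.shape)
        return self.call(x)

    def compute_output_shape(self, input_shape):
        # The last axis changes from the input feature size to `units`.
        return tuple(input_shape[:-1]) + (self.units,)

layer = ToyLayer(units=3)
y = layer(np.ones((2, 5)))
print(layer.W.shape, y.shape, layer.compute_output_shape((2, 5)))
# -> (5, 3) (2, 3) (2, 3)
```

The weight matrix gets its first dimension (5) from the data passed on the first call, which __init__() could not have known.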
W and b are created in build(), and the computation is done in call().
from tensorflow.keras import backend as K
# from tensorflow.python.keras import backend as K
from tensorflow.keras import initializers, regularizers, constraints
from tensorflow.keras.layers import Layer
# from keras.engine.topology import Layer


class Attention(Layer):
    '''
    Returns:
        Not the attention weights, but the vector obtained by multiplying
        each timestep by its weight and summing over the timesteps.
    Input:
        step_dim is the number of RNN timesteps, i.e. the length of the
        longest (padded) input sequence.
    '''

    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):  # **kwargs forwards extra keyword arguments to the parent Layer
        """
        Keras Layer that implements an Attention mechanism for temporal data.
        Supports Masking.
        Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756]
        # Input shape
            3D tensor with shape: `(samples, steps, features)`.
        # Output shape
            2D tensor with shape: `(samples, features)`.
        :param kwargs:
        Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
        The feature dimension is inferred from the output shape of the RNN;
        step_dim must equal its number of timesteps.
        Example:
            # 1
            model.add(LSTM(64, return_sequences=True))
            model.add(Attention(step_dim))
            # next add a Dense layer (for classification/regression) or whatever...
            # 2
            hidden = LSTM(64, return_sequences=True)(words)
            sentence = Attention(step_dim)(hidden)
            # next add a Dense layer (for classification/regression) or whatever...
        """
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')
        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)
        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)
        self.bias = bias
        self.step_dim = step_dim  # the maxlen passed in later from TextAttBiRNN
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)
    def build(self, input_shape):
        # The input must be 3D: (samples, steps, features)
        assert len(input_shape) == 3
        # add_weight() is inherited from Layer and creates a trainable variable.
        # The input is the hidden sequence x with shape (None, 400, 256);
        # W has shape (256,) and is reshaped to (256, 1) in call().
        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]
        if self.bias:
            # One bias per timestep, shape (400,)
            self.b = self.add_weight(shape=(input_shape[1],),
                                     initializer='zeros',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None
        self.built = True

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        # (later layers no longer need it, so None can be returned directly)
        return None
    def call(self, x, mask=None):
        features_dim = self.features_dim
        # step_dim is the parameter we pass in; it equals input_shape[1], the number of RNN timesteps
        step_dim = self.step_dim
        # After reshaping the input and the weights and taking the dot product, the tensor has
        # shape (batch_size*timesteps, 1). Normalization must be done per sample, so the
        # result is reshaped back to (batch_size, timesteps).
        # Conceptually this is e = K.dot(x, self.W)
        e = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                            K.reshape(self.W, (features_dim, 1))),
                      (-1, step_dim))
        if self.bias:
            e += self.b
        # RNNs usually default to tanh; the choice of activation matters little here
        # because a softmax follows
        e = K.tanh(e)
        a = K.exp(e)
        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # If an earlier layer produced a mask, the masked timesteps must not contribute
            # to the output, so their attention weights are set to 0
            # cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())
        # in some cases especially in the early stages of training the sum may be almost zero
        # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
        # K.cast converts the dtype so that the division is done in floatx
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        # K.expand_dims defaults to axis=-1, i.e. it appends a dimension at the end,
        # e.g. shape (3,) becomes (3, 1)
        a = K.expand_dims(a)
        # Now a.shape = (batch_size, timesteps, 1) and x.shape = (batch_size, timesteps, units)
        # a*x has shape (batch_size, timesteps, units): each timestep's vector is scaled by its weight
        # Summing a*x over axis=1 gives shape (batch_size, units)
        c = K.sum(a * x, axis=1)
        return c

    def compute_output_shape(self, input_shape):
        # The result c has shape (batch_size, units)
        return input_shape[0], self.features_dim
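The masking branch in call() can be illustrated on its own with NumPy. This is a minimal sketch: the function name masked_attention_weights, the scores and the mask are made up, and eps=1e-7 is assumed to match the default of K.epsilon(). Note also that the model in this article creates its Embedding without mask_zero=True, so in practice mask would be None and this branch would not run.

```python
import numpy as np

def masked_attention_weights(e, mask, eps=1e-7):
    """e: (batch, steps) raw scores; mask: (batch, steps) of 0/1. Returns the weights."""
    a = np.exp(np.tanh(e))
    a = a * mask                                       # masked timesteps get weight 0
    return a / (a.sum(axis=1, keepdims=True) + eps)    # re-normalize over the kept steps

# Hypothetical scores for one sample with 4 timesteps; the first two are padding
e = np.array([[0.2, -1.0, 0.5, 0.0]])
mask = np.array([[0.0, 0.0, 1.0, 1.0]])

a = masked_attention_weights(e, mask)
print(a)
```

The padded positions end up with weight exactly 0, and the remaining weights sum to (almost exactly) 1 because the normalization happens after the mask is applied.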
Next, build the TextAttBiRNN class.
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, Dense, Dropout, Bidirectional, LSTM


class TextAttBiRNN(object):
    def __init__(self, maxlen, max_features, embedding_dims,
                 class_num=5,
                 last_activation='softmax'):
        self.maxlen = maxlen
        self.max_features = max_features
        self.embedding_dims = embedding_dims
        self.class_num = class_num
        self.last_activation = last_activation

    def get_model(self):
        input = Input((self.maxlen,))
        embedding = Embedding(self.max_features, self.embedding_dims, input_length=self.maxlen)(input)
        x = Bidirectional(LSTM(128, return_sequences=True))(embedding)  # LSTM or GRU
        x = Attention(self.maxlen)(x)
        output = Dense(self.class_num, activation=self.last_activation)(x)
        model = Model(inputs=input, outputs=output)
        return model
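As a rough sanity check on the size of this model, the trainable parameter count can be worked out by hand. The sketch below assumes the configuration used in this article (max_features=40001, embedding_dims=50, maxlen=400, 128 LSTM units, 5 classes) and a standard LSTM with biases; the helper lstm_params is made up for the illustration.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

max_features, embedding_dims, maxlen = 40001, 50, 400
units, class_num = 128, 5

embedding = max_features * embedding_dims            # embedding matrix
bilstm = 2 * lstm_params(embedding_dims, units)      # forward + backward LSTM
attention = 2 * units + maxlen                       # W: (256,), b: (400,)
dense = 2 * units * class_num + class_num            # weights + biases

total = embedding + bilstm + attention + dense
print(embedding, bilstm, attention, dense, total)
# -> 2000050 183296 656 1285 2185287
```

The attention layer adds only 656 parameters; almost all of the parameters sit in the embedding matrix.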
Data processing and training
The input is prepared in the usual way, the same as for TextCNN and TextRNN.
import os
import random
import sys

from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.utils import to_categorical

sys.path.append('../data/lesson2_data')
from utils import *

# Paths and related settings
data_dir = "../data/lesson2_data/data"
vocab_file = "../data/lesson2_data/vocab/vocab.txt"
vocab_size = 40000

# Neural network settings
max_features = 40001
maxlen = 400
batch_size = 128
embedding_dims = 50
epochs = 10

print('Preprocessing and loading data...')
# Rebuild the vocabulary if it does not exist
if not os.path.exists(vocab_file):
    build_vocab(data_dir, vocab_file, vocab_size)

# Get the word/category to id mapping dictionaries
categories, cat_to_id = read_category()
words, word_to_id = read_vocab(vocab_file)

# All the data
x, y = read_files(data_dir)
data = list(zip(x, y))
del x, y

# Shuffle
random.shuffle(data)

# Split into training and test sets
train_data, test_data = train_test_split(data)

# Encode the texts as word ids and the labels as category ids
x_train = encode_sentences([content[0] for content in train_data], word_to_id)
y_train = to_categorical(encode_cate([content[1] for content in train_data], cat_to_id))
x_test = encode_sentences([content[0] for content in test_data], word_to_id)
y_test = to_categorical(encode_cate([content[1] for content in test_data], cat_to_id))

print('Pad the sequences so that the data has shape samples*timestep')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Building the model...')
model = TextAttBiRNN(maxlen, max_features, embedding_dims).get_model()
model.compile('adam', 'categorical_crossentropy', metrics=['accuracy'])

print('Train...')
early_stopping = EarlyStopping(monitor='val_accuracy', patience=2, mode='max')
# Note: the test split doubles as the validation set here
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    callbacks=[early_stopping],
                    validation_data=(x_test, y_test))

print('Test...')
result = model.predict(x_test)
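The padding step in the script above relies on the defaults of pad_sequences, which pad and truncate at the front of each sequence and fill with 0. The pure-Python function below (pad_pre is a made-up name, not part of Keras) is a small sketch of that default behaviour for a single sequence, assuming maxlen is at least 1.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one id sequence to exactly maxlen items."""
    seq = list(seq)[-maxlen:]                     # keep only the last maxlen ids
    return [value] * (maxlen - len(seq)) + seq    # fill the front with `value`

short = pad_pre([5, 8, 2], 5)
long_ = pad_pre([1, 2, 3, 4, 5, 6], 4)
print(short)   # -> [0, 0, 5, 8, 2]
print(long_)   # -> [3, 4, 5, 6]
```

With front padding, the real words of a short text sit at the end of the 400 timesteps, and for a long text only its last 400 word ids are kept.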
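Summary: training did not finish; the BiLSTM with attention trains even more slowly.
References:
- Implementing an attention mechanism for text classification in Keras (Keras实现用于文本分类的attention机制)
- [Keras] Writing custom network layers with Keras ([Keras] 使用Keras编写自定义网络层(layer))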