NLP attention visualization: LSTM attention model

Reposted

mob64ca1400bfa8 2024-06-03 12:02:35


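Text classification with a bidirectional LSTM network and attention

How it works

TextAttBiRNN improves on the bidirectional LSTM text classification model, mainly by introducing an attention mechanism. Given the representation vectors encoded by the bidirectional LSTM, the attention mechanism lets the model focus on the information most relevant to the decision. The attention mechanism was first proposed in the paper Neural Machine Translation by Jointly Learning to Align and Translate; the implementation here follows the paper Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems.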

Attention mechanism references

Attention models in deep learning
The attention mechanism in deep learning

Network structure

[Figure: network structure of the attention mechanism]

$\alpha$, the vector of attention weights, is a probability vector.

In the paper Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems, the feed forward attention is simplified as follows,

$$e_t = a(h_t), \qquad \alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}, \qquad c = \sum_{t=1}^{T} \alpha_t h_t$$

Function a, a learnable function, is recognized as a feed forward network. In this formulation, attention can be seen as producing a fixed-length embedding c of the input sequence by computing an adaptive weighted average of the state sequence h.

Implementation in this article

The network structure of TextAttBiRNN:

[Figure: TextAttBiRNN network structure]

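Input layer

As before, the input layer takes a sequence of fixed length (maxlen = 400). Each word is mapped through an embedding matrix with embedding_dims = 50, so the layer outputs a 400×50 representation matrix.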
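The lookup described above can be sketched in plain Python. This is only an illustration of the shapes involved, not the Keras Embedding layer itself: the helper `embed`, the random table, and the vocabulary size of 1000 (shrunk from the article's 40001 to keep the demo small) are assumptions made for the example.

```python
import random

def embed(token_ids, table):
    """Return one embedding row per token id (a plain table lookup)."""
    return [table[i] for i in token_ids]

random.seed(0)
# Hypothetical demo sizes: the vocabulary is shrunk, maxlen and
# embedding_dims match the values used in the article.
vocab_size, embedding_dims, maxlen = 1000, 50, 400

# One row of embedding_dims floats per vocabulary entry.
table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dims)]
         for _ in range(vocab_size)]

# A padded sentence is just maxlen integer ids.
token_ids = [random.randrange(vocab_size) for _ in range(maxlen)]

matrix = embed(token_ids, table)
print(len(matrix), len(matrix[0]))  # rows = maxlen, columns = embedding_dims
```

In the real model the table is a trainable weight, so the rows change during training; the shape of the output stays maxlen × embedding_dims.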

Bi-LSTM

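The Bi-LSTM layer acts as a feature encoder. It extracts the contextual features of every word in both directions, concatenates the forward and backward features (128 units each), and still emits one feature vector per word, so its output is a 400×256 feature matrix.

Attention layer

The Attention layer computes a weighted sum over the words, with weights that are learned during training. The input x to this layer is 400×256 and the weight matrix W is initialized as 256×1. Multiplying x by W and normalizing the result (the first two of the formulas; the code also applies tanh before normalizing) gives a 400×1 matrix a that holds one weight per word. A word with a larger weight receives more attention and contributes more to the result. Finally the words are averaged with these weights: the product of aT and x is a 1×256 vector, which is the final weighted-average feature vector of the whole text.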
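To make the concatenation step concrete, here is a small sketch. It deliberately uses a toy tanh recurrence instead of a real LSTM cell (the function `toy_rnn` and its update rule are invented for this example), because the point is only how the forward and backward passes are aligned and joined per time step.

```python
import math

def toy_rnn(seq, hidden):
    """Toy recurrence h_t = tanh(mean(x_t) + 0.5 * h_{t-1}). NOT an LSTM."""
    h = [0.0] * hidden
    states = []
    for x_t in seq:
        drive = sum(x_t) / len(x_t)
        h = [math.tanh(drive + 0.5 * h_i) for h_i in h]
        states.append(h)
    return states

def bidirectional(seq, hidden):
    """Run the toy cell in both directions and concatenate per time step."""
    fwd = toy_rnn(seq, hidden)
    # Run on the reversed sequence, then flip back so index t lines up again.
    bwd = toy_rnn(seq[::-1], hidden)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]  # list concatenation

# 400 time steps of 50-dimensional inputs, 128 hidden units per direction.
seq = [[0.01 * t] * 50 for t in range(400)]
out = bidirectional(seq, 128)
print(len(out), len(out[0]))  # time steps, 2 * hidden
```

The first half of each output row has only seen the words up to that position, while the second half has seen the words from that position to the end, which is why the concatenated vector carries context from both sides.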
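The steps in this paragraph can be traced with a short plain-Python sketch for a single sample. It is an illustration under assumptions, not the layer's actual code: `attention_pool` is a name made up here, the inputs are random numbers, and the order of operations (score, tanh, softmax over time steps, weighted sum) follows the description above.

```python
import math
import random

def attention_pool(x, w, b):
    """x: T rows of D features, w: D weights, b: T biases. Returns (a, c)."""
    # Score per time step: e_t = tanh(x_t . w + b_t)
    e = [math.tanh(sum(xi * wi for xi, wi in zip(row, w)) + bt)
         for row, bt in zip(x, b)]
    # Softmax over the time steps turns the scores into weights.
    exp_e = [math.exp(v) for v in e]
    total = sum(exp_e)
    a = [v / total for v in exp_e]
    # Weighted sum of the rows: c_j = sum_t a_t * x[t][j]
    c = [sum(a[t] * x[t][j] for t in range(len(x))) for j in range(len(w))]
    return a, c

random.seed(0)
T, D = 400, 256  # time steps and Bi-LSTM feature size from the article
x = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
w = [random.uniform(-0.1, 0.1) for _ in range(D)]
b = [0.0] * T

a, c = attention_pool(x, w, b)
print(len(a), len(c))  # one weight per word, one pooled feature vector
```

Because tanh bounds every score to the range (-1, 1), no single word can receive a weight more than about e² ≈ 7.4 times that of another word in this formulation.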

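Output layer

As in the earlier experiments, the output is produced by a fully connected layer with softmax as the activation function.

Defining the network

Keras does not provide this feed-forward attention as a built-in layer, so a custom Attention layer is defined here by subclassing Keras's Layer.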
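As a rough sketch of what this output layer computes for one sample: a linear map from the 256 pooled features to one logit per class, followed by softmax. The helper names and the random weights are assumptions for the example, and the kernel is stored here as one row per class, which is the transpose of how Keras's Dense stores it.

```python
import math
import random

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [v / s for v in exps]

def dense_softmax(features, kernel, bias):
    """kernel holds one weight row per class (hypothetical layout)."""
    logits = [sum(f * k for f, k in zip(features, row)) + b
              for row, b in zip(kernel, bias)]
    return softmax(logits)

random.seed(0)
feature_dim, class_num = 256, 5
features = [random.gauss(0.0, 1.0) for _ in range(feature_dim)]
kernel = [[random.uniform(-0.1, 0.1) for _ in range(feature_dim)]
          for _ in range(class_num)]
bias = [0.0] * class_num

probs = dense_softmax(features, kernel, bias)
print(probs)  # class probabilities that sum to 1
```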

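The official documentation for tf.keras.layers.Layer recommends that every subclass of tf.keras.layers.Layer implement three methods: __init__(), build(), and call().

__init__(): stores the layer's configuration as member variables, i.e. initializes the Layer.
build(): defines the weights. It is called once, the first time call() runs, at which point the shape of the input is known. This is also initialization, but __init__() can only set up options that do not depend on the input, whereas the input shape has to be obtained dynamically in build(). That is why build() is needed in addition to __init__().
call(): defines what the layer computes, and runs whenever the layer is called. If your layer does not need to support masking, you only need to care about the first argument of call: the input tensor.

There is also the compute_output_shape method:

compute_output_shape(input_shape): if your layer changes the shape of its input, specify here how the shape changes, so that Keras can infer shapes automatically.

W and b are created in build, and the computation is done in call.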
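The division of labour between the three methods can be imitated in a few lines of plain Python. `ToyLayer` below is a made-up stand-in, not the real Keras base class; it only shows the idea that weights are created lazily on the first call, once the input shape is known, and never again.

```python
class ToyLayer:
    """Toy imitation of the Keras layer lifecycle (not the real Layer class)."""

    def __init__(self, units):
        # Configuration only: nothing here depends on the input.
        self.units = units
        self.built = False
        self.build_calls = 0

    def build(self, input_shape):
        # The weight shape depends on the input's feature size.
        self.w = [[0.0] * self.units for _ in range(input_shape[-1])]
        self.b = [0.0] * self.units
        self.build_calls += 1
        self.built = True

    def call(self, inputs):
        # The layer's actual computation: inputs @ w + b
        return [[sum(x * self.w[i][j] for i, x in enumerate(row)) + self.b[j]
                 for j in range(self.units)]
                for row in inputs]

    def __call__(self, inputs):
        # build() runs once, on the first call.
        if not self.built:
            self.build((len(inputs), len(inputs[0])))
        return self.call(inputs)

layer = ToyLayer(3)
out = layer([[1.0, 2.0], [3.0, 4.0]])   # triggers build((2, 2))
out = layer([[5.0, 6.0], [7.0, 8.0]])   # reuses the existing weights
print(layer.build_calls, len(layer.w), len(out[0]))
```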

from tensorflow.keras import backend as K
#from tensorflow.python.keras import backend as K
from tensorflow.keras import initializers, regularizers, constraints
from tensorflow.keras.layers import Layer
#from keras.engine.topology import Layer

class Attention(Layer):
    """
    Keras Layer that implements an Attention mechanism for temporal data.
    Supports Masking.
    Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756]
    Returns not the attention weights but the vector obtained by multiplying
    each timestep by its weight and summing.
    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.
    Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
    The feature dimension is inferred from the output shape of the RNN;
    step_dim is the number of RNN timesteps, i.e. the padded sequence length.
    Example:
        # 1
        model.add(LSTM(64, return_sequences=True))
        model.add(Attention(step_dim))
        # next add a Dense layer (for classification/regression) or whatever...
        # 2
        hidden = LSTM(64, return_sequences=True)(words)
        sentence = Attention(step_dim)(hidden)
        # next add a Dense layer (for classification/regression) or whatever...
    """
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):   # **kwargs forwards extra keyword arguments to the parent Layer
        # Initialize the parent first, then set attributes on the layer
        super(Attention, self).__init__(**kwargs)

        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim   # the maxlen passed in later from TextAttBiRNN
        self.features_dim = 0

    def build(self, input_shape):
        # The input must be a 3D tensor: (batch, steps, features)
        assert len(input_shape) == 3

        # self.add_weight() is inherited from Layer and creates a trainable weight
        # The input is the hidden-state tensor x of shape (None, 400, 256); in wx+b, w has 256 entries (a 256*1 column)
        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            # One bias per timestep
            self.b = self.add_weight(shape=(input_shape[1],),
                                     initializer='zeros',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, inputs, mask=None):
        # do not pass the mask to the next layers
        # Later layers no longer need the mask, so simply return None
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        # step_dim is the parameter we passed in; it equals input_shape[1], i.e. the number of RNN timesteps
        step_dim = self.step_dim

        # After reshaping the input and the weights and taking their dot product, the tensor has shape
        # (batch_size*timesteps, 1). Each sample must be normalized separately afterwards,
        # hence e = K.reshape(..., (-1, timesteps))
        e = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim))  # e = K.dot(x, self.W)
        if self.bias:
            e += self.b
        # tanh is the usual default RNN activation; for attention the choice matters little because a softmax follows
        e = K.tanh(e)

        a = K.exp(e)
        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # Timesteps masked by an earlier layer must not contribute to the output, so their attention weights are set to 0
            # cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())
        # in some cases especially in the early stages of training the sum may be almost zero
        # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
        # K.cast converts the dtype so that both operands of the division share the same float type
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        # K.expand_dims(a) uses axis=-1 by default, i.e. it appends a dimension at the end,
        # e.g. shape (3,) becomes (3, 1)
        a = K.expand_dims(a)
        # now a.shape = (batch_size, timesteps, 1), x.shape = (batch_size, timesteps, units)

        # a*x has shape (batch_size, timesteps, units): each timestep's output vector is scaled by that timestep's weight
        # summing a*x over axis=1 gives a tensor of shape (batch_size, units)
        c = K.sum(a * x, axis=1)
        return c

    def compute_output_shape(self, input_shape):
        # The returned result is c, whose shape is (batch_size, units)
        return input_shape[0], self.features_dim

Next, build the TextAttBiRNN class
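The masking and renormalization inside call() can be checked by hand on one short sequence. The sketch below reproduces only that arithmetic in plain Python; the function name and sample scores are invented for the example, and the epsilon of 1e-7 is assumed to match the default returned by K.epsilon().

```python
import math

def masked_attention_weights(scores, mask, eps=1e-7):
    """scores: e_t after tanh; mask: 1 for real tokens, 0 for padding."""
    # Exponentiate first, then zero out the padded positions.
    a = [math.exp(s) * m for s, m in zip(scores, mask)]
    # Renormalize over the remaining positions; eps guards against 0/0.
    total = sum(a) + eps
    return [v / total for v in a]

# Four time steps, the last one is padding.
weights = masked_attention_weights([0.2, -0.1, 0.5, 0.0], [1, 1, 1, 0])
print(weights)

# A fully masked sequence yields all-zero weights instead of NaN.
print(masked_attention_weights([0.1, 0.2], [0, 0]))
```

Note that a mask only reaches this layer if an earlier layer produces one. The Embedding layer in the model below is created without mask_zero=True, so under that configuration mask would be None and the padded positions would take part in the softmax.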

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, Dense, Dropout, Bidirectional, LSTM

class TextAttBiRNN(object):
    def __init__(self, maxlen, max_features, embedding_dims,
                 class_num=5,
                 last_activation='softmax'):
        self.maxlen = maxlen
        self.max_features = max_features
        self.embedding_dims = embedding_dims
        self.class_num = class_num
        self.last_activation = last_activation

    def get_model(self):
        input = Input((self.maxlen,))

        embedding = Embedding(self.max_features, self.embedding_dims, input_length=self.maxlen)(input)
        x = Bidirectional(LSTM(128, return_sequences=True))(embedding)  # LSTM or GRU
        x = Attention(self.maxlen)(x)

        output = Dense(self.class_num, activation=self.last_activation)(x)
        model = Model(inputs=input, outputs=output)
        return model
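As a sanity check on the architecture, the number of trainable parameters can be worked out by hand. The arithmetic below assumes the settings used in this article (max_features=40001, embedding_dims=50, maxlen=400, 128 LSTM units per direction, 5 classes) and a standard Keras LSTM with biases enabled; the helper `lstm_params` is written for this example.

```python
def lstm_params(input_dim, units):
    """4 gates, each with an input kernel, a recurrent kernel and a bias."""
    return 4 * ((input_dim + units + 1) * units)

max_features, embedding_dims, maxlen = 40001, 50, 400
units, class_num = 128, 5

embedding = max_features * embedding_dims           # one row per vocabulary entry
bilstm = 2 * lstm_params(embedding_dims, units)     # forward + backward LSTM
attention = 2 * units + maxlen                      # W (256 values) + b (400 values)
dense = 2 * units * class_num + class_num           # kernel + bias

total = embedding + bilstm + attention + dense
print(embedding, bilstm, attention, dense, total)
# 2000050 183296 656 1285 2185287
```

Most of the parameters sit in the embedding table; the attention layer itself adds only 656.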

Data processing and training

The input has the usual form, the same as for textCNN and textRNN.

import os
import random
import sys
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import to_categorical
sys.path.append('../data/lesson2_data')
from utils import *

# Paths and related settings
data_dir = "../data/lesson2_data/data"
vocab_file = "../data/lesson2_data/vocab/vocab.txt"
vocab_size = 40000

# Neural network settings
max_features = 40001
maxlen = 400
batch_size = 128
embedding_dims = 50
epochs = 10

print('Preprocessing and loading data...')
# Rebuild the vocabulary if it does not exist
if not os.path.exists(vocab_file):
    build_vocab(data_dir, vocab_file, vocab_size)
# Get the word/category to id mapping dictionaries
categories, cat_to_id = read_category()
words, word_to_id = read_vocab(vocab_file)

# All the data
x, y = read_files(data_dir)
data = list(zip(x,y))
del x,y
# Shuffle
random.shuffle(data)
# Split into training and test sets
train_data, test_data = train_test_split(data)
# Encode the word ids of the texts and the category ids
x_train = encode_sentences([content[0] for content in train_data], word_to_id)
y_train = to_categorical(encode_cate([content[1] for content in train_data], cat_to_id))
x_test = encode_sentences([content[0] for content in test_data], word_to_id)
y_test = to_categorical(encode_cate([content[1] for content in test_data], cat_to_id))

print('Pad the sequences so that they have shape samples*timestep')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Building the model...')
model = TextAttBiRNN(maxlen, max_features, embedding_dims).get_model()
model.compile('adam', 'categorical_crossentropy', metrics=['accuracy'])

print('Train...')
early_stopping = EarlyStopping(monitor='val_accuracy', patience=2, mode='max')
history = model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          callbacks=[early_stopping],
          validation_data=(x_test, y_test))

print('Test...')
result = model.predict(x_test)
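The padding step in the script relies on the defaults of pad_sequences, which pad and truncate at the front of each sequence and fill with 0. The plain-Python function below imitates that default behaviour for a single sequence; `pad_pre` is a name made up for this sketch and is not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Imitate pad_sequences defaults: padding='pre', truncating='pre'."""
    seq = list(seq)[-maxlen:]                       # keep the last maxlen ids
    return [value] * (maxlen - len(seq)) + seq      # fill the front with value

print(pad_pre([1, 2, 3], 5))            # [0, 0, 1, 2, 3]
print(pad_pre([1, 2, 3, 4, 5, 6], 4))   # [3, 4, 5, 6]
```

With maxlen = 400, texts shorter than 400 tokens get leading zeros, and longer texts lose their beginning rather than their end.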

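Summary: training was not run to completion; the BiLSTM with attention trains even more slowly.

References:
Implementing an attention mechanism for text classification in Keras
[Keras] Writing custom network layers with Keras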
