1. Model Principles

The Transformer is an encoder-decoder architecture built from attention and fully connected (feed-forward) layers. It consists of four parts: an input module, an encoder module, a decoder module, and an output module. Its structure is shown in the figure below:



Figure 1: The Transformer model

There are already many public resources introducing the basics of the Transformer, which readers can consult on their own. This article assumes familiarity with those basics and discusses four questions within the architecture that deserve attention:

  1. How many mask mechanisms does the Transformer use, and how do they remove the noise introduced by padding in mini-batch samples?
  2. Why does self-attention compute attention from linear transformations of the input feature vectors instead of using the input vectors directly?
  3. What is layer normalization, and why is it used instead of batch normalization?
  4. Absolute positional encoding vs. relative positional encoding

1.1 Mask mechanisms in the Transformer

The Transformer uses three different masks: the inputs padding mask, the look-ahead mask, and the outputs padding mask.

The inputs padding mask is used in the Multi-Head Attention computations of both the encoder and the decoder. By element-wise multiplication with the attention-score matrix, it concentrates attention on the valid positions of the input sequence and thereby removes the noise introduced by padding values in the input.

The look-ahead mask is used in the decoder's Masked Multi-Head Attention. Besides suppressing the noise introduced by padding in the label sequence, it is also what turns the step-by-step generation loop into a single batched computation: the full label sequence is fed in at once, and the mask matrix ensures that at each time step the decoder can only see the tokens before that step. The outputs padding mask is used in the dynamic loss computation; it removes the label-sequence padding noise that bypasses the self-attention layers through the residual (skip) connections. Together, the look-ahead mask and the outputs padding mask completely eliminate the noise introduced by padding in the label sequence.
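To make the masks concrete, here is a minimal sketch (independent of the Self_Attention module used in Section 2) of how a padding mask and a look-ahead mask can be built in TensorFlow and applied to an attention-score matrix. It uses the common additive form, in which masked positions receive a large negative value before the softmax (the implementation in Section 2 may apply its masks differently); the function names padding_mask, look_ahead_mask, and apply_mask are purely illustrative.

import tensorflow as tf

def padding_mask(seq_len, max_len):
    # 1.0 at valid positions, 0.0 at padded positions; shape [batch_size, 1, max_len]
    return tf.expand_dims(tf.sequence_mask(seq_len, max_len, dtype=tf.float32), axis=1)

def look_ahead_mask(max_len):
    # Lower-triangular matrix: position t may only attend to positions <= t
    return tf.linalg.band_part(tf.ones([max_len, max_len]), -1, 0)

def apply_mask(scores, mask):
    # scores: [batch_size, T_q, T_k]; mask broadcasts against it.
    # Push masked scores toward -inf so that softmax assigns them ~0 attention weight.
    return tf.nn.softmax(scores + (1.0 - mask) * -1e9, axis=-1)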

1.2 Why Self-Attention uses linear transformations of the input features

Without the linear transformations, the q, k, and v vectors of every token would be identical to its input vector, so the attention distribution each token computes over the sequence would concentrate almost entirely on itself and assign very little weight to other tokens. The point of self-attention is to pick out, from the other tokens, the feature information that helps the task; if the attention distribution cannot reach other tokens, self-attention loses its purpose. Therefore, self-attention is computed from linear transformations of the input feature vectors.
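As an illustration, the following is a minimal single-head self-attention sketch (not the project's Self_Attention class) showing the learned projection layers Wq, Wk, and Wv; the class name and layout are assumptions made for this example only.

import tensorflow as tf

class TinySelfAttention(tf.keras.layers.Layer):
    def __init__(self, hidden_num):
        super().__init__()
        self.Wq = tf.keras.layers.Dense(hidden_num, use_bias=False)
        self.Wk = tf.keras.layers.Dense(hidden_num, use_bias=False)
        self.Wv = tf.keras.layers.Dense(hidden_num, use_bias=False)

    def call(self, x):                                  # x: [batch, time_steps, hidden_num]
        q, k, v = self.Wq(x), self.Wk(x), self.Wv(x)
        d_k = tf.cast(tf.shape(k)[-1], x.dtype)
        scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)
        # If q = k = x (no projections), the diagonal of `scores` would dominate after
        # the softmax, and each token would attend mostly to itself.
        return tf.matmul(tf.nn.softmax(scores, axis=-1), v)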

1.3 Layer Normalization vs. Batch Normalization

Layer normalization and batch normalization are both normalization methods: every sample is transformed as (x - mean) / std.

The difference lies in how the mean and standard deviation (std) used in the normalization are computed, as shown in the figure below:



Figure 2: How the mean and standard deviation are computed in LN and BN

Batch normalization is better suited to normalizing image data, while layer normalization is better suited to sequence data.
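A small NumPy sketch of this difference: for an input of shape [batch_size, time_steps, hidden], layer normalization computes one mean/std pair per token over the feature axis, whereas batch normalization computes one pair per feature channel over the batch (and time) axes. The array shapes below are illustrative.

import numpy as np

x = np.random.randn(4, 10, 8).astype(np.float32)    # [batch_size, time_steps, hidden]

# Layer normalization: statistics over the last (feature) axis, one pair per token
ln_mean = x.mean(axis=-1, keepdims=True)             # shape [4, 10, 1]
ln = (x - ln_mean) / (x.std(axis=-1, keepdims=True) + 1e-6)

# Batch normalization: statistics over batch and time, one pair per feature channel
bn_mean = x.mean(axis=(0, 1), keepdims=True)         # shape [1, 1, 8]
bn = (x - bn_mean) / (x.std(axis=(0, 1), keepdims=True) + 1e-6)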

1.4 Positional encoding

There are two families of positional encoding: functional encoding and parametric encoding. Functional encoding maps the input position index to an embedding through a fixed function. Parametric encoding lets the model learn the position embeddings itself through an embedding lookup.

The two approaches perform similarly. Functional encoding reduces the number of parameters the model has to learn, while parametric encoding reduces the number of hand-designed choices. When the sample size is small, functional encoding is therefore recommended; when data is plentiful, parametric encoding is usually the cheaper choice. (The Transformer uses functional absolute positional encoding, while BERT uses parametric absolute positional encoding.)
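The two families can be sketched as follows. The functional version is essentially the same sinusoidal computation as the Position_Encoder implemented in Section 2; the parametric version is a plain trainable embedding lookup as used in BERT. max_len=512, hidden, and output_dim=256 are illustrative hyperparameters.

import numpy as np
import tensorflow as tf

def functional_position_encoding(max_len, hidden):
    # Fixed sinusoidal encoding: no trainable parameters (assumes an even hidden size)
    pos = np.arange(max_len, dtype=np.float32).reshape(-1, 1)
    div = np.power(10000, np.arange(0, hidden, 2, dtype=np.float32) / hidden)
    pe = np.zeros((max_len, hidden), dtype=np.float32)
    pe[:, 0::2] = np.sin(pos / div)   # even feature indices (2i)
    pe[:, 1::2] = np.cos(pos / div)   # odd feature indices (2i+1)
    return tf.constant(pe)

# Parametric encoding: position indices are looked up in a trainable embedding table
parametric_position_encoding = tf.keras.layers.Embedding(input_dim=512, output_dim=256)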

2. Code implementation

2.1 TensorFlow implementation

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Project-level modules: the EncoderDecoder base class and the Self_Attention layer
# (the wildcard import may also re-export some of the names above)
from EncoderDecoder import *
from Attention import Self_Attention

gpus = tf.config.experimental.list_physical_devices(device_type="GPU")
tf.config.experimental.set_visible_devices(devices=gpus[0], device_type="GPU")


# Transformer Encoder
class tfEncoder(layers.Layer):
    def __init__(self, hidden_num, head_num):
        super(tfEncoder, self).__init__()
        self.self_Attention = Self_Attention(hidden_num, head_num, mask_type='self_mask')
        self.LayerNorm = layers.LayerNormalization()
        self.Hidden = tf.keras.Sequential([
            layers.Dense(hidden_num, activation='relu')
        ])

    def call(self, inputs, input_len):
        """:param
        inputs:     the embedding sequence of the input sentences; shape [batch_size, time_steps, hidden_num]
        input_len:  the valid length of the input sentences; shape [batch_size]
        """
        self_attention = self.self_Attention(inputs, seq_len=input_len)
        layer_norm_1 = self.LayerNorm(self_attention + inputs)
        hidden = self.Hidden(layer_norm_1)
        layer_norm_2 = self.LayerNorm(hidden + layer_norm_1)
        return layer_norm_2


# Transformer Decoder
class tfDecoder(layers.Layer):
    def __init__(self, hidden_num, head_num):
        super(tfDecoder, self).__init__()
        self.Future_Attention = Self_Attention(hidden_num, head_num, mask_type='future_self_mask')
        self.LayerNorm = layers.LayerNormalization()
        self.Attention = Self_Attention(hidden_num, head_num, mask_type='mask')
        self.Hidden = tf.keras.Sequential([
            layers.Dense(hidden_num, activation='relu')
        ])

    def call(self, inputs, input_len, enc_hiddens, enc_input_len, training):
        """
        **Train**
        :inputs:           label sequences; shape [batch_size*time_steps_label, time_steps_label, hidden_num]
        :input_len:        valid length of the label sequences; shape [batch_size] (the look-ahead mask routine tiles it time_steps times internally)
        :enc_hiddens:      [batch_size*time_steps_label, time_steps_input_sentences, hidden_num]
        :enc_input_len:    valid length of the input sentences; shape [batch_size]
        :outputs:          predicted label sequences; shape [batch_size*time_steps_label, time_steps_label, hidden_num]
        ****
        **Predict**
        :inputs:           label sequences; shape [batch_size, time_steps_label, hidden_num]
        :enc_hiddens:      [batch_size, time_steps_input_sentences, hidden_num]
        :outputs:          predicted label sequences; shape [batch_size, time_steps_label, hidden_num]
        """
        if input_len is None and training is not False:
            raise ValueError("Please check the label-sequence length parameter: input_len")

        #
        future_self_attention = self.Future_Attention(inputs, seq_len=input_len)
        layer_norm_1 = self.LayerNorm(future_self_attention + inputs)

        # The mask in this attention step also acts as the padding mask over the encoder outputs
        # for the encoder-decoder multi-head attention; the two serve the same purpose
        attention = self.Attention(
            layer_norm_1, keys=enc_hiddens, values=enc_hiddens, seq_len=enc_input_len
            , training=training
        )
        layer_norm_2 = self.LayerNorm(attention + layer_norm_1)
        hidden = self.Hidden(layer_norm_2)
        layer_norm_3 = self.LayerNorm(hidden + layer_norm_2)
        return layer_norm_3


# Transform Model
class Transformer(EncoderDecoder):
    def __init__(self, hidden_num, head_num, encd_layers, decd_layers,
                vocab_size, word2index, index2word, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.word2index = word2index
        self.index2word = index2word
        #
        self.Embed = layers.Embedding(vocab_size, hidden_num)
        self.PosEncod = self.Position_Encoder
        self.Encoders = [tfEncoder(hidden_num, head_num) for _ in range(encd_layers)]
        self.Decoders = [tfDecoder(hidden_num, head_num) for _ in range(decd_layers)]
        self.Output = tf.keras.Sequential([
            layers.Dense(vocab_size, activation='softmax')
        ])

    def call(self, X, y, X_len, y_len, training):
        """
        :param X:          input sentences; shape [batch_size, time_steps_X]
        :param y:          label sequences; shape [batch_size, time_steps_y]
        :param X_len:      valid length of the input sentences; shape [batch_size]
        :param y_len:      valid length of the label sequences; shape [batch_size]
        :param training:   True or False
        :return:
        """
        # Input
        inputs = self.Input(X)

        # Encoder
        encoder = inputs
        for Encoder in self.Encoders:
            encoder = Encoder(encoder, input_len=X_len)

        # Decoder
        if training is True:
            """ 强制教学,批量进行 Decoder
            The shape of encoder_outputs and decoder_inputs is converted to [batch_size*time_steps, time_steps, hidden_num] from [batch_size, time_steps, hidden_num].
            """
            bos = tf.constant([self.word2index['<bos>']] * y.shape[0], shape=[y.shape[0], 1], dtype=y.dtype)
            y = tf.concat([bos, y[:, :-1]], axis=1)  # prepend <bos> and drop the last token (<eos>)
            decoder = self.Input(y)
            #
            encoder = tf.reshape(
                tf.tile(tf.expand_dims(encoder, axis=1), [1, decoder.shape[1], 1, 1])
                , [encoder.shape[0]*decoder.shape[1], encoder.shape[1], encoder.shape[2]]
            )
            decoder = tf.reshape(
                tf.tile(tf.expand_dims(decoder, axis=1), [1, decoder.shape[1], 1, 1])
                , [decoder.shape[0]*decoder.shape[1], decoder.shape[1], decoder.shape[2]]
            )
            for Decoder in self.Decoders:
                decoder = Decoder(decoder, input_len=y_len, enc_hiddens=encoder, enc_input_len=X_len, training=training)
        else:
            decoder = self.Input(y)
            for Decoder in self.Decoders:
                decoder = Decoder(decoder, input_len=None, enc_hiddens=encoder, enc_input_len=X_len, training=training)

        # Output
        outputs = self.Output(decoder)
        if training is True:
            """ 提取并行计算产生的循环预测标签 """
            Diagonal_mask = tf.ones([outputs.shape[1], outputs.shape[1]], dtype=outputs.dtype)
            Diagonal_mask = tf.expand_dims(tf.linalg.band_part(Diagonal_mask, 0, 0), axis=-1)
            Diagonal_mask = tf.tile(Diagonal_mask, [y.shape[0], 1, 1])
            outputs = tf.reduce_sum(
                tf.reshape(outputs * Diagonal_mask, [y.shape[0], y.shape[1], y.shape[1], self.vocab_size])
                , axis=1)
        y_hat = tf.argmax(outputs, axis=-1, output_type=X.dtype)
        return y_hat, outputs

    # Forward pass of the input layer
    def Input(self, X):
        embed = self.Embed(X)
        pos_encod = self.PosEncod(embed)  # functional (sinusoidal) positional encoding, built from the embedding's shape
        return embed + pos_encod

    #
    def Position_Encoder(self, inputs, max_len=512):
        """
        Following common practice, max_len is set to 512 tokens here; texts much longer than this
        are better handled by architectures designed for long contexts (e.g., XLNet).
        """
        batch_size, steps, hiddens = list(inputs.shape)
        if steps > max_len: raise ValueError('The input sequence exceeds the maximum encoding length')
        #
        pos_encoding = np.zeros((1, max_len, hiddens), dtype=np.float32)    # np.zeros defaults to float64; float32 is required to match the model's dtype
        pos = np.arange(max_len, dtype=np.float32).reshape(-1, 1) / np.power(
            10000, np.arange(0, hiddens, 2, dtype=np.float32) / hiddens)
        #
        pos_encoding[:, :, 0::2] = np.sin(pos)  # even feature indices (2i)
        pos_encoding[:, :, 1::2] = np.cos(pos)  # odd feature indices (2i+1)
        return tf.tile(tf.constant(pos_encoding[:, :steps, :]), [batch_size, 1, 1])

2.2 PyTorch implementation

import gc

import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

# Project-level modules: attention layers and the EncoderDecoder base class
# (the wildcard imports may also re-export some of the names above)
from .Attention import *
from .EncoderDecoder import *


# Transform Encoder
class Encoder(nn.Module):
    def __init__(self, hidden_dim, head_num, hidden_act=nn.ReLU(), device='cpu'):
        super(Encoder, self).__init__()
        self.device = device
        self.self_Attention = Self_Attention(hidden_dim, head_num).to(self.device)
        self.LayerNorm = nn.LayerNorm(hidden_dim).to(self.device)
        self.Hidden = nn.Linear(hidden_dim, hidden_dim, device=self.device)
        self.HiddenAct = hidden_act.to(self.device)

    def forward(self, inputs, input_mask):
        """:param
        inputs:      the embedding sequence of the input sentences; shape [batch_size, time_steps, hidden_dim]
        input_mask:  the padding mask of the input sentences; shape [batch_size, time_steps]
        """
        self_attention = self.self_Attention(inputs, self_mask=input_mask)
        layer_norm_1 = self.LayerNorm(self_attention + inputs)
        hidden = self.HiddenAct(
            self.Hidden(layer_norm_1)
        )
        layer_norm_2 = self.LayerNorm(hidden + layer_norm_1)
        return layer_norm_2


# Transform Decoder
class Decoder(nn.Module):
    def __init__(self, hidden_dim, head_num, hidden_act=nn.ReLU(), device='cpu'):
        super(Decoder, self).__init__()
        #
        self.device = device
        #
        self.Future_Attention = Self_Attention(hidden_dim, head_num).to(self.device)
        self.LayerNorm = nn.LayerNorm(hidden_dim).to(self.device)
        self.Attention = Self_Attention(hidden_dim, head_num).to(self.device)
        self.Hidden = nn.Linear(hidden_dim, hidden_dim, device=self.device)
        self.HiddenAct = hidden_act.to(self.device)

    def forward(self, inputs, enc_hiddens, future_mask, enc_input_mask):
        """
        **Train**
        :inputs:           label sequences; shape [batch_size * y_time_steps, y_time_steps, hidden_dim]
        :enc_hiddens:      [batch_size*y_time_steps, X_time_steps, hidden_dim]
        :future_mask:      look-ahead (causal) mask over the label sequences
        :enc_input_mask:   padding mask of the input sentences
        :outputs:          predicted label sequences; shape [batch_size * y_time_steps, y_time_steps, hidden_dim]
        ****
        **Predict**
        :inputs:           label sequences; shape [batch_size, y_time_steps, hidden_dim]
        :enc_hiddens:      [batch_size, X_time_steps, hidden_dim]
        :outputs:          predicted label sequences; shape [batch_size, y_time_steps, hidden_dim]
        """
        batch_size, y_time_steps, _ = inputs.shape
        #
        future_attention = self.Future_Attention(inputs, self_mask=future_mask)
        layer_norm_1 = self.LayerNorm(future_attention + inputs)

        # The mask in this attention step also acts as the padding mask over the encoder outputs
        # for the encoder-decoder multi-head attention; the two serve the same purpose
        attention = self.Attention(
            layer_norm_1, keys=enc_hiddens, values=enc_hiddens, attention_mask=enc_input_mask
        )
        layer_norm_2 = self.LayerNorm(attention + layer_norm_1)

        #
        hidden = self.HiddenAct(
            self.Hidden(layer_norm_2)
        )
        layer_norm_3 = self.LayerNorm(hidden + layer_norm_2)
        return layer_norm_3


# Transform Model
class Transform(EncoderDecoder):
    def __init__(self, hidden_dim, head_num, encd_layers, decd_layers,
                vocab_size, word2index, index2word, device='cpu'):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.word2index = word2index
        self.index2word = index2word
        self.device = device
        #
        self.Embed = nn.Embedding(vocab_size, self.hidden_dim).to(self.device)
        self.PosEncod = self.Position_Encoder
        self.Encoders = nn.Sequential(
            Encoder(self.hidden_dim, head_num, device=self.device)
        )
        if encd_layers > 1:
            for i in range(encd_layers - 1):
                self.Encoders.add_module(
                    str(i + 1), Encoder(self.hidden_dim, head_num, device=self.device)
                )
        self.Decoders = nn.Sequential(
            Decoder(self.hidden_dim, head_num, device=self.device)
        )
        if decd_layers > 1:
            for i in range(decd_layers - 1):
                self.Decoders.add_module(
                    str(i + 1), Decoder(self.hidden_dim, head_num, device=self.device)
                )
        self.Output = nn.Linear(hidden_dim, vocab_size, device=self.device)

    # forward
    def forward(self, X, y=None, input_mask=None, future_mask=None, target_time_steps: int = None):
        """
        X:            input sentences; shape [batch_size, time_steps_X].
        y:            label sequences; shape [batch_size, time_steps_y].
        input_mask:   padding mask of the input sentences (X); shape [batch_size, time_steps_X].
        future_mask:  look-ahead (causal) mask used for teacher forcing in the Decoder; its shape is [batch_size * (time_steps_y+1), time_steps_y + 1, time_steps_y + 1].
        :return:      predicted probability distributions over the vocabulary
        """
        # Input
        _, X_time_steps = X.shape
        encoder = self.Input(X)
        del X;  gc.collect()
        # Encoder
        for Encoder in self.Encoders:
            encoder = Encoder(
                encoder, input_mask=input_mask
            )
        del Encoder;  gc.collect()
        # Train or Predict
        if future_mask is not None:
            """ 强制教学Train,批量进行 Decoder
            The shape of encoder_outputs and decoder_inputs is converted to [batch_size * time_steps_y, time_steps_y, hidden_dim]
            from [batch_size, time_steps_y + 1, hidden_dim].
            """
            #
            batch_size, y_time_steps = future_mask.shape
            y = torch.reshape(
                torch.tile(
                    torch.unsqueeze(y, 1), [1, y_time_steps, 1]
                )
                , [batch_size * y_time_steps, y_time_steps]
            )
            encoder = torch.reshape(
                torch.tile(
                    torch.unsqueeze(encoder, 1), [1, y_time_steps, 1, 1]
                )
                , [batch_size * y_time_steps, X_time_steps, self.hidden_dim]
            )
            input_mask = torch.reshape(
                torch.tile(
                    torch.unsqueeze(input_mask, 1), [1, y_time_steps, 1]
                )
                , [batch_size * y_time_steps, X_time_steps]
            )
            # Decoder
            decoder = self.Input(y)
            for Decoder in self.Decoders:
                decoder = Decoder(
                    decoder, enc_hiddens=encoder, future_mask=future_mask, enc_input_mask=input_mask
                )
            del Decoder;  gc.collect()
            # Output
            outputs = F.softmax(
                self.Output(decoder), dim=-1
            )
            outputs = torch.reshape(
                outputs, [batch_size, y_time_steps, y_time_steps, -1]
            )
        else:
            """ Predict
            y:  shape [batch_size, 1], containing the <bos> token
            """
            outputs = []
            for i in range(target_time_steps):
                # Decoder
                decoder = self.Input(y)
                for Decoder in self.Decoders:
                    decoder = Decoder(
                        decoder, enc_hiddens=encoder, future_mask=None, enc_input_mask=input_mask
                    )
                # Output: distribution over the vocabulary for the last time step
                output = F.softmax(
                    self.Output(decoder[:, -1]), dim=-1
                )                                              # [batch_size, vocab_size]
                y_score, y_hat = torch.max(output, dim=-1, keepdim=True)
                y = torch.concat([y, y_hat], dim=1)            # append the predicted token to the decoder input
                outputs.append(output.unsqueeze(1))            # keep a time dimension for later concatenation
            outputs = torch.concat(outputs, dim=1)
        return outputs

    #
    def Input(self, X):
        embed = self.Embed(X)
        pos_encod = self.PosEncod(embed)  # functional (sinusoidal) positional encoding, built from the embedding's shape
        return embed + pos_encod

    #
    def Position_Encoder(self, inputs, max_len=512):
        """
        Following common practice, max_len is set to 512 tokens here; texts much longer than this
        are better handled by architectures designed for long contexts (e.g., XLNet).
        """
        batch_size, steps, hiddens = inputs.shape
        if steps > max_len: raise ValueError('The input sequence exceeds the maximum encoding length')
        #
        pos_encoding = np.zeros((1, max_len, hiddens), dtype=np.float32)    # np.zeros defaults to float64; float32 is required to match the model's dtype
        pos = np.arange(max_len, dtype=np.float32).reshape(-1, 1) / np.power(
            10000, np.arange(0, hiddens, 2, dtype=np.float32) / hiddens)
        #
        pos_encoding[:, :, 0::2] = np.sin(pos)  # even feature indices (2i)
        pos_encoding[:, :, 1::2] = np.cos(pos)  # odd feature indices (2i+1)
        pos_encoding = torch.from_numpy(pos_encoding[:, :steps, :]).to(self.device)
        return torch.tile(pos_encoding, [batch_size, 1, 1])