A Language Model with PyTorch LSTM


I recently did a homework assignment implementing a language model with a neural network in PyTorch. It is not a particularly hard task, but while doing it I found that most of the write-ups online are based on PyTorch 0.4 or older. For this task I wanted to try a somewhat more elegant style, mainly using the PyTorch DataLoader to build batched data conveniently.

First, be clear about the training task: predict the next word from the previous one (i.e. given the current word, predict the word that follows), which is how the language model picks up lexical and syntactic patterns. The other thing to settle before building anything is the model's input and output: the input is each original sentence, and the output is that sentence shifted forward by one timestep. Here is an example:


(Figure: input and output format)


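To make the one-timestep shift concrete, here is a tiny made-up example of a tokenized sentence and the input/target pair derived from it (this is exactly what construct_input_output below does):

sent = ['the', 'tragedy', 'of', 'hamlet', '</s>']
input_words  = sent[:-1]   # ['the', 'tragedy', 'of', 'hamlet']
target_words = sent[1:]    # ['tragedy', 'of', 'hamlet', '</s>']
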
With the task roughly defined, we can get going. First pick a corpus; which one does not matter much, Wikipedia text would also work. Here I use Shakespeare's Hamlet as the training corpus.

1. Read in and preprocess the corpus (data_loader.py)
def tokenize(lines):
    END = '</s>'  # end-of-sentence marker appended to every sentence
    sents = [line.lower().split() + [END] for line in lines]
    return sents


# Optionally filter sentences by length here.
def readin(path):
    list_lines = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            if len(line.strip()) == 0:  # skip empty lines
                continue
                continue
            words = line.split() + ['</s>']  # only used by the (optional) length filter below
            # if 20 < len(words) < 500:
            list_lines.append(line)
    return list_lines[:100]


# Build the input/output pairs: the target is the input shifted by one timestep.
def construct_input_output(sents):
    input_sents = [words[:-1] for words in sents]
    output_sents = [words[1:] for words in sents]
    return input_sents, output_sents


def process(path):
    list_of_lines = readin(path)
    lines_tokens = tokenize(list_of_lines)
    input, output = construct_input_output(lines_tokens)
    return input, output


2. Build the dataset and DataLoader from the corpus (data_loader.py)

This step makes batching much nicer: compared with the hand-rolled batch slicing that many online examples do inside the training loop, this approach is friendlier, more efficient and more readable. The other key piece is the dataset itself together with the vocabulary (vocab): we map words to vocabulary indices (word2index) and indices back to words (index2word) with the two methods vectorize and unvectorize. Most importantly, a dataset built on from torch.utils.data import Dataset has to implement the two predefined methods __getitem__ and __len__; see the PyTorch documentation for details.

Data Loading and Processing Tutorial (pytorch.org)




import torch
from torch.nn import functional as F
from torch.utils.data import Dataset
from gensim.corpora.dictionary import Dictionary

class LangDataset(Dataset):
    def __init__(self, src_sents, trg_sents, max_len=-1):
        self.src_sents = src_sents
        self.trg_sents = trg_sents

        # Create the vocabulary for both the source and target.
        self.vocab = Dictionary(src_sents + trg_sents)

        # Patch the vocabularies and add the <pad> and <unk> symbols.
        special_tokens = {'<pad>': 0, '<unk>': 1, '</s>': 2}
        self.vocab.patch_with_special_tokens(special_tokens)

        # Keep track of how many data points.
        self._len = len(src_sents)

        if max_len < 0:
            # If it's not set, find the longest text in the data.
            max_src_len = max(len(sent) for sent in src_sents)
            self.max_len = max_src_len
        else:
            self.max_len = max_len

    def pad_sequence(self, vectorized_sent, max_len):
        # To pad the sentence:
        # Pad left = 0; Pad right = max_len - len of sent.
        pad_dim = (0, max_len - len(vectorized_sent))
        return F.pad(vectorized_sent, pad_dim, 'constant')

    def __getitem__(self, index):
        vectorized_src = self.vectorize(self.vocab, self.src_sents[index])
        vectorized_trg = self.vectorize(self.vocab, self.trg_sents[index])
        return {'x': self.pad_sequence(vectorized_src, self.max_len),
                'y': self.pad_sequence(vectorized_trg, self.max_len),
                'x_len': len(vectorized_src),
                'y_len': len(vectorized_trg)}

    def __len__(self):
        return self._len

    def vectorize(self, vocab, tokens):
        """
        :param tokens: Tokens that should be vectorized.
        :type tokens: list(str)
        """
        # See https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2idx
        # Let's just cast the list of indices into a torch tensor directly =)
        return torch.tensor(vocab.doc2idx(tokens, unknown_word_index=1))

    def unvectorize(self, vocab, indices):
        """
        :param indices: Indices to convert back to tokens.
        :type indices: list(int)
        """
        return [vocab[i] for i in indices]


Here gensim's fairly recent patch_with_special_tokens is used to add the special tokens <pad>, <unk> and </s> to the vocabulary. At this point the raw words can already be turned into sequences of vocabulary indices, so we can move on to building the model.
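
The heading mentions the DataLoader too, but the original helper that builds it is not shown. A minimal sketch of the get_dataset_dataloader used by train.py further below, assuming the process() and LangDataset defined above (shuffle and drop_last are my own choices):

from torch.utils.data import DataLoader

def get_dataset_dataloader(path, batch_size):
    # Build input/target sentences from the corpus and wrap them in LangDataset.
    input_sents, output_sents = process(path)
    dataset = LangDataset(input_sents, output_sents)
    # drop_last keeps every batch the same size, which matches the fixed-size
    # hidden state initialised in the training loop.
    dataloader = DataLoader(dataset, batch_size=batch_size,
                            shuffle=True, drop_last=True)
    return dataset, dataloader, input_sents, output_sents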

3. Build the model with an LSTM (model.py)

If you are familiar with RNN-style models there is not much to explain: a three-part structure of embedding layer, LSTM layer and fully connected layer. Printing the model gives:

RNNLM(
  (embed): Embedding(3910, 300)
  (lstm): LSTM(300, 1024, num_layers=2, batch_first=True)
  (fc): Linear(in_features=1024, out_features=3910, bias=True)
)

Here the word vectors are 300-dimensional, the LSTM has 2 layers, and the hidden size is 1024. The code:


import torch.nn as nn
from torch.nn import functional as F
class RNNLM(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1, dropout_p=0.5):
        super(RNNLM, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)
        self._dropout_p = dropout_p

    def forward(self, x, h):
        # Embed word ids to vectors
        x = self.embed(x)
        # Forward propagate LSTM
        out, h = self.lstm(x, h)
        batch_size, seq_size, hidden_size = out.shape

        # Reshape output to (batch_size*sequence_length, hidden_size)
        out = out.contiguous().view(batch_size * seq_size, hidden_size)

        # apply dropout (active only in training mode)
        out = self.fc(F.dropout(out, p=self._dropout_p, training=self.training))
        out_feat = out.shape[-1]
        out = out.view(batch_size, seq_size, out_feat)
        return out, h


We only need the output at every timestep here, so we take the output of each step and do nothing special with the hidden state. Because of batching, the batch and sequence-length dimensions have to be merged before the tensor can go through the fully connected layer; after the FC layer we reshape the output back to (batch, sequence, feature).
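
A quick sanity check of the shapes, using illustrative numbers that mirror the printed model above (not part of the original post):

import torch

model = RNNLM(vocab_size=3910, embed_size=300, hidden_size=1024, num_layers=2)
x = torch.randint(0, 3910, (8, 20))                       # (batch, seq_len) of word ids
h0 = (torch.zeros(2, 8, 1024), torch.zeros(2, 8, 1024))   # (num_layers, batch, hidden)
out, h = model(x, h0)
print(out.shape)   # torch.Size([8, 20, 3910])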

4. Training (train.py)

The training procedure is roughly:

  • Read the inputs and targets x and y and move them to the GPU (without a GPU it runs at turtle speed), and take the vocabulary size as the initialisation parameter for the embedding layer.
  • Initialise the model: model = RNNLM(a pile of hyperparameters ( ̄▽ ̄)~*).
  • Define the optimizer; here I use Adam (it converges fairly quickly; SGD or anything else is fine too).
  • Define the loss, here cross-entropy, with one twist: because sentences of different lengths are padded, the padded positions have to be masked out so they do not affect the loss. A nicer way to report a language model's loss is perplexity, which is simply e raised to the cross-entropy loss (for the derivation see Perplexity Vs Cross-entropy).
  • Record the loss and accuracy for every epoch.
  • Define the path where the model is saved.
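
The training snippets below also rely on a few imports and globals that the post does not show; a plausible set is sketched here (config holds the hyperparameters, path points at the corpus file, and the device line is just my guess):

import math
import torch
import torch.optim as optim
from torch.autograd import Variable
from torch.nn import functional as F
from tqdm import tqdm

# Assumed, not shown in the original post:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
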
# Reshape the batched predictions and targets so they can be fed to the loss/accuracy helpers
def normalize_sizes(y_pred, y_true):
    if len(y_pred.size()) == 3:
        y_pred = y_pred.contiguous().view(-1, y_pred.size(2))
    if len(y_true.size()) == 2:
        y_true = y_true.contiguous().view(-1)
    return y_pred, y_true


# Accuracy that ignores the <pad> positions (mask_index)
def compute_accuracy(y_pred, y_true, mask_index=0):
    y_pred, y_true = normalize_sizes(y_pred, y_true)

    _, y_pred_indices = y_pred.max(dim=1)

    correct_indices = torch.eq(y_pred_indices, y_true).float()
    valid_indices = torch.ne(y_true, mask_index).float()

    n_correct = (correct_indices * valid_indices).sum().item()
    n_valid = valid_indices.sum().item()

    return n_correct / n_valid * 100


# Sequence loss: cross-entropy that ignores the <pad> positions
def sequence_loss(y_pred, y_true, mask_index=0):
    y_pred, y_true = normalize_sizes(y_pred, y_true)
    return F.cross_entropy(y_pred, y_true, ignore_index=mask_index)
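
A tiny made-up check of the masking behaviour (random predictions, arbitrary indices): positions where the target equals the <pad> index 0 are skipped by both helpers:

# Made-up tensors just to illustrate the pad mask.
y_pred = torch.randn(2, 5, 3910)           # (batch, seq_len, vocab)
y_true = torch.tensor([[5, 8, 2, 0, 0],
                       [7, 3, 9, 4, 2]])   # 0 marks padded positions
print(sequence_loss(y_pred, y_true))       # ignores the two <pad> positions
print(compute_accuracy(y_pred, y_true))    # accuracy over the 8 non-pad tokens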


The training loop:


# The training loop
def train():
    lang_dataset, dataloader, input_sents, output_sents = \
        get_dataset_dataloader(path, config.batch_size)
    vocab_size = len(lang_dataset.vocab)

    model = RNNLM(
        vocab_size,
        config.embed_size,
        config.hidden_size,
        config.num_layers,
        config.dropout_p
    )

    model = model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)

    train_loss = []
    train_acc = []
    # initialize the loss
    best_loss = 9999999.0
    for epoch in range(config.num_epochs):
        # initialize the hidden state (it is carried across batches within an epoch)
        states = (Variable(torch.zeros(config.num_layers, config.batch_size, config.hidden_size)).to(device),
                  Variable(torch.zeros(config.num_layers, config.batch_size, config.hidden_size)).to(device))

        running_loss = 0.0
        running_acc = 0.0
        model.train()
        batch_index = 0
        for data_dict in tqdm(dataloader):
            batch_index += 1
            optimizer.zero_grad()
            x = data_dict['x'].to(device)
            y = data_dict['y'].to(device)
            y_pred, states = model(x, states)
            loss = sequence_loss(y_pred, y)
            loss.backward(retain_graph=True)  # graph is reused because the hidden state is carried across batches
            optimizer.step()
            running_loss += (loss.item() - running_loss) / batch_index
            acc_t = compute_accuracy(y_pred, y)
            running_acc += (acc_t - running_acc) / batch_index
        print('Epoch = %d, Train loss = %f, Train accuracy = %f, Train perplexity = %f' % (
            epoch, running_loss, running_acc, math.exp(running_loss)))
        train_loss.append(running_loss)
        train_acc.append(running_acc)
        if running_loss < best_loss:
            # save the best model so far
            torch.save(model, './model_save/best_model_epoch%d_loss_%f.pth' % (epoch, running_loss))
            best_loss = running_loss
        # note: this assumes a generate() variant that takes the in-memory model and dataset
        # (the generation.py version further below reloads them from saved paths instead)
        print(' '.join(generate(model, lang_dataset, 'the')))

    return train_loss, train_acc


Then it is basically just a matter of waiting for training to finish.

I also wrote a generator: take the trained model, feed it a first word by hand, and see what kind of sentence it comes up with; generation stops as soon as the end-of-sentence marker appears.

5. Generate a sentence from a first word (generation.py)


def generate(input_word, dataset_p, model_p, word_len=100, temperature=1.0):
    model = torch.load(model_p)
    dataset = get_dataset(dataset_p)
    model.eval()
    hidden = (Variable(torch.zeros(config.num_layers, 1, config.hidden_size)).to(device),
              Variable(torch.zeros(config.num_layers, 1, config.hidden_size)).to(device))  # batch_size is 1
    start_idx = dataset.vectorize(dataset.vocab, [input_word])
    input_tensor = torch.stack([start_idx] * 1)
    input = input_tensor.to(device)
    word_list = [input_word]
    for i in range(word_len):  # generate word by word

        output, hidden = model(input, hidden)
        word_weights = output.squeeze().data.div(temperature).exp().cpu()
        # sample the next word id from the temperature-scaled distribution
        word_idx = torch.multinomial(word_weights, 1)[0]
        if word_idx == 2:  # stop at the </s> token
            break
        input.data.fill_(word_idx)  # feed the sampled word back in as the next input
        word = dataset.unvectorize(dataset.vocab, [word_idx.item()])
        word_list.append(word[0])
    return word_list
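
An example call, with placeholder paths (get_dataset is another helper in data_loader.py that the post does not show; presumably it rebuilds the LangDataset from the corpus):

words = generate('the', dataset_p='data/hamlet.txt',
                 model_p='./model_save/best_model.pth', word_len=30)
print(' '.join(words))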


I trained on this small dataset and then generated with "the" as the first word; here are a few snippets of the output at different epochs:


Epoch = 4, Train loss = 4.982013, Train accuracy = 8.298720, Train perplexity = 145.767445
the voltemand, cold, of buried hamlet ***

Epoch = 11, Train loss = 2.527130, Train accuracy = 28.159594, Train perplexity = 12.517534
the terms terms use of of hamlet, the the states world late

Epoch = 15, Train loss = 1.226519, Train accuracy = 50.120127, Train perplexity = 3.409342
the tragedy of the prince of hamlet’s


As the number of training epochs grows, perplexity and loss both drop, the accuracy of the predicted words goes up, and the generated sentences become increasingly coherent. Overall the results are quite decent.

It is a nice little exercise to practise on; try it with different corpora.

I will post the full code after the assignment has been handed in.

