Text Analysis in Python: Text Classification with PyTorch

Reposted

mob64ca14017c37 2023-09-03 09:41:15


For the illustrated version of this tutorial, see: http://studyai.com/pytorch-1.4/beginner/text_sentiment_ngrams_tutorial.html

This tutorial shows how to use the text classification datasets in torchtext, including

- AG_NEWS,
- SogouNews,
- DBpedia,
- YelpReviewPolarity,
- YelpReviewFull,
- YahooAnswers,
- AmazonReviewPolarity,
- AmazonReviewFull

This example shows how to train a supervised learning algorithm for text classification using one of these TextClassification datasets.

Load data with ngrams

A bag-of-ngrams feature captures partial information about local word order. In practice, bi-grams or tri-grams used as word groups are more informative than single words alone. For example:

"load data with ngrams"
Bi-grams results: "load data", "data with", "with ngrams"
Tri-grams results: "load data with", "data with ngrams"
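
The bi-gram and tri-gram lists above can be reproduced with a few lines of plain Python. This is only a minimal sketch: `word_ngrams` is a hypothetical helper written for illustration, not a torchtext function, and it assumes simple whitespace tokenization.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "load data with ngrams".split()
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)

# With ngrams set to 2, the feature list is the single words plus the bi-grams.
features = tokens + bigrams

print(bigrams)
print(trigrams)
print(features)
```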

The TextClassification dataset supports the ngrams method. With ngrams set to 2, each example text in the dataset becomes a list of single words plus bi-gram strings.

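import os
import torch
import torchtext
# Note: this import targets the torchtext release paired with PyTorch 1.4;
# newer torchtext releases may not provide text_classification.
from torchtext.datasets import text_classification
NGRAMS = 2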
if not os.path.isdir('./.data'):
    os.mkdir('./.data')
train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=NGRAMS, vocab=None)
BATCH_SIZE = 16
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Define the model

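The model consists of an EmbeddingBag layer and a linear layer (see the figure below). nn.EmbeddingBag computes the mean value of a "bag" of embeddings. The text entries here have different lengths. nn.EmbeddingBag needs no padding here because the text lengths are stored as offsets.

In addition, because nn.EmbeddingBag accumulates the mean of the embeddings on the fly, it can improve the performance and memory efficiency of processing a sequence of tensors.

[Figure: model architecture, ../_images/text_sentiment_ngrams_model.png]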
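
To make the "bag mean" behaviour concrete, the following sketch compares the output of nn.EmbeddingBag with a mean computed by hand from its weight table. The vocabulary size, embedding width, and token ids are made-up values chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes (assumptions for this sketch): 10 tokens, 4-dimensional embeddings.
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="mean")

# Two entries of different lengths, concatenated without padding:
# entry 0 = [1, 2, 3], entry 1 = [4, 5].
text = torch.tensor([1, 2, 3, 4, 5])
offsets = torch.tensor([0, 3])  # start index of each entry in `text`

pooled = bag(text, offsets)  # one row per entry, shape (2, 4)

# The same result computed by hand from the embedding table.
manual = torch.stack([
    bag.weight[text[0:3]].mean(dim=0),
    bag.weight[text[3:5]].mean(dim=0),
])

print(pooled.shape)
print(torch.allclose(pooled, manual))
```

Each row of `pooled` is the average embedding of one entry, which is why entries of different lengths can share a batch without padding.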

import torch.nn as nn
import torch.nn.functional as F
class TextSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

Instantiate the model

The AG_NEWS dataset has four labels, so the number of classes is four.

- 1 : World
- 2 : Sports
- 3 : Business
- 4 : Sci/Tec

The vocabulary size equals the length of vocab (single words plus ngrams).

VOCAB_SIZE = len(train_dataset.get_vocab())
EMBED_DIM = 32
NUM_CLASS = len(train_dataset.get_labels())
model = TextSentiment(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)
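
As a rough illustration of how the vocabulary size and the number of classes determine the model's shape, the sketch below builds the same two layers with a made-up vocabulary size of 1000 (the real value comes from the dataset) and checks the parameter count and the output shape.

```python
import torch
import torch.nn as nn

# vocab_size is a hypothetical value for this sketch; embed_dim and
# num_class match the tutorial (32 and 4).
vocab_size, embed_dim, num_class = 1000, 32, 4

embedding = nn.EmbeddingBag(vocab_size, embed_dim)
fc = nn.Linear(embed_dim, num_class)

# Embedding table + linear weights + linear bias.
n_params = sum(p.numel() for layer in (embedding, fc) for p in layer.parameters())
expected = vocab_size * embed_dim + embed_dim * num_class + num_class

# A batch of two entries: [5, 17, 42] and [8, 99].
text = torch.tensor([5, 17, 42, 8, 99])
offsets = torch.tensor([0, 3])
logits = fc(embedding(text, offsets))

print(n_params, expected)
print(logits.shape)  # one row of class scores per entry
```

Almost all parameters sit in the embedding table, so the vocabulary size (which grows with ngrams) dominates the model size.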

Functions used to generate batches

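Since the text entries have different lengths, a custom function generate_batch() is used to generate data batches and offsets. The function is passed to collate_fn in torch.utils.data.DataLoader. The input to collate_fn is a list of tensors with the size of batch_size, and the collate_fn function packs them into a mini-batch. Make sure collate_fn is declared as a top-level function so that it is available in each worker.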

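The text entries in the original data batch input are packed into a list and concatenated into a single tensor as the input of nn.EmbeddingBag. The offsets are a tensor of delimiters that mark the starting index of each individual sequence in the text tensor. Label is a tensor holding the labels of the individual text entries.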

def generate_batch(batch):
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    # Start with 0, followed by the length of every entry.
    offsets = [0] + [len(entry) for entry in text]
    # Drop the last length and take the cumulative sum to get the start
    # index of each entry, e.g. lengths [3, 2, 4] -> offsets [0, 3, 5].
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text = torch.cat(text)
    return text, offsets, label
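
The offset arithmetic in generate_batch() can be checked without PyTorch. The sketch below uses three made-up entries and `itertools.accumulate` in place of `cumsum`; the token ids are arbitrary.

```python
from itertools import accumulate

# Three hypothetical entries of lengths 3, 2 and 4.
entries = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
lengths = [len(entry) for entry in entries]

# Same recipe as generate_batch(): prepend 0, drop the last length, cumulative sum.
offsets = list(accumulate([0] + lengths[:-1]))

# Concatenate all entries into one flat sequence.
flat = [token for entry in entries for token in entry]

# Each entry can be recovered from the flat sequence using consecutive offsets.
bounds = offsets + [len(flat)]
recovered = [flat[start:end] for start, end in zip(bounds[:-1], bounds[1:])]

print(offsets)
print(recovered == entries)
```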

Define functions to train the model and evaluate results

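PyTorch users are advised to use torch.utils.data.DataLoader, which makes it easy to load data in parallel (see the data loading tutorial). Here we use DataLoader to load the AG_NEWS dataset and send it to the model for training/validation.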

from torch.utils.data import DataLoader

def train_func(sub_train_):

    # Train the model
    train_loss = 0
    train_acc = 0
    data = DataLoader(sub_train_, batch_size=BATCH_SIZE, shuffle=True,
                      collate_fn=generate_batch)
    for i, (text, offsets, cls) in enumerate(data):
        optimizer.zero_grad()
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        output = model(text, offsets)
        loss = criterion(output, cls)
        train_loss += loss.item()
        loss.backward()
        optimizer.step()
        train_acc += (output.argmax(1) == cls).sum().item()

    # Adjust the learning rate
    scheduler.step()

    return train_loss / len(sub_train_), train_acc / len(sub_train_)

def test(data_):
    total_loss = 0
    acc = 0
    data = DataLoader(data_, batch_size=BATCH_SIZE, collate_fn=generate_batch)
    for text, offsets, cls in data:
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        with torch.no_grad():
            output = model(text, offsets)
            # Keep the batch loss separate so it does not overwrite the running total
            batch_loss = criterion(output, cls)
            total_loss += batch_loss.item()
            acc += (output.argmax(1) == cls).sum().item()

    return total_loss / len(data_), acc / len(data_)
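
The DataLoader-plus-collate_fn pattern used by the functions above can be tried on toy data before touching the real dataset. In this sketch the dataset is a plain Python list of (label, token ids) pairs and `collate` is a hypothetical stand-in for generate_batch(); both are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader

# Four made-up (label, token ids) pairs with different lengths.
toy_data = [
    (0, torch.tensor([1, 2, 3])),
    (1, torch.tensor([4, 5])),
    (2, torch.tensor([6, 7, 8, 9])),
    (3, torch.tensor([10])),
]

def collate(batch):
    """Pack variable-length entries into flat text, offsets and labels."""
    labels = torch.tensor([label for label, _ in batch])
    tokens = [ids for _, ids in batch]
    lengths = torch.tensor([len(ids) for ids in tokens])
    # Start index of each entry: 0 followed by the running sum of lengths.
    offsets = torch.cat([torch.zeros(1, dtype=torch.long), lengths[:-1]]).cumsum(dim=0)
    return torch.cat(tokens), offsets, labels

# shuffle=False keeps the order fixed so the batches are easy to inspect.
loader = DataLoader(toy_data, batch_size=2, shuffle=False, collate_fn=collate)
batches = list(loader)

for text, offsets, labels in batches:
    print(text.tolist(), offsets.tolist(), labels.tolist())
```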

Split the dataset and run the model

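Because the original AG_NEWS has no validation set, we split the training set into train/valid sets with a split ratio of 0.95 (train) and 0.05 (valid). Here we use the torch.utils.data.dataset.random_split function from the PyTorch core library.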
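
A minimal sketch of the 0.95/0.05 split on a stand-in dataset of 100 integers. The dataset contents and the fixed generator seed are assumptions made here so the example is small and repeatable; the tutorial itself splits the AG_NEWS training set without a seed.

```python
import torch
from torch.utils.data import random_split

dataset = list(range(100))  # stand-in for the real training set

train_len = int(len(dataset) * 0.95)
train_part, valid_part = random_split(
    dataset,
    [train_len, len(dataset) - train_len],
    generator=torch.Generator().manual_seed(0),  # optional, for repeatability
)

# Every sample lands in exactly one of the two subsets.
train_items = [dataset[i] for i in train_part.indices]
valid_items = [dataset[i] for i in valid_part.indices]

print(len(train_part), len(valid_part))
print(sorted(train_items + valid_items) == dataset)
```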

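The CrossEntropyLoss criterion combines nn.LogSoftmax() and nn.NLLLoss() in a single class. It is useful when training a classification problem with C classes. SGD implements stochastic gradient descent as the optimizer. The initial learning rate is set to 4.0. StepLR is used here to adjust the learning rate across epochs.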
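
The claim that CrossEntropyLoss combines LogSoftmax and NLLLoss can be verified numerically. The logits and targets below are random made-up values for a batch of three samples and four classes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

logits = torch.randn(3, 4)          # 3 samples, 4 classes (raw scores)
target = torch.tensor([0, 2, 3])    # class index for each sample

# One-step version used in the tutorial.
ce = nn.CrossEntropyLoss()(logits, target)

# Two-step version: log-softmax over classes, then negative log-likelihood.
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)

print(ce.item(), nll.item())
print(torch.allclose(ce, nll))
```

This is also why the model's forward() returns raw scores from the linear layer without applying a softmax itself.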

import time
from torch.utils.data.dataset import random_split
N_EPOCHS = 5
min_valid_loss = float('inf')

criterion = torch.nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=4.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

train_len = int(len(train_dataset) * 0.95)
sub_train_, sub_valid_ = \
    random_split(train_dataset, [train_len, len(train_dataset) - train_len])

for epoch in range(N_EPOCHS):

    start_time = time.time()
    train_loss, train_acc = train_func(sub_train_)
    valid_loss, valid_acc = test(sub_valid_)

    secs = int(time.time() - start_time)
    mins = secs // 60
    secs = secs % 60

    print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')
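
With StepLR(optimizer, 1, gamma=0.9) and scheduler.step() called once per epoch, the learning rate is multiplied by 0.9 after every epoch. The schedule for the five epochs can be worked out in plain Python; this sketch only reproduces the arithmetic and does not call the scheduler.

```python
initial_lr = 4.0   # lr passed to SGD in the tutorial
gamma = 0.9        # decay factor passed to StepLR
n_epochs = 5

# Learning rate in effect during each epoch (epoch 1 uses the initial value).
lrs = [initial_lr * gamma ** epoch for epoch in range(n_epochs)]

for epoch, lr in enumerate(lrs, start=1):
    print(f"epoch {epoch}: lr = {lr:.4f}")
```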

Running the model on GPU gives the following output:

Epoch: 1 | time in 0 minutes, 11 seconds

Loss: 0.0263(train)     |       Acc: 84.5%(train)
Loss: 0.0001(valid)     |       Acc: 89.0%(valid)

Epoch: 2 | time in 0 minutes, 10 seconds

Loss: 0.0119(train)     |       Acc: 93.6%(train)
Loss: 0.0000(valid)     |       Acc: 89.6%(valid)

Epoch: 3 | time in 0 minutes, 9 seconds

Loss: 0.0069(train)     |       Acc: 96.4%(train)
Loss: 0.0000(valid)     |       Acc: 90.5%(valid)

Epoch: 4 | time in 0 minutes, 11 seconds

Loss: 0.0038(train)     |       Acc: 98.2%(train)
Loss: 0.0000(valid)     |       Acc: 90.4%(valid)

Epoch: 5 | time in 0 minutes, 11 seconds

Loss: 0.0022(train)     |       Acc: 99.0%(train)
Loss: 0.0000(valid)     |       Acc: 91.0%(valid)

Evaluate the model with the test dataset

print('Checking the results of test dataset...')
test_loss, test_acc = test(test_dataset)
print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')

Checking the results of test dataset...

Loss: 0.0237(test)      |       Acc: 90.5%(test)

Test on a random news item

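Use the best model so far to test a golf news item. The label information is reproduced in the ag_news_label dictionary in the code that follows.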

import re
from torchtext.data.utils import ngrams_iterator
from torchtext.data.utils import get_tokenizer

ag_news_label = {1 : "World",
                 2 : "Sports",
                 3 : "Business",
                 4 : "Sci/Tec"}

def predict(text, model, vocab, ngrams):
    tokenizer = get_tokenizer("basic_english")
    with torch.no_grad():
        text = torch.tensor([vocab[token]
                            for token in ngrams_iterator(tokenizer(text), ngrams)])
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

vocab = train_dataset.get_vocab()
model = model.to("cpu")

print("This is a %s news" %ag_news_label[predict(ex_text_str, model, vocab, 2)])

This is a Sports news
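
The last step of predict() maps the model's class scores to a label name. The sketch below isolates that step with made-up scores; the `+ 1` presumably compensates for the dataset storing classes zero-based while ag_news_label is keyed from 1, which is an assumption about the loader rather than something stated in the tutorial.

```python
import torch

ag_news_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tec"}

# Hypothetical scores for one text over the four classes.
logits = torch.tensor([[0.1, 2.3, -0.5, 0.4]])

class_index = logits.argmax(1).item()   # zero-based index of the highest score
label_id = class_index + 1              # shift to the 1-based label ids
probabilities = torch.softmax(logits, dim=1)

print(class_index, label_id, ag_news_label[label_id])
print(probabilities.sum().item())
```

Applying softmax is optional for picking the class, since argmax over raw scores gives the same index, but it is handy when a confidence value is wanted.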
