Kaggle Jigsaw Text Classification Competition: Solution Summary

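wx6464351503832 2023-05-17 17:16:24

© Copyright belongs to the author. This is an original work by 51CTO blog author wx6464351503832; please contact the author for permission to repost. Unauthorized reposting may be pursued legally.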

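The resources below were shared by competitors in China and abroad. Many thanks to them for sharing so generously.

Competition overview

The annual Jigsaw toxic comment competition has started. This edition differs from the previous two. Earlier competitions used English training and test sets, whereas this time the training set is a combination of the training sets from the previous two competitions. The validation set covers three languages: es (Spanish), it (Italian), and tr (Turkish). The test set covers six languages: es (Spanish), it (Italian), tr (Turkish), ru (Russian), pt (Portuguese), and fr (French).
-- from "Lessons from a global top-15 finish in Kaggle's Jigsaw multilingual comment classification competition"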

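Problem analysis

This is a text classification competition: the goal is to decide whether a given text is a toxic comment, i.e. binary (0/1) classification. The training data also provides several extra columns, including sensitive-term features and scores for other rating dimensions. The test set has none of these extra features, only the text.

The evaluation metric shows that this is more than plain binary classification. Besides overall classification quality, it separately scores two subsets of the predictions: non-toxic comments that contain sensitive terms, and toxic comments that contain none. These two subsets therefore deserve extra attention. One option is to give them higher sample weights so that the model learns to predict them correctly.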
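
The weighting idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `sample_weight`, the flag `has_sensitive_term`, and the `boost` value are hypothetical and are not taken from any of the shared solutions.

```python
def sample_weight(is_toxic, has_sensitive_term, boost=1.0):
    """Return a training weight for one comment.

    Assumption: every comment starts at weight 1.0, and the two subsets
    singled out by the metric (non-toxic with a sensitive term, toxic
    without one) receive an extra `boost`.
    """
    weight = 1.0
    if is_toxic != has_sensitive_term:
        weight += boost
    return weight


# Toy rows: (is_toxic, has_sensitive_term)
rows = [(False, False), (False, True), (True, False), (True, True)]
weights = [sample_weight(toxic, term) for toxic, term in rows]
print(weights)  # [1.0, 2.0, 2.0, 1.0]
```

Weights like these would typically be passed to a per-sample loss. How large the boost should be is a tuning choice.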

Distribution of comments about different races over time (figure omitted).

Text statistics features (figure omitted).

Word cloud (figure omitted).

More interesting data analysis can be found here: https://www.kaggle.com/nz0722/simple-eda-text-preprocessing-jigsaw

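Third-place solution walkthrough

Code repository: https://github.com/sakami0000/kaggle_jigsaw
Solution post: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/discussion/97471#latest-582610

Model 1 LstmGruNet

As the name suggests, the author built this text classifier mainly from two recurrent sequence models, LSTM and GRU.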

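import torch
import torch.nn as nn
import torch.nn.functional as F

# ProjSumEmbedding and SpatialDropout are custom modules that are not shown
# in this post; they are not part of torch.nn.


class LstmGruNet(nn.Module):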

    def __init__(self, embedding_matrices, num_aux_targets, embedding_size=256, lstm_units=128,
                 gru_units=128):
        super(LstmGruNet, self).__init__()
        self.embedding = ProjSumEmbedding(embedding_matrices, embedding_size)
        self.embedding_dropout = SpatialDropout(0.2)

        self.lstm = nn.LSTM(embedding_size, lstm_units, bidirectional=True, batch_first=True)
        self.gru = nn.GRU(lstm_units * 2, gru_units, bidirectional=True, batch_first=True)

        dense_hidden_units = gru_units * 4
        self.linear1 = nn.Linear(dense_hidden_units, dense_hidden_units)
        self.linear2 = nn.Linear(dense_hidden_units, dense_hidden_units)

        self.linear_out = nn.Linear(dense_hidden_units, 1)
        self.linear_aux_out = nn.Linear(dense_hidden_units, num_aux_targets)

    def forward(self, x):
        h_embedding = self.embedding(x)
        h_embedding = self.embedding_dropout(h_embedding)

        h1, _ = self.lstm(h_embedding)
        h2, _ = self.gru(h1)

        # global average pooling
        avg_pool = torch.mean(h2, 1)
        # global max pooling
        max_pool, _ = torch.max(h2, 1)

        h_conc = torch.cat((max_pool, avg_pool), 1)
        h_conc_linear1 = F.relu(self.linear1(h_conc))
        h_conc_linear2 = F.relu(self.linear2(h_conc))

        hidden = h_conc + h_conc_linear1 + h_conc_linear2

        result = self.linear_out(hidden)
        aux_result = self.linear_aux_out(hidden)
        out = torch.cat([result, aux_result], 1)

        return out
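
To see why `dense_hidden_units` is `gru_units * 4`, it helps to trace the tensor shapes through `forward`. The sketch below does only the shape arithmetic in plain Python. The helper name `lstm_gru_shapes` and the example values for batch size, sequence length, and `num_aux_targets` are hypothetical, and it assumes `ProjSumEmbedding` returns a tensor of shape (batch, seq_len, embedding_size), which is what the LSTM's input size implies.

```python
def lstm_gru_shapes(batch, seq_len, num_aux_targets,
                    embedding_size=256, lstm_units=128, gru_units=128):
    """Trace tensor shapes through LstmGruNet.forward (shape arithmetic only)."""
    shapes = {}
    shapes["embedding"] = (batch, seq_len, embedding_size)
    # Bidirectional layers concatenate forward and backward states.
    shapes["lstm"] = (batch, seq_len, 2 * lstm_units)
    shapes["gru"] = (batch, seq_len, 2 * gru_units)
    # Pooling over the time axis removes seq_len.
    shapes["avg_pool"] = (batch, 2 * gru_units)
    shapes["max_pool"] = (batch, 2 * gru_units)
    # Concatenating both pools gives gru_units * 4 = dense_hidden_units.
    shapes["h_conc"] = (batch, 4 * gru_units)
    # Main logit plus auxiliary logits, concatenated on the last axis.
    shapes["out"] = (batch, 1 + num_aux_targets)
    return shapes


# Hypothetical sizes, chosen only for illustration.
shapes = lstm_gru_shapes(batch=32, seq_len=220, num_aux_targets=6)
for name, shape in shapes.items():
    print(f"{name:9s} {shape}")
```

Because `linear1` and `linear2` map `dense_hidden_units` to `dense_hidden_units`, their outputs can be added to `h_conc` element-wise, which is what the `hidden = h_conc + ...` line relies on.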

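Model 2 LstmCapsuleAttenModel

This model is built from recurrent neural networks, a capsule network, and attention layers.

import torch
import torch.nn as nn

# Attention and Capsule are custom modules that are not shown in this post.


class LstmCapsuleAttenModel(nn.Module):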

    def __init__(self, embedding_matrix, maxlen=200, lstm_hidden_size=128, gru_hidden_size=128,
                 embedding_dropout=0.2, dropout1=0.2, dropout2=0.1, out_size=16,
                 num_capsule=5, dim_capsule=5, caps_out=1, caps_dropout=0.3):
        super(LstmCapsuleAttenModel, self).__init__()

        self.embedding = nn.Embedding(*embedding_matrix.shape)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = False
        self.embedding_dropout = nn.Dropout2d(embedding_dropout)

        self.lstm = nn.LSTM(embedding_matrix.shape[1], lstm_hidden_size, bidirectional=True, batch_first=True)
        self.gru = nn.GRU(lstm_hidden_size * 2, gru_hidden_size, bidirectional=True, batch_first=True)
        
        self.lstm_attention = Attention(lstm_hidden_size * 2, maxlen=maxlen)
        self.gru_attention = Attention(gru_hidden_size * 2, maxlen=maxlen)
        
        self.capsule = Capsule(input_dim_capsule=gru_hidden_size * 2,
                               num_capsule=num_capsule,
                               dim_capsule=dim_capsule)
        self.dropout_caps = nn.Dropout(caps_dropout)
        self.lin_caps = nn.Linear(num_capsule * dim_capsule, caps_out)

        self.norm = nn.LayerNorm(lstm_hidden_size * 2 + gru_hidden_size * 6 + caps_out)
        self.dropout1 = nn.Dropout(dropout1)
        self.linear = nn.Linear(lstm_hidden_size * 2 + gru_hidden_size * 6 + caps_out, out_size)
        self.dropout2 = nn.Dropout(dropout2)
        self.out = nn.Linear(out_size, 1)
        
    def apply_spatial_dropout(self, h_embedding):
        h_embedding = h_embedding.transpose(1, 2).unsqueeze(2)
        h_embedding = self.embedding_dropout(h_embedding).squeeze(2).transpose(1, 2)
        return h_embedding

    def forward(self, x):
        h_embedding = self.embedding(x)
        h_embedding = self.apply_spatial_dropout(h_embedding)

        h_lstm, _ = self.lstm(h_embedding)
        h_gru, _ = self.gru(h_lstm)
        
        h_lstm_atten = self.lstm_attention(h_lstm)
        h_gru_atten = self.gru_attention(h_gru)
        
        content3 = self.capsule(h_gru)
        batch_size = content3.size(0)
        content3 = content3.view(batch_size, -1)
        content3 = self.dropout_caps(content3)
        content3 = torch.relu(self.lin_caps(content3))

        avg_pool = torch.mean(h_gru, 1)
        max_pool, _ = torch.max(h_gru, 1)

        conc = torch.cat((h_lstm_atten, h_gru_atten, content3, avg_pool, max_pool), 1)
        conc = self.norm(conc)
        conc = self.dropout1(conc)
        conc = torch.relu(conc)
        conc = self.linear(conc)
        conc = self.dropout2(conc)
        out = self.out(conc)

        return out
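
The `apply_spatial_dropout` method above reshapes the embeddings so that `nn.Dropout2d` zeroes whole embedding channels rather than individual values. The following plain-Python sketch illustrates that behaviour for a single sample. It is not the PyTorch implementation: the function name `spatial_dropout` and the toy sizes are hypothetical, and the sketch assumes training mode with the usual 1 / (1 - p) rescaling.

```python
import random


def spatial_dropout(sequence, p, rng):
    """Drop whole embedding channels of one sample.

    sequence: list of time steps, each a list of `dim` floats.
    p: probability of dropping a channel, in [0, 1).
    rng: a random.Random instance, passed in for reproducibility.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    dim = len(sequence[0])
    # One keep/drop decision per channel, shared by every time step.
    keep = [rng.random() >= p for _ in range(dim)]
    scale = 1.0 / (1.0 - p)  # rescale the channels that are kept
    return [[x * scale if keep[j] else 0.0 for j, x in enumerate(step)]
            for step in sequence]


rng = random.Random(0)
seq = [[1.0] * 8 for _ in range(5)]  # 5 time steps, 8 channels
dropped = spatial_dropout(seq, p=0.5, rng=rng)

# Each channel is either zero at every time step or kept at every time step.
columns = list(zip(*dropped))
print([col[0] for col in columns])
```

Dropping a channel for the whole sequence removes one embedding dimension consistently. Element-wise dropout would instead remove different dimensions at different time steps.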

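Model 3 LstmConvModel

This model is built from an LSTM and a convolutional neural network. The code also stacks a GRU between the two.

import torch
import torch.nn as nn


class LstmConvModel(nn.Module):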

    def __init__(self, embedding_matrix, lstm_hidden_size=128, gru_hidden_size=128, n_channels=64,
                 embedding_dropout=0.2, out_size=20, out_dropout=0.1):
        super(LstmConvModel, self).__init__()

        self.embedding = nn.Embedding(*embedding_matrix.shape)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = False
        # Use the constructor argument instead of a hard-coded 0.2.
        self.embedding_dropout = nn.Dropout2d(embedding_dropout)

        self.lstm = nn.LSTM(embedding_matrix.shape[1], lstm_hidden_size, bidirectional=True, batch_first=True)
        self.gru = nn.GRU(lstm_hidden_size * 2, gru_hidden_size, bidirectional=True, batch_first=True)
        self.conv = nn.Conv1d(gru_hidden_size * 2, n_channels, 3, padding=2)
        nn.init.xavier_uniform_(self.conv.weight)

        self.linear = nn.Linear(n_channels * 2, out_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(out_dropout)
        self.out = nn.Linear(out_size, 1)

    def apply_spatial_dropout(self, h_embedding):
        h_embedding = h_embedding.transpose(1, 2).unsqueeze(2)
        h_embedding = self.embedding_dropout(h_embedding).squeeze(2).transpose(1, 2)
        return h_embedding

    def forward(self, x):
        h_embedding = self.embedding(x)
        h_embedding = self.apply_spatial_dropout(h_embedding)

        h_lstm, _ = self.lstm(h_embedding)
        h_gru, _ = self.gru(h_lstm)
        h_gru = h_gru.transpose(2, 1)
        conv = self.conv(h_gru)

        conv_avg_pool = torch.mean(conv, 2)
        conv_max_pool, _ = torch.max(conv, 2)

        conc = torch.cat((conv_avg_pool, conv_max_pool), 1)
        conc = self.relu(self.linear(conc))
        conc = self.dropout(conc)
        out = self.out(conc)

        return out
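
The convolution in this model uses a kernel size of 3 with `padding=2`, so its output is slightly longer than its input. The sketch below works through that arithmetic with the output-length formula documented for `torch.nn.Conv1d`. The helper name `conv1d_out_len` and the sequence length of 220 are hypothetical.

```python
def conv1d_out_len(length, kernel_size, padding=0, stride=1, dilation=1):
    """Output length of a 1-D convolution (formula from the Conv1d docs)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


seq_len = 220  # hypothetical padded sequence length
print(conv1d_out_len(seq_len, kernel_size=3, padding=2))  # 222
print(conv1d_out_len(seq_len, kernel_size=3, padding=1))  # 220

# Average and max pooling over the time axis each yield n_channels values,
# so the following linear layer takes n_channels * 2 inputs.
n_channels = 64
linear_in = 2 * n_channels
print(linear_in)  # 128
```

Since both pooling operations reduce over the time axis, the two extra positions introduced by `padding=2` do not change the size of the features passed to `self.linear`.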

Model 4 BERT & GPT-2

from pytorch_pretrained_bert import GPT2Model
import torch
from torch import nn


class GPT2ClassificationHeadModel(GPT2Model):

    def __init__(self, config, clf_dropout=0.4, n_class=8):
        super(GPT2ClassificationHeadModel, self).__init__(config)
        self.transformer = GPT2Model(config)
        self.dropout = nn.Dropout(clf_dropout)
        self.linear = nn.Linear(config.n_embd * 3, n_class)

        nn.init.normal_(self.linear.weight, std=0.02)
        nn.init.normal_(self.linear.bias, 0)
        
        self.apply(self.init_weights)

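    def forward(self, input_ids, position_ids=None, token_type_ids=None, lm_labels=None, past=None):
        # hidden_states: (batch, seq_len, n_embd). lm_labels is accepted but unused.
        hidden_states, presents = self.transformer(input_ids, position_ids, token_type_ids, past)
        # Pool over the sequence axis, then append the hidden state of the last position.
        avg_pool = torch.mean(hidden_states, 1)
        max_pool, _ = torch.max(hidden_states, 1)
        # Three n_embd-sized vectors, matching nn.Linear(config.n_embd * 3, n_class).
        h_conc = torch.cat((avg_pool, max_pool, hidden_states[:, -1, :]), 1)
        logits = self.linear(self.dropout(h_conc))
        return logits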
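
The classification head above concatenates three vectors per sequence (mean over time, max over time, and the last position), which is why the linear layer takes `config.n_embd * 3` inputs. A toy plain-Python version of that pooling for a single sequence might look like this. The function name `pool_hidden_states` and the toy numbers are hypothetical.

```python
def pool_hidden_states(hidden_states):
    """Pool one sequence of hidden vectors into a single feature vector.

    hidden_states: list of time steps, each a list of n_embd floats.
    Returns [mean over time] + [max over time] + [last time step].
    """
    steps = len(hidden_states)
    columns = list(zip(*hidden_states))  # one tuple per hidden dimension
    avg_pool = [sum(col) / steps for col in columns]
    max_pool = [max(col) for col in columns]
    last = list(hidden_states[-1])
    return avg_pool + max_pool + last


# 3 time steps, n_embd = 2
toy = [[1.0, -2.0],
       [3.0, 0.0],
       [2.0, 5.0]]
features = pool_hidden_states(toy)
print(features)  # [2.0, 1.0, 3.0, 5.0, 2.0, 5.0]
```

Index -1 refers to the last position of the padded sequence, so how the inputs are padded affects what the third part of the feature vector represents.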