When tackling supervised machine learning on a given dataset, practitioners typically try different algorithms and techniques in search of a model that produces a good general hypothesis, aiming to make the most accurate predictions possible about unseen data.
The same goes for text classification: we want to try different models when training a text classifier. Ask "which machine learning model is best?" and a data scientist will usually answer "it depends". That answer actually reflects rigor rather than evasion, because before running the experiments nobody can say for sure which algorithm will perform best.
This post walks through the text classification models in common use today and compares them after training, in the hope of helping you find the most accurate model for your own research or application.
The dataset
We use a large dataset of Stack Overflow questions and their tags. The data is publicly available on Google BigQuery, and you can also download it directly here.
First, let's take a look at the data:
import logging
import pandas as pd
import numpy as np
from numpy import random
import gensim
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
import re
from bs4 import BeautifulSoup
%matplotlib inline
df = pd.read_csv('stack-overflow-data.csv')
df = df[pd.notnull(df['tags'])]
print(df.head(10))
print(df['post'].apply(lambda x: len(x.split(' '))).sum())
10276752
That is the total number of words in the dataset. Next, let's visualize the distribution of the tags:
my_tags = ['java','html','asp.net','c#','ruby-on-rails','jquery','mysql','php','ios','javascript','python','c','css','android','iphone','sql','objective-c','c++','angularjs','.net']
plt.figure(figsize=(10,4))
df.tags.value_counts().plot(kind='bar');
The tags look well balanced across classes.
Now let's look at some post/tag pairs:
def print_plot(index):
    example = df[df.index == index][['post', 'tags']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Tag:', example[1])
print_plot(10)
print_plot(30)
This text data clearly needs preprocessing.
Text preprocessing
I won't walk through every preprocessing step in detail here. The exact steps vary a little from one text to another, but the safest approach is to work through all of them so the cleaning is thorough.
For this dataset, preprocessing includes HTML decoding, removing stop words, lowercasing the text, removing punctuation, and deleting bad characters (see the code below for the details).
REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))
def clean_text(text):
    """
    text: a string
    return: modified initial string
    """
    text = BeautifulSoup(text, "lxml").text # HTML decoding
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # delete stopwords from text
    return text
df['post'] = df['post'].apply(clean_text)
print_plot(10)
The text looks much cleaner now!
df['post'].apply(lambda x: len(x.split(' '))).sum()
3421180
After preprocessing, the dataset still contains over 3 million words.
Next we convert the text documents into a matrix of token counts (CountVectorizer), then transform that count matrix into a normalized tf-idf representation (TfidfTransformer). Finally we train several classifiers from the scikit-learn library.
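If the count-matrix => tf-idf step is new to you, here is a minimal sketch on a tiny made-up corpus (the two sample sentences are placeholders, not taken from the dataset):
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
toy_corpus = ["python list comprehension example", "java list of objects example"]
counts = CountVectorizer().fit_transform(toy_corpus)  # raw token counts, one row per document
tfidf = TfidfTransformer().fit_transform(counts)      # re-weight the counts into normalized tf-idf values
print(counts.toarray())
print(tfidf.toarray().round(2))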
X = df.post
y = df.tags
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)
Naive Bayes Classifier for Multinomial Models
As the baseline for this comparison, let's start with naive Bayes. scikit-learn offers several naive Bayes variants, and the multinomial one is the best fit for text classification. To make the vectorizer => transformer => classifier chain easier to work with, we use scikit-learn's Pipeline class, which composes them into one compound classifier:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
nb = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', MultinomialNB()),
])
nb.fit(X_train, y_train)
%%time
from sklearn.metrics import classification_report
y_pred = nb.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=my_tags))
As you can see, the accuracy is about 74%.
Linear Support Vector Machine
SVM is one of the most widely recognized text classification algorithms.
from sklearn.linear_model import SGDClassifier
sgd = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42, max_iter=5, tol=None)),
])
sgd.fit(X_train, y_train)
%%time
y_pred = sgd.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=my_tags))
SVM improves accuracy by about 5 percentage points over naive Bayes, to roughly 79%.
Logistic Regression
Logistic regression is a simple, easy-to-understand classification algorithm that generalizes naturally to multiple classes.
from sklearn.linear_model import LogisticRegression
logreg = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', LogisticRegression(n_jobs=1, C=1e5)),
])
logreg.fit(X_train, y_train)
%%time
y_pred = logreg.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=my_tags))
Slightly lower than SVM, but still about 4 percentage points above naive Bayes, at roughly 78%.
Next, let's see what more advanced approaches such as word embeddings and neural networks do for accuracy.
Word2vec and Logistic Regression
Word2vec, like doc2vec, belongs to the text preprocessing stage. Word2vec maps text to vectors of real numbers, and as a mapping it allows words with similar meanings to have similar vector representations.
The idea behind word2vec is simple: use the surrounding words to represent a target word, with the hidden layer of a neural network encoding that word representation.
First, let's load a word2vec model that has been pretrained on Google's 100-billion-word Google News corpus:
from gensim.models import Word2Vec
wv = gensim.models.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
wv.init_sims(replace=True)
Let's explore some of the vocabulary first:
from itertools import islice
list(islice(wv.vocab, 13030, 13050))
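To see the "similar meaning, similar vector" property mentioned above, you can also ask the loaded model for nearest neighbors. A quick sketch, where the query word 'php' is just an arbitrary example taken from our tag list:
print(wv.most_similar(positive=['php'], topn=5))  # the five words whose vectors lie closest to 'php'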
A common bag-of-words style approach is to average the word vectors of a document, and that is the approach we follow here.
def word_averaging(wv, words):
    all_words, mean = set(), []
    for word in words:
        if isinstance(word, np.ndarray):
            mean.append(word)
        elif word in wv.vocab:
            mean.append(wv.syn0norm[wv.vocab[word].index])
            all_words.add(wv.vocab[word].index)
    if not mean:
        logging.warning("cannot compute similarity with no input %s", words)
        # FIXME: remove these examples in pre-processing
        return np.zeros(wv.vector_size,)
    mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
    return mean
def word_averaging_list(wv, text_list):
    return np.vstack([word_averaging(wv, post) for post in text_list])
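As a quick sanity check on the helper above, a hypothetical call (the token list is made up) should return a single 300-dimensional unit-length vector:
sample_vec = word_averaging(wv, ['open', 'file', 'python'])  # averages the three word vectors, then normalizes
print(sample_vec.shape)  # (300,)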
We tokenize the text in the 'post' column and then apply word vector averaging to the tokenized posts:
def w2v_tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text, language='english'):
        for word in nltk.word_tokenize(sent, language='english'):
            if len(word) < 2:
                continue
            tokens.append(word)
    return tokens
train, test = train_test_split(df, test_size=0.3, random_state = 42)
test_tokenized = test.apply(lambda r: w2v_tokenize_text(r['post']), axis=1).values
train_tokenized = train.apply(lambda r: w2v_tokenize_text(r['post']), axis=1).values
X_train_word_average = word_averaging_list(wv,train_tokenized)
X_test_word_average = word_averaging_list(wv,test_tokenized)
Run this code and let's look at the result:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg = logreg.fit(X_train_word_average, train['tags'])
y_pred = logreg.predict(X_test_word_average)
print('accuracy %s' % accuracy_score(y_pred, test.tags))
print(classification_report(test.tags, y_pred,target_names=my_tags))
Only about 64% accuracy, which is surprisingly low.
Doc2vec and Logistic Regression
If you understand word2vec, doc2vec follows naturally: word2vec learns feature representations for words, while doc2vec learns them for documents or sentences. In the previous section we represented each document as the mathematical average of its word vectors; doc2vec instead extends those word-level relationships to the document level.
First we need to label the sentences. Gensim's Doc2Vec implementation requires each document or paragraph to carry a tag of its own, which we attach with TaggedDocument, using labels of the form 'TRAIN_i' or 'TEST_i', where i is a dummy index of the post.
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")
from gensim.models import Doc2Vec
from sklearn import utils
import gensim
from gensim.models.doc2vec import TaggedDocument
import re
def label_sentences(corpus, label_type):
    """
    Gensim's Doc2Vec implementation requires each document/paragraph to have a label associated with it.
    We do this by using the TaggedDocument method. The format will be "TRAIN_i" or "TEST_i" where "i" is
    a dummy index of the post.
    """
    labeled = []
    for i, v in enumerate(corpus):
        label = label_type + '_' + str(i)
        labeled.append(TaggedDocument(v.split(), [label]))
    return labeled
X_train, X_test, y_train, y_test = train_test_split(df.post, df.tags, random_state=0, test_size=0.3)
X_train = label_sentences(X_train, 'Train')
X_test = label_sentences(X_test, 'Test')
all_data = X_train + X_test
Let's take a look at the labeled documents:
all_data[:2]
When training the Doc2vec model, we tune the following parameters:
- dm=0: use the distributed bag of words (DBOW) training algorithm
- vector_size=300: 300-dimensional feature vectors
- negative=5: draw 5 "noise words" for negative sampling
- min_count=1: ignore all words with total frequency lower than 1
- alpha=0.065: the initial learning rate
We initialize the Doc2vec model and train it for 30 epochs, lowering the learning rate slightly after each epoch:
model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, min_count=1, alpha=0.065, min_alpha=0.065)
model_dbow.build_vocab([x for x in tqdm(all_data)])
for epoch in range(30):
    model_dbow.train(utils.shuffle([x for x in tqdm(all_data)]), total_examples=len(all_data), epochs=1)
    model_dbow.alpha -= 0.002
    model_dbow.min_alpha = model_dbow.alpha
Next, we extract the document vectors from the trained Doc2vec model:
def get_vectors(model, corpus_size, vectors_size, vectors_type):
    """
    Get vectors from trained doc2vec model
    :param model: Trained Doc2Vec model
    :param corpus_size: Size of the data
    :param vectors_size: Size of the embedding vectors
    :param vectors_type: Training or Testing vectors
    :return: list of vectors
    """
    vectors = np.zeros((corpus_size, vectors_size))
    for i in range(0, corpus_size):
        prefix = vectors_type + '_' + str(i)
        vectors[i] = model.docvecs[prefix]
    return vectors
train_vectors_dbow = get_vectors(model_dbow, len(X_train), 300, 'Train')
test_vectors_dbow = get_vectors(model_dbow, len(X_test), 300, 'Test')
Let's see the result!
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg = logreg.fit(train_vectors_dbow, y_train)
y_pred = logreg.predict(test_vectors_dbow)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=my_tags))
We hit 80%!
BOW with Keras
Finally, we'll run the bag-of-words approach through Keras, a deep learning framework for Python.
The code below is largely based on a Google workshop. If you're interested, study it closely; the steps are clearly commented, so I won't expand on them here.
import itertools
import os
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from sklearn.metrics import confusion_matrix
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.preprocessing import text, sequence
from keras import utils
train_size = int(len(df) * .7)  # 70/30 train/test split by position
train_posts = df['post'][:train_size]
train_tags = df['tags'][:train_size]
test_posts = df['post'][train_size:]
test_tags = df['tags'][train_size:]
max_words = 1000
tokenize = text.Tokenizer(num_words=max_words, char_level=False)
tokenize.fit_on_texts(train_posts) # only fit on train
x_train = tokenize.texts_to_matrix(train_posts)  # bag-of-words matrix over the 1000 most frequent words
x_test = tokenize.texts_to_matrix(test_posts)
encoder = LabelEncoder()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)  # map tag strings to integer class ids
y_test = encoder.transform(test_tags)
num_classes = np.max(y_train) + 1
y_train = utils.to_categorical(y_train, num_classes)  # one-hot encode the labels
y_test = utils.to_categorical(y_test, num_classes)
batch_size = 32
epochs = 2
# Build the model
model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
history = model.fit(x_train, y_train,
batch_size=batch_size,
epochs=epochs,
verbose=1,
validation_split=0.1)
Without further ado, let's look at the result:
score = model.evaluate(x_test, y_test,
batch_size=batch_size, verbose=1)
print('Test accuracy:', score[1])
79.5%!
Those are the accuracy scores for all of the models. Of course, the numbers will shift with a different dataset, so which text classification model is best? The answer is practice: only after trying them on your own data can you make the call.