1. This post builds on the previous article and reproduces the method behind the original paper of "Implementing a CNN for Text Classification in TensorFlow"; it then also tries removing the embedding layer and feeding pretrained Word2Vec vectors instead.
Following the paper's multi-width kernel design, six convolution kernels are used for feature extraction: two of size 2*embed_size, two of size 3*embed_size, and two of size 4*embed_size. The paper is reproduced in full here, with slight parameter adjustments.
2. The complete training procedure is as follows (no class encapsulation):
(1) Read the data and convert each character to its id:
# Load the vocabulary and the train / test splits of the cnews data;
# each line is "label<separator>content".
with open('./cnews/cnews.vocab.txt', encoding='utf8') as file:
    vocabulary_list = [k.strip() for k in file.readlines()]
with open('./cnews/cnews.train.txt', encoding='utf8') as file:
    line_list = [k.strip() for k in file.readlines()]
train_label_list = [k.split(maxsplit=1)[0] for k in line_list]
train_content_list = [k.split(maxsplit=1)[1] for k in line_list]
with open('./cnews/cnews.test.txt', encoding='utf8') as file:
    line_list = [k.strip() for k in file.readlines()]
test_label_list = [k.split(maxsplit=1)[0] for k in line_list]
test_content_list = [k.split(maxsplit=1)[1] for k in line_list]

# Map each character to its row index in the vocabulary.
word2id_dict = dict((b, a) for a, b in enumerate(vocabulary_list))

def content2vector(content_list):
    # Turn each content string into a list of character ids, falling back
    # to the <PAD> id for out-of-vocabulary characters.
    content_vector_list = []
    for content in content_list:
        content_vector = []
        for word in content:
            if word in word2id_dict:
                content_vector.append(word2id_dict[word])
            else:
                content_vector.append(word2id_dict['<PAD>'])
        content_vector_list.append(content_vector)
    return content_vector_list

train_vector_list = content2vector(train_content_list)
test_vector_list = content2vector(test_content_list)
(2) Pad and truncate every id sequence to a uniform length:
import tensorflow.contrib.keras as kr
# With the Keras defaults padding='pre' and truncating='pre', short
# sequences are zero-padded at the front and long ones keep their last 600 ids.
train_X = kr.preprocessing.sequence.pad_sequences(train_vector_list, 600)
test_X = kr.preprocessing.sequence.pad_sequences(test_vector_list, 600)
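A quick check of that padding behavior (a minimal sketch reusing the kr import above):

demo = kr.preprocessing.sequence.pad_sequences([[1, 2, 3]], 5)
print(demo)  # [[0 0 1 2 3]] -- zeros prepended, original ids kept at the end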
(3) Initialize the hyperparameters:
vocab_size = 5000
kernel_sizes = [2, 2, 3, 3, 4, 4]   # widths of the six convolution kernels
dropout_keep_prob = 0.5
num_kernels = 128                   # feature maps per kernel
batch_size = 64
seq_length = 600
embed_size = 128
hidden_dim = 256
num_classes = 10
learning_rate = 1e-3
embedding_dim = 128                 # word-vector dimension (same as embed_size)
(4) Set up the placeholders and the embedding layer:
import tensorflow as tf
X_holder = tf.placeholder(tf.int32, [None, seq_length])
Y_holder = tf.placeholder(tf.float32, [None, num_classes])
# The trailing 1 makes the lookup result [batch, seq_length, embedding_dim, 1],
# i.e. a single-channel "image" ready for conv2d.
embedding = tf.get_variable('embedding', [vocab_size, embedding_dim, 1])
embedding_inputs = tf.nn.embedding_lookup(embedding, X_holder)
(5) Build the intermediate convolutional layers:
def conv_pool_concate(X_holder, kernel_sizes):
    output = []
    for kernel_size in kernel_sizes:
        # Each kernel spans the full embedding width, so it slides only
        # along the time (sequence) axis.
        conv2 = tf.layers.conv2d(inputs=X_holder,
                                 filters=num_kernels,
                                 kernel_size=[kernel_size, embed_size],
                                 use_bias=True,
                                 padding='VALID',
                                 bias_initializer=tf.zeros_initializer(),
                                 activation=tf.nn.relu)
        # Global max pooling over the remaining time steps:
        # [batch, seq_length - kernel_size + 1, 1, num_kernels] -> [batch, 1, 1, num_kernels].
        pool = tf.nn.max_pool(value=conv2,
                              ksize=[1, seq_length - kernel_size + 1, 1, 1],
                              strides=[1, 1, 1, 1],
                              padding='VALID')
        output.append(pool)
    # Concatenate the six pooled maps along the channel axis.
    concat_ = tf.concat(output, 3)
    return concat_
(6) Join the conv-pool features: the pooled maps are concatenated along the channel axis inside the function above, and the size-1 dimensions are then squeezed away:
concat = conv_pool_concate(embedding_inputs, kernel_sizes)
# [batch, 1, 1, 6 * num_kernels] -> [batch, 6 * num_kernels]
squeeze = tf.squeeze(concat, axis=[1, 2])
squeeze_dim = squeeze.get_shape().as_list()
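For orientation, a quick sanity check of the shapes produced by the graph above (with num_kernels = 128, six kernels concatenate to 6 * 128 = 768 channels):

print(concat.get_shape().as_list())   # [None, 1, 1, 768]
print(squeeze_dim)                    # [None, 768]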
(7) Build the rest of the model after the convolutional layers, much as in the earlier article; just keep the dimensions in mind:
full_connect1 = tf.layers.dense(squeeze, hidden_dim)
# Note: tf.contrib.layers.dropout defaults to is_training=True, so dropout
# stays active at evaluation time here as well.
full_connect1_dropout = tf.contrib.layers.dropout(full_connect1, keep_prob=dropout_keep_prob)
full_connect1_dropout_activate = tf.nn.relu(full_connect1_dropout)
full_connect2 = tf.layers.dense(full_connect1_dropout_activate, num_classes)
predict_y = tf.nn.softmax(full_connect2)
# Feed the raw logits to the loss; softmax_cross_entropy_with_logits_v2
# applies the softmax internally.
cross_entry = tf.nn.softmax_cross_entropy_with_logits_v2(labels=Y_holder, logits=full_connect2)
loss = tf.reduce_mean(cross_entry)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train = optimizer.minimize(loss)
correct = tf.equal(tf.argmax(Y_holder, 1), tf.argmax(predict_y, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
(8) Build the label targets: label-encode first, then one-hot:
from sklearn.preprocessing import LabelEncoder
from tensorflow.contrib import keras as kr
label = LabelEncoder()
train_Y = kr.utils.to_categorical(label.fit_transform(train_label_list), num_classes=num_classes)
# Reuse the encoder fitted on the training labels for the test labels.
test_Y = kr.utils.to_categorical(label.transform(test_label_list), num_classes=num_classes)
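A toy illustration of the encode-then-one-hot pipeline (hypothetical two-class labels): LabelEncoder maps sorted class names to integers, and to_categorical turns those integers into one-hot rows.

demo_ids = LabelEncoder().fit_transform(['体育', '财经', '体育'])  # [0 1 0]
print(kr.utils.to_categorical(demo_ids, num_classes=2))
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]]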
(9) Train the model:
import random

for i in range(5000):
    # Sample a random mini-batch from the training set.
    train_index = random.sample(list(range(len(train_Y))), k=batch_size)
    X = train_X[train_index]
    Y = train_Y[train_index]
    sess.run(train, feed_dict={X_holder: X, Y_holder: Y})
    step = i + 1
    if step % 100 == 0:
        # Evaluate on a random test mini-batch every 100 steps.
        test_index = random.sample(list(range(len(test_Y))), k=batch_size)
        x = test_X[test_index]
        y = test_Y[test_index]
        loss_value, accuracy_value = sess.run([loss, accuracy], {X_holder: x, Y_holder: y})
        print('step:%d loss:%.4f accuracy:%.4f' % (step, loss_value, accuracy_value))
Part of the training output:
step:1600 loss:0.3221 accuracy:0.9219
step:1700 loss:0.0952 accuracy:0.9531
step:1800 loss:0.0241 accuracy:1.0000
step:1900 loss:0.0666 accuracy:0.9688
step:2000 loss:0.2998 accuracy:0.9062
step:2100 loss:0.3763 accuracy:0.9062
step:2200 loss:0.0264 accuracy:1.0000
step:2300 loss:0.1595 accuracy:0.9531
step:2400 loss:0.0955 accuracy:0.9688
step:2500 loss:0.2045 accuracy:0.9531
step:2600 loss:0.0479 accuracy:0.9688
step:2700 loss:0.0360 accuracy:0.9844
(10) Confusion matrix and evaluation metrics:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Predict the test set in chunks of 200 to keep memory usage low.
predict_value = []
for i in range(0, len(test_X), 200):
    x_test = test_X[i:i + 200]
    predict = np.array(sess.run(predict_y, feed_dict={X_holder: x_test}))
    predict_ = np.argmax(predict, 1)
    predict_value.extend(predict_)
predict_label = label.inverse_transform(predict_value)
df = pd.DataFrame(confusion_matrix(test_label_list, predict_label),
                  columns=label.classes_, index=label.classes_)
print(df)
from sklearn.metrics import precision_recall_fscore_support

def eval_model(y_true, y_pred, labels):
    # Per-class precision, recall, F1 and support.
    p, r, f1, s = precision_recall_fscore_support(y_true, y_pred)
    # Support-weighted overall averages.
    tot_p = np.average(p, weights=s)
    tot_r = np.average(r, weights=s)
    tot_f1 = np.average(f1, weights=s)
    tot_s = np.sum(s)
    res1 = pd.DataFrame({
        u'Label': labels,
        u'Precision': p,
        u'Recall': r,
        u'F1': f1,
        u'Support': s
    })
    res2 = pd.DataFrame({
        u'Label': [u'总体'],
        u'Precision': [tot_p],
        u'Recall': [tot_r],
        u'F1': [tot_f1],
        u'Support': [tot_s]
    })
    res2.index = [999]
    res = pd.concat([res1, res2])
    return res[['Label', 'Precision', 'Recall', 'F1', 'Support']]

eval_model(test_label_list, predict_label, label.classes_)
The results:
体育 娱乐 家居 房产 教育 时尚 时政 游戏 科技 财经
体育 990 2 1 0 2 0 0 4 0 1
娱乐 1 973 5 0 4 3 1 8 4 1
家居 1 6 861 1 15 46 26 5 19 20
房产 0 2 0 996 0 0 0 0 0 2
教育 0 34 7 0 821 19 20 60 28 11
时尚 1 8 2 0 3 978 1 3 3 1
时政 0 7 4 1 26 2 931 3 10 16
游戏 0 10 1 0 0 10 1 974 1 3
科技 0 2 5 0 0 1 4 11 975 2
财经 0 1 0 0 4 0 5 0 4 986
Label Precision Recall F1 Support
0 体育 0.996979 0.9900 0.993477 1000
1 娱乐 0.931100 0.9730 0.951589 1000
2 家居 0.971783 0.8610 0.913043 1000
3 房产 0.997996 0.9960 0.996997 1000
4 教育 0.938286 0.8210 0.875733 1000
5 时尚 0.923513 0.9780 0.949976 1000
6 时政 0.941355 0.9310 0.936149 1000
7 游戏 0.911985 0.9740 0.941973 1000
8 科技 0.933908 0.9750 0.954012 1000
9 财经 0.945350 0.9860 0.965247 1000
999 总体 0.949226 0.9485 0.947820 10000
Judging from these results the model trains well: precision and recall both reach 94%+, and the F1 score is also 94%+, good enough for practical use; with some tuning in between, the numbers could likely go higher. :)
3. Now for the failed attempt: the author replaced the embedding layer with Word2Vec, i.e. fed the model precomputed word vectors from the start. The result was dismal :( — no matter how long the model trained, the final accuracy never exceeded 30%.
The word vectors keep the same dimensionality, 128, but are trained in advance with word2vec; each sentence is padded or truncated as above to a uniform length of 600. The word2vec training corpus is the full training set plus the test set, to ensure complete vocabulary coverage.
Since much of the code repeats what came before, only the key functions and the training loop are shown:
import numpy as np

vocabulary_set = set(vocabulary_list)  # O(1) membership tests instead of list scans

def contentToList(contents):
    # Replace out-of-vocabulary characters with '<PAD>', then left-truncate /
    # left-pad every sentence to exactly 600 tokens (mirroring pad_sequences).
    content_list = []
    for cont in contents:
        cont_i = []
        for word in cont:
            if word not in vocabulary_set:
                cont_i.append('<PAD>')
            else:
                cont_i.append(word)
        if len(cont_i) >= 600:
            cont_i = cont_i[-600:]
        else:
            while len(cont_i) < 600:
                cont_i.insert(0, '<PAD>')
        content_list.append(cont_i)
    return np.array(content_list)
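The corpus fed to Word2Vec is not shown in the original code; a plausible construction, per the description above (character-level token lists over the full train + test contents — an assumption on my part), would be:

# Assumed corpus construction: one list of characters per document,
# covering both splits so no vocabulary is missed.
corpus = [list(content) for content in train_content_list + test_content_list]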
from gensim.models.word2vec import Word2Vec
import os

# Train 128-dim character vectors on the combined corpus.
model = Word2Vec(corpus, size=128, window=5, min_count=5, workers=4)

# Persist the model for later reuse.
output_dir = u'output_word2vec'
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
model_file = os.path.join(output_dir, 'model.w2v')
model.save(model_file)
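The saved model can later be reloaded without retraining (standard gensim API):

# Reload the persisted word2vec model in a later session.
model = Word2Vec.load(model_file)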
def transform_to_matrix(x, padding_size=600, vec_size=128):
    # Generator version: yield one (600, 128, 1) matrix at a time instead of
    # materializing the whole (N, 600, 128, 1) array.
    for sen in x:
        mat = []
        for i in range(padding_size):
            # <PAD> tokens, and characters that word2vec dropped via
            # min_count, map to the zero vector.
            if sen[i] == '<PAD>' or sen[i] not in model.wv:
                mat.append([0] * vec_size)
            else:
                mat.append(model.wv[sen[i]].tolist())
        matrix = np.array(mat)
        matrix = matrix.reshape(matrix.shape[0], matrix.shape[1], 1)
        yield matrix
    # The earlier one-shot version collected every matrix before reshaping:
    # res.append(mat) inside the loop, then
    # matrix = np.array(res)
    # matrix = matrix.reshape(matrix.shape[0], matrix.shape[1], matrix.shape[2], 1)
    # return matrix

# The placeholders now take the precomputed word-vector "images" directly,
# so no embedding layer is needed.
X_holder = tf.placeholder(tf.float32, [None, seq_length, embed_size, 1])
Y_holder = tf.placeholder(tf.float32, [None, num_classes])
Converting all the word vectors in one go would consume an enormous amount of memory, which is why the conversion function above is written as a generator: it yields one matrix at a time, so memory never blows up. (The author's machine blue-screened three times over this. :)
With everything else unchanged, on to training:
train_list = contentToList(train_content_list)
test_list = contentToList(test_content_list)
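Before launching the loop, it helps to confirm the generator yields one matrix of the expected shape at a time (a small check using the objects defined above):

sample = next(transform_to_matrix(train_list[:1]))
print(sample.shape)  # (600, 128, 1): one sentence as a single-channel "image"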
import random

# Build the word-vector batch separately for each mini-batch: converting the
# whole dataset at once needs far too much memory, so each batch is assembled
# on the fly. This keeps memory bounded but makes training noticeably slower.
# (A per-sample variant, feeding one yielded matrix at a time, was also tried;
# the batched version is kept here.)
for i in range(5000):
    train_index = random.sample(list(range(len(train_Y))), k=batch_size)
    x_generator = transform_to_matrix(train_list[train_index])
    y_train = train_Y[train_index]
    Y = y_train
    X = np.array(list(x_generator))   # materialize only this batch
    sess.run(train, feed_dict={X_holder: X, Y_holder: Y})
    step = i + 1
    if step % 100 == 0:
        test_index = random.sample(list(range(len(test_Y))), k=batch_size)
        x_generator_test = transform_to_matrix(test_list[test_index])
        y_test = test_Y[test_index]
        y = y_test
        x = np.array(list(x_generator_test))
        loss_value, accuracy_value = sess.run([loss, accuracy], {X_holder: x, Y_holder: y})
        print('step:%d loss:%.4f accuracy:%.4f' % (step, loss_value, accuracy_value))
The results (ps: seeing these numbers, I had no choice but to cut training short..):
step:100 loss:2.3006 accuracy:0.0938
step:200 loss:2.3043 accuracy:0.0312
step:300 loss:2.3036 accuracy:0.1250
step:400 loss:2.3050 accuracy:0.0938
step:500 loss:2.3048 accuracy:0.1250
step:600 loss:2.3015 accuracy:0.1250
step:700 loss:2.2998 accuracy:0.1250
step:800 loss:2.3049 accuracy:0.0312
step:900 loss:2.3027 accuracy:0.0312
step:1000 loss:2.2990 accuracy:0.1875
step:1100 loss:2.3037 accuracy:0.0312
step:1200 loss:2.3037 accuracy:0.0312
step:1300 loss:2.2965 accuracy:0.2500
step:1400 loss:2.3021 accuracy:0.0625
The author also tried summing all the word vectors together before training, but that fared no better. This post stands as a record of the attempt.
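For the record, a minimal sketch of what "summing the word vectors" plausibly looks like (my reconstruction under the names used above, not the author's exact code):

def sentence_vector(sen, vec_size=128):
    # Collapse one padded sentence into a single 128-dim feature by summing
    # the word2vec vectors; <PAD> and out-of-vocabulary tokens contribute nothing.
    vec = np.zeros(vec_size)
    for word in sen:
        if word != '<PAD>' and word in model.wv:
            vec += model.wv[word]
    return vec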