Introduction

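This post trains and tests an existing text classification algorithm from Tsinghua University. Two different laptops were used for training: one with 8 GB of RAM and an ordinary hard disk, and a ThinkPad T580 with 16 GB of RAM and an i7 processor.

Comparing the two showed a clear difference in processing performance between the hardware setups. Training a neural network needs a capable machine; otherwise it is very inefficient.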

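Dataset

This post uses a subset of the THUCNews news text classification dataset provided by the Tsinghua NLP group (the original dataset has about 740,000 documents and takes a long time to train on). Please download the dataset yourself from "THUCTC: an efficient Chinese text classification toolkit", and follow the data provider's open-source license.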

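#Data split
This training run uses 10 of the categories, with 6,500 items per category, for a total of 65,000 news items.
The dataset is split as follows:
Training set: 5000*10
Validation set: 500*10
Test set: 1000*10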

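For how the subset is generated from the original dataset, see the two scripts under helper. copy_data.sh copies 6,500 files from each category, and cnews_group.py merges multiple files into a single file. Running it produces three data files:
cnews.train.txt: training set (50,000 items)
cnews.val.txt: validation set (5,000 items)
cnews.test.txt: test set (10,000 items)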
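The split sizes follow directly from the per-category counts. The snippet below is a minimal sketch of a per-category split, assuming each category's 6,500 documents are simply sliced in order; the function name `split_category` and the placeholder document ids are hypothetical and are not taken from the helper scripts.

```python
def split_category(items, n_train=5000, n_val=500, n_test=1000):
    """Slice one category's documents into train / validation / test parts."""
    assert len(items) >= n_train + n_val + n_test, "not enough documents"
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


num_classes = 10
docs = [f"doc_{i}" for i in range(6500)]  # placeholder ids for one category

train, val, test = split_category(docs)
totals = {
    "train": len(train) * num_classes,
    "val": len(val) * num_classes,
    "test": len(test) * num_classes,
}
print(totals)  # {'train': 50000, 'val': 5000, 'test': 10000}
```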

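Preprocessing

data/cnews_loader.py is the data preprocessing file.
read_file(): reads the file data;
build_vocab(): builds the vocabulary using a character-level representation; this function saves the vocabulary to disk so it does not have to be rebuilt every time;
read_vocab(): reads the vocabulary saved in the previous step and converts it to a {word: id} mapping;
read_category(): fixes the list of categories and converts it to a {category: id} mapping;
to_words(): converts a sample represented by ids back into text;
process_file(): converts the dataset from text into fixed-length id sequences;
batch_iter(): prepares shuffled batches of data for training the neural network.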
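The steps above can be sketched with the standard library alone. This is an illustration, not the code in cnews_loader.py: the names `build_char_vocab`, `encode`, and `iter_batches` are hypothetical, and the choices of a `<PAD>` token at id 0, dropping out-of-vocabulary characters, and padding on the right are assumptions.

```python
import random
from collections import Counter

PAD = "<PAD>"  # assumed padding token, reserved at id 0


def build_char_vocab(texts, vocab_size=5000):
    """Keep the most frequent characters; id 0 is reserved for padding."""
    counter = Counter(ch for text in texts for ch in text)
    chars = [ch for ch, _ in counter.most_common(vocab_size - 1)]
    return {ch: i for i, ch in enumerate([PAD] + chars)}


def encode(text, char_to_id, seq_length=600):
    """Map characters to ids, drop unknown ones, then truncate or right-pad."""
    ids = [char_to_id[ch] for ch in text if ch in char_to_id][:seq_length]
    return ids + [char_to_id[PAD]] * (seq_length - len(ids))


def iter_batches(x, y, batch_size=64, seed=None):
    """Yield shuffled (x_batch, y_batch) pairs that cover the data once."""
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield [x[i] for i in idx], [y[i] for i in idx]


texts = ["体育新闻", "财经新闻报道"]
labels = [0, 1]
vocab = build_char_vocab(texts)
x = [encode(t, vocab, seq_length=8) for t in texts]
batches = list(iter_batches(x, labels, batch_size=1, seed=0))
```

Every encoded row has the same length, which is what lets the samples be stacked into one batch for the network.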

CNN (convolutional neural network)

Configuration
The configurable parameters of the CNN are shown below; they are defined in cnn_model.py.

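class TCNNConfig(object):
    """CNN configuration parameters"""

    embedding_dim = 64      # embedding dimension
    seq_length = 600        # sequence length
    num_classes = 10        # number of classes
    num_filters = 128       # number of convolution filters
    kernel_size = 5         # convolution kernel size
    vocab_size = 5000       # vocabulary size

    hidden_dim = 128        # number of units in the fully connected layer

    dropout_keep_prob = 0.5 # dropout keep probability
    learning_rate = 1e-3    # learning rate

    batch_size = 64         # training batch size
    num_epochs = 10         # total number of epochs

    print_per_batch = 100   # print results every this many batches
    save_per_batch = 10     # write to TensorBoard every this many batches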
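As a quick sanity check on these values, the number of batches per epoch can be worked out from the training set size of 50,000. The sketch below assumes the last, partially filled batch of each epoch is still used, hence the rounding up.

```python
import math

train_size = 50000  # size of the training set
batch_size = 64     # TCNNConfig.batch_size
num_epochs = 10     # TCNNConfig.num_epochs

# Assumption: the final partial batch of each epoch is kept.
batches_per_epoch = math.ceil(train_size / batch_size)
max_iterations = batches_per_epoch * num_epochs

print(batches_per_epoch)  # 782
print(max_iterations)     # 7820
```

With roughly 782 batches per epoch, a new epoch should start shortly before iteration 800, 1600, 2400, and so on.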

#CNN model

See the implementation in cnn_model.py for details.

The overall structure is roughly as follows:

[Figure: CNN model structure]
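To make the configuration values more concrete, the following sketch walks through tensor shapes and parameter counts for one plausible layer stack. The stack itself (embedding, one 1D convolution without padding, global max pooling, one fully connected layer, output layer) is an assumption based on the names of the configuration fields; the authoritative definition is in cnn_model.py.

```python
# Values taken from TCNNConfig
embedding_dim, seq_length, num_classes = 64, 600, 10
num_filters, kernel_size, vocab_size, hidden_dim = 128, 5, 5000, 128

# Assumed stack: embedding -> Conv1D (no padding, stride 1)
# -> global max pooling -> dense hidden layer -> output logits
conv_out_length = seq_length - kernel_size + 1

shapes = [
    ("input ids", (seq_length,)),
    ("embedding", (seq_length, embedding_dim)),
    ("conv1d", (conv_out_length, num_filters)),
    ("global max pool", (num_filters,)),
    ("dense hidden", (hidden_dim,)),
    ("logits", (num_classes,)),
]

# Weights plus biases for each assumed layer
params = {
    "embedding": vocab_size * embedding_dim,
    "conv1d": kernel_size * embedding_dim * num_filters + num_filters,
    "dense hidden": num_filters * hidden_dim + hidden_dim,
    "logits": hidden_dim * num_classes + num_classes,
}
total_params = sum(params.values())

for name, shape in shapes:
    print(f"{name:16s} {shape}")
print("total parameters:", total_params)
```

Under these assumptions most of the parameters sit in the embedding table, which is typical for small text CNNs.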

Training and validation

Run python run_cnn.py train to start training.

Training and evaluating...
Epoch: 1
Iter:      0, Train Loss:    2.3, Train Acc:  12.50%, Val Loss:    2.3, Val Acc:  10.02%, Time: 0:00:04 *
Iter:    100, Train Loss:    1.0, Train Acc:  68.75%, Val Loss:    1.2, Val Acc:  65.78%, Time: 0:00:23 *
Iter:    200, Train Loss:   0.36, Train Acc:  89.06%, Val Loss:   0.66, Val Acc:  80.04%, Time: 0:00:47 *
Iter:    300, Train Loss:   0.28, Train Acc:  90.62%, Val Loss:   0.43, Val Acc:  87.36%, Time: 0:01:11 *
Iter:    400, Train Loss:   0.17, Train Acc:  90.62%, Val Loss:   0.38, Val Acc:  89.28%, Time: 0:01:35 *
Iter:    500, Train Loss:   0.12, Train Acc:  96.88%, Val Loss:   0.27, Val Acc:  92.72%, Time: 0:01:59 *
Iter:    600, Train Loss:   0.22, Train Acc:  90.62%, Val Loss:   0.32, Val Acc:  91.94%, Time: 0:02:22
Iter:    700, Train Loss:   0.11, Train Acc:  95.31%, Val Loss:   0.27, Val Acc:  92.24%, Time: 0:02:47
Epoch: 2
Iter:    800, Train Loss:    0.1, Train Acc:  95.31%, Val Loss:   0.24, Val Acc:  92.86%, Time: 0:03:11 *
Iter:    900, Train Loss:   0.18, Train Acc:  95.31%, Val Loss:   0.21, Val Acc:  93.96%, Time: 0:03:35 *
Iter:   1000, Train Loss:  0.074, Train Acc:  96.88%, Val Loss:   0.24, Val Acc:  93.28%, Time: 0:03:58
Iter:   1100, Train Loss:   0.13, Train Acc:  96.88%, Val Loss:   0.22, Val Acc:  93.92%, Time: 0:04:22
Iter:   1200, Train Loss:   0.09, Train Acc:  95.31%, Val Loss:   0.21, Val Acc:  93.96%, Time: 0:04:45
Iter:   1300, Train Loss:  0.073, Train Acc:  98.44%, Val Loss:   0.24, Val Acc:  92.84%, Time: 0:05:09
Iter:   1400, Train Loss:   0.12, Train Acc:  96.88%, Val Loss:    0.2, Val Acc:  94.70%, Time: 0:05:32 *
Iter:   1500, Train Loss:  0.048, Train Acc:  98.44%, Val Loss:   0.21, Val Acc:  94.38%, Time: 0:05:56
Epoch: 3
Iter:   1600, Train Loss:  0.053, Train Acc: 100.00%, Val Loss:   0.21, Val Acc:  93.92%, Time: 0:06:19
Iter:   1700, Train Loss:   0.17, Train Acc:  93.75%, Val Loss:   0.19, Val Acc:  94.82%, Time: 0:06:43 *
Iter:   1800, Train Loss:   0.14, Train Acc:  93.75%, Val Loss:   0.18, Val Acc:  94.76%, Time: 0:07:07
Iter:   1900, Train Loss:  0.098, Train Acc:  95.31%, Val Loss:   0.19, Val Acc:  94.50%, Time: 0:07:32
Iter:   2000, Train Loss:  0.041, Train Acc:  98.44%, Val Loss:   0.17, Val Acc:  95.30%, Time: 0:07:56 *
Iter:   2100, Train Loss:   0.01, Train Acc: 100.00%, Val Loss:   0.19, Val Acc:  94.98%, Time: 0:08:20
Iter:   2200, Train Loss:  0.069, Train Acc:  98.44%, Val Loss:   0.21, Val Acc:  93.90%, Time: 0:08:44
Iter:   2300, Train Loss:  0.067, Train Acc:  98.44%, Val Loss:    0.2, Val Acc:  93.36%, Time: 0:09:07
Epoch: 4
Iter:   2400, Train Loss:  0.045, Train Acc:  96.88%, Val Loss:   0.19, Val Acc:  94.86%, Time: 0:09:31
Iter:   2500, Train Loss:  0.036, Train Acc:  98.44%, Val Loss:   0.18, Val Acc:  94.64%, Time: 0:09:54
Iter:   2600, Train Loss:  0.029, Train Acc:  98.44%, Val Loss:   0.19, Val Acc:  94.84%, Time: 0:10:18
Iter:   2700, Train Loss:  0.015, Train Acc: 100.00%, Val Loss:   0.22, Val Acc:  94.26%, Time: 0:10:42
Iter:   2800, Train Loss:  0.024, Train Acc: 100.00%, Val Loss:   0.21, Val Acc:  94.58%, Time: 0:11:10
Iter:   2900, Train Loss: 0.0021, Train Acc: 100.00%, Val Loss:   0.21, Val Acc:  94.24%, Time: 0:11:32
Iter:   3000, Train Loss:   0.14, Train Acc:  98.44%, Val Loss:   0.26, Val Acc:  92.06%, Time: 0:11:56
No optimization for a long time, auto-stopping...
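In the log, lines ending in * are the ones where validation accuracy reaches a new best, and training stops once no new best has appeared for a while. The sketch below replays the validation accuracies from the log through a simple early-stopping rule. The patience of 1000 iterations is inferred from the log (last improvement at iteration 2000, stop at 3000) and is an assumption, as is the function name `early_stopping`.

```python
# Validation accuracy (%) at iterations 0, 100, ..., 3000, copied from the log
val_acc = [
    10.02, 65.78, 80.04, 87.36, 89.28, 92.72, 91.94, 92.24,
    92.86, 93.96, 93.28, 93.92, 93.96, 92.84, 94.70, 94.38,
    93.92, 94.82, 94.76, 94.50, 95.30, 94.98, 93.90, 93.36,
    94.86, 94.64, 94.84, 94.26, 94.58, 94.24, 92.06,
]


def early_stopping(accuracies, eval_every=100, patience=1000):
    """Return (best accuracy, best iteration, stop iteration, improved iterations)."""
    best_acc, last_improved = 0.0, 0
    improved_at = []
    for step, acc in enumerate(accuracies):
        iteration = step * eval_every
        if acc > best_acc:  # strictly better than the previous best
            best_acc, last_improved = acc, iteration
            improved_at.append(iteration)
        if iteration - last_improved >= patience:
            return best_acc, last_improved, iteration, improved_at
    return best_acc, last_improved, None, improved_at


best_acc, best_iter, stop_iter, improved_at = early_stopping(val_acc)
print(best_acc, best_iter, stop_iter)  # 95.3 2000 3000
```

The iterations collected in `improved_at` line up with the starred lines in the log.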

Testing

Run python run_cnn.py test to evaluate on the test set.

Testing...
Test Loss:   0.11, Test Acc:  96.65%
Precision, Recall and F1-Score...
             precision    recall  f1-score   support

         体育       0.99      0.99      0.99      1000
         财经       0.95      0.99      0.97      1000
         房产       1.00      1.00      1.00      1000
         家居       0.97      0.91      0.94      1000
         教育       0.92      0.93      0.93      1000
         科技       0.94      0.98      0.96      1000
         时尚       0.96      0.98      0.97      1000
         时政       0.98      0.92      0.95      1000
         游戏       0.97      0.98      0.98      1000
         娱乐       0.98      0.97      0.98      1000

avg / total       0.97      0.97      0.97     10000

Confusion Matrix...
[[995   0   0   0   3   1   0   1   0   0]
 [  0 995   1   0   1   1   0   2   0   0]
 [  0   0 996   3   1   0   0   0   0   0]
 [  1  18   1 914  15  14  18  10   5   4]
 [  1   6   0   9 932  23  10   6  10   3]
 [  0   0   0   1   4 984   3   0   8   0]
 [  1   0   0   6   6   2 977   0   1   7]
 [  1  20   2   3  38  16   0 917   1   2]
 [  1   3   0   0   4   2   4   0 984   2]
 [  2   2   1   7   4   3   5   0   5 971]]
Time usage: 0:00:17

Accuracy on the test set reaches 96.65%, and the precision, recall, and F1-score of every class exceed 0.9.

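The confusion matrix also shows that classification works very well: it is strongly diagonal, and most of the errors come from the 家居 (home), 教育 (education), and 时政 (politics) classes, which have between 68 and 86 misclassified samples each out of 1,000.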
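The reported figures can be recomputed from the confusion matrix printed above. This is a minimal sketch that assumes rows are true classes and columns are predicted classes, in the same order as the classification report; the per-class values it produces agree with the report under that assumption.

```python
# Confusion matrix copied from the test output
cm = [
    [995, 0, 0, 0, 3, 1, 0, 1, 0, 0],
    [0, 995, 1, 0, 1, 1, 0, 2, 0, 0],
    [0, 0, 996, 3, 1, 0, 0, 0, 0, 0],
    [1, 18, 1, 914, 15, 14, 18, 10, 5, 4],
    [1, 6, 0, 9, 932, 23, 10, 6, 10, 3],
    [0, 0, 0, 1, 4, 984, 3, 0, 8, 0],
    [1, 0, 0, 6, 6, 2, 977, 0, 1, 7],
    [1, 20, 2, 3, 38, 16, 0, 917, 1, 2],
    [1, 3, 0, 0, 4, 2, 4, 0, 984, 2],
    [2, 2, 1, 7, 4, 3, 5, 0, 5, 971],
]
n = len(cm)

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))  # diagonal = correct predictions
accuracy = correct / total

# Recall divides by the row sum (true class), precision by the column sum (predicted class)
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(n)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.9665
```

The diagonal sums to 9,665 out of 10,000 samples, which matches the Test Acc line in the test output.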

Reference: CNN字符级中文文本分类-基于TensorFlow实现 (character-level CNN for Chinese text classification, implemented in TensorFlow)