0509打卡

原创

晨旭不想写程序 2024-05-09 20:24:29 ©著作权

文章标签 数据集词向量池化 文章分类 深度学习人工智能

©著作权归作者所有：来自51CTO博客作者晨旭不想写程序的原创作品，请联系作者获取转载授权，否则将追究法律责任

对知识点进行了学习

中文情感分析

code environment

在 python3.x & Tensorflow1.x 下工作正常

语料的准备

语料的选择为 谭松波老师的评论语料，正负例各2000。属于较小的数据集

词向量的准备

本实验使用开源词向量chinese-word-vectors

选择知乎语料训练而成的Word Vector

模型：CNN

结构：

中文词Embedding
多个不同长度的定宽卷积核
最大池化层，每个滤波器输出仅取一个最大值
全连接

模型主要代码解析

数据预处理

def parse_fn(line_words, line_tag):
    # Encode in Bytes for TF
    words = [w.encode() for w in line_words.strip().split()]
    tag = line_tag.strip().encode()
    return words, tag

定义了一个名为`parse_fn`的函数，它接收`line_words`和`line_tag`作为输入。它将`line_words`字符串按空格分隔成单词列表，去除每个单词的首尾空白，并使用`encode()`方法将其编码为字节类型。同样地，`line_tag`也去除了首尾空白并编码为字节类型。函数返回一个元组`(words, tag)`。

def generator_fn(words, tags):
    with Path(words).open('r', encoding='utf-8') as f_words, Path(tags).open('r', encoding='utf-8') as f_tags:
        for line_words, line_tag in zip_words, f_tags):
            yield parse_fn(line_words, line_tag)

定义了一个名为`generator_fn`的函数，它接收`words`和`tags`作为输入。使用`Path`打开两个指定路径的文件以供读取（采用UTF-8编码）。通过使用`zip`同时迭代这两个文件的每行。对于每一行，使用`parse_fn`对`line_words`和`line_tag`进行处理，并使用`yield`返回结果。该函数作为生成器，产生包含`(words, tag)`元组的值。

def input_fn(words_path, tags_path, params=None, shuffle_and_repeat=False):
    params = params if params is not None else {}

这个函数定义了输入数据的函数。它接受`words_path`和`tags_path`作为输入参数，`params`用于指定其他参数，默认为`None`。

shapes = ([None], ())  # shape of every sample
types = (tf.string, tf.string)
defaults = ('<pad>', '')

定义了数据样本的形状（`shapes`），类型（`types`）和默认值（`defaults`）。每个样本由一个长度可变的字符串列表和一个字符串标签组成。

dataset = tf.data.Dataset.from_generator(
    functools.partial(generator_fn, words_path, tags_path),
    output_shapes=shapes, output_types=types).map(lambda w, t: (w[:params.get('nwords', 300)], t))

创建了一个`tf.data.Dataset`对象。使用`tf.data.Dataset.from_generator`从生成器函数`generator_fn`创建数据集。`functools.partial`将`words_path`和`tags_path`作为参数传递给`generator_fn`。指定了输出的形状和类型。通过`map`方法对每个样本进行处理，只保留前`params.get('nwords',300)`个单词，并保留原始的标签。

if shuffle_and_repeat:
    dataset = dataset.shuffle(params['buffer']).repeat(params['epochs'])

如果`shuffle_repeat`为`True`，则对数据集进行洗牌和重复操作。使用`shuffle`方法对数据集进行洗牌，并设置缓冲区大小为`params['buffer']`。然后使用`repeat`方法对数据集进行重复操作，重复次数为`params['epochs']`。

dataset = (dataset
           .padded_batch(params.get('batch_size', 20), ([params.get('nwords', 300)], ()), defaults)
           .prefetch(1))

对数据集进行填充批处理操作。使用`padded_batch`方法将数据集分成指定的批次大小，并在不同批次中对长度不同的样本进行填充。批次大小由`params.get('batch_size', 20)`指定，默认为20。填充的形状为`([params.get('nwords', 300)], ())`，其中`params.get('nwords', 300)`表示单词的最大数量，默认为300。未指定填充值的情况下，将使用默认值`('<pad>', '')`进行填充。最后，使用`prefetch`方法提前加载一批数据用于后续训练的处理。

return dataset

返回处理后的数据集。

模型定义

def model_fn(features, labels, mode, params):
    if isinstance(features, dict):
        features = features['words']

定义了模型函数`model_fn`，它接收`features`，`labels`，`mode`和`params`作为输入参数。如果`features`是一个字典，则将其赋值为`features['words']`。

dropout = params.get('dropout', 0.5)
training (mode == tf.estimator.ModeKeys.TRAIN)
vocab_words = tf.contrib.lookup.index_table_from_file(
    params['words'], num_oov_buckets=params['num_oov_buckets'])

从`params`中获取`dropout`的值（默认为0.5）。根据`mode`是否为训练模式，设置`training`变量为`True`或`False`。使用`tf.contrib.lookup.index_table_from_file`根据文件`params['words']`创建一个单词的索引表，并指定了OOV（Out-of-vocabulary）桶的数量为`params['num_oov_buckets']`。

with Path(params['tags']).open() as f:
    indices = [idx for idx, tag in enumerate(f)]
    num_tags = len(indices)

使用`Path`打开文件`params['tags']`，并遍历其中的行，获取每个标签的索引。计算得到标签的数量`num_tags`。

word_ids = vocab_words.lookup(features)
w2v = np.load(params['w2v'])['embeddings']
w2v_var = np.vstack([w2v, [[0.] * params['dim']]])
w2v_var = tf.Variable(w2v_var, dtype=tf.float32, trainable=False)
embeddings = tf.nn.embedding_lookup(w2v_var, word_ids)

使用单词索引表`vocab_words`将输入的`features`转换预训练的词向量`w2v`，并创建一个`tf.Variable`变量`w2v_var`保存词向量。使用`tf.nn.embedding_lookup`根据单词ID查找对应的词向量，并得到嵌入矩阵`embeddings`。

embeddings = tf.layers.dropout(embeddings, rate=dropout, training=training)
embeddings_expanded = tf.expand_dims(embeddings, -1)

使用`tf.layers.dropout`对嵌入矩阵`embeddings`进行丢弃操作，以减少过拟合。然后，使用`tf.expand_dims`在最后一维上添加一个维度。

pooled_outputs = []
for i, filter_size in enumerate(params['filter_sizes']):
    conv2 = tf.layers.conv2d(embeddings_expanded, params['num_filters'], kernel_size=[filter_size, params['dim']],
                             activation=tf.nn.relu, name='conv-{}'.format(i))
    pooled = tf.layers.max_pooling2d(inputs=conv2, pool_size=[params['nwords'] - filter_size + 1, 1],
                                     strides=[1, 1], name='pool-{}'.format(i))
    pooled_outputs.append(pooled)

定义一个空列表`pooled_outputs`，用于存储卷积层输出的池化结果。使用`tf.layers.conv2d`进行卷积操作，指定卷积核的大小为`[filter_size, params['dim']]`，激活函数为ReLU，并命名`tf.layers.max_pooling2d`进行最大池化操作，指定池化窗口的大小为`[params['nwords_size + 1, 1]`，步幅为`[ 1]`，命名为`'pool-{}'.format(i)`。将池化结果`pooled`加入到`pooled_outputs`列表中。

num_total_filters = params['num_filters'] * len(params['filter_sizes'])
h_poll = tf.concat(pooled_outputs, 3)
output = tf.reshape(h_poll, [-1,```
计算总的过滤器数量`num_total_filters`，即`params['num_filters']`乘以卷积核尺寸的种类数。使用`tf.concat`在第三维上将所有池化结果拼接起来，得形状为`[batch_size, nwords, num_total_filters]`的`h_pool`。然后，使用``将`h_pool`展平成形状为`[-1, num_total_filters]`的张量。最后，再次使用`tf.layers.dropout`对输出进行丢弃操作。

```python
logits = tf.layers.dense(output, num_tags)
pred_ids = tf.argmax(input=logits, axis=1)

使用全连接层`tf.layers.dense`将输出转换为具有`num_tags`个输出节点的向量。使用`tf.argmax`在`axis=1`上找出预测值索引，得到最终的预测标签索引`pred_ids`。

if mode == tf.estimator.ModeKeys.PREDICT:
    reversed_tags = tf.contrib.lookup.index_to_string_table_from_file(params['tags'])
    pred_labels = reversed_tags.lookup(tf.argmax(inputits, axis=1))
    predictions = {
        'classes_id': pred        'labels': pred_labels
    }
    return tf.estimator.EstimatorSpec(mode, predictions=predictions)
else:
    tags_table = tf.contrib.lookup.index_table_from_file(params['tags tags = tags_table.lookup(labels)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=tags, logits=logits)

如果是预测模式(`mode == tf.estimator.ModeKeys.PREDICT`)，则加载标签反查表`reversed_tags`，将预测标签索引转换为实际标签，并构建预测结果字典`predictions`。之后，返回一个`tf.estimator.EstimatorSpec`对象，其中包含预测结果。如果不是预测模式，则使用标签索引表`tags_table`将实际标签转换为索引形式，并通过`sparse_softmax_cross_entropy`计算损失值`loss`。

metrics = {
    'acc': tf.metrics.accuracy(tags, pred_ids),
    'precision': tf.metrics.precision(tags, pred_ids),
    'recall': tf.metrics.recall(tags, pred_ids)
}

for metric_name, op in metrics.items():
    tf.summary.scalar(metric_name, op[1])

定义了评估指标`metrics，包括准确率（'acc'），精确度（'precision'）和召回率（'recall'）。使用`tf.metrics.accuracy`、`tf.metrics.precision`和`tf.metrics.recall`计算这些指标，并将其添加到TensorBoard的摘要中。

if mode == tf.estimator.ModeKeys.TRAIN:
    train_op = tf.train.AdamOptimizer().minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
elif mode == tf.estimator.ModeKeys.EVAL:
    return tf.estimator.EstimatorSpec(
        mode, loss=loss, eval_metric_ops=metrics)

如果是训练模式，则创建一个Adam优化器并最小化损失函数。返回一个`tf.estimator.EstimatorSpec`对象，其中包括损失值`loss`和训练操作`train_op`。如果是评估模式，则返回一个包含损失值`loss`和评估指标`eval_metric_ops`的`tf.estimator.EstimatorSpec`对象。

训练过程

if __name__ == '__main__':
	params = {
    'dim': 300,
    'nwords': 300,
    'filter_sizes': [2, 3, 4],
    'num_filters': 64,
    'dropout': 0.6,
    'num_oov_buckets': 1,
    'epochs': 50,
    'batch_size': 20,
    'buffer': 3500,
    'words': str(Path(DATA_DIR, 'vocab.words.txt')),
    'tags': str(Path(DATA_DIR, 'vocab.labels.txt')),
    'w2v': str(Path(DATA_DIR, 'w2v.npz'))
}

定义一个名为`params`的字典，其中包含了一组模型超参数。这些超参数包括词向量的维度`dim`、句子最大长度`nwords`、卷积核大小的列表`filter_sizes`、卷积核数量`num_filters`、丢弃率`dropout`、OOV桶的数量`num_oov_buckets`、训练轮数`epochs`、批处理大小`batch_size`、缓冲区大小`buffer`以及用于存储词汇表、标签和词向量的文件路径。

with Path('../../../Mutual-AI/model/chinese_sentiment_analysis/params.json').open('w') as f:
    json.dump(params, f, indent=4, sort_keys=True)

使用`Path`打开文件`'../../../Mutual-AI/model/chinese_sentiment_analysis/params.json'`，以写式打开文件。然后使用`json.dump`以JSON格式写入文件中，缩进为4个空格，按键排序。

def fwords(name):
    return str(Path(DATA_DIR, '{}.words.txt'.format(name)))

定义了一个名为`fwords`的函数，它接收一个字符串参数`name`，并返回使用`name`构造路径的结果。路径由`DATA_DIR`和`{}.words.txt`组成，其中`{}`将被替换为函数的参数`name`。

def ftags(name):
    return str(Path(DATA_DIR, '{}.txt'.format(name)))

定义了一个名为`ftags`的，它接收一个字符串参数`name`，并返回使用`name`构造路径的结果。路径由`{}`将被替换为函数的参数`name`。

train_inpf = functools.partial(input_fn, fwords('train'),
                               params, shuffle_and_repeat=True)
eval_inpf = functools.partial(input_fn, fwords('eval'), ftags('eval'))

创建了两个特殊函数``eval_inpf`，它们是通过部分应input_fn`函数获得的。`train_inpf`用于训练数据集，会将`fwords('train')`和`ftags('train')`作为输入路径，并附带其他参数`params`和`shuffle_and_repeat=True`。`eval_inpf`用于评估数据集，会将`fwords('eval')`和`ftags('eval')`作为输入路径。

cfg = tf.estimator.RunConfig(save_checkpoints_secs=10)
estimator = tf.estimator.Estimator(model_fn, 'results/model', cfg, params)
Path(estimator.eval_dir()).mkdir(parents=True, exist_ok=True)

创建了一个`tf.estimator.RunConfig`对象`cfg`，用于配置训练的运行环境，设置每隔10秒保存检查点。然后，使用这个运行配置、模型函数`model_fn`、模型保存路径`'results/model'`和超参数`params`创建了一个`tf.estimator.Estimator`estimator.eval_dir()`获取评估结果的保存路径，并使用`Path`创建该路径的父目录，确保路径存在。

train_spec = tf.estimator.TrSpec(input_fn=train_inpf)
eval_spec = tf.EvalSpec(input_fn=eval_inpf, throttle_secs=10)

创建了训练规格`train_spec`和评估规格`eval_spec`。`TrainSpec`指`train_inpf`作为输入函数进行进行评估，其中还设置了每隔10秒进行一次.train_and_evaluate(estimator, train_spec, eval_spec) tf.estimator.train_and_evaluate`函数同时进行训和`eval_spec`传递给该函数。

def write_predictions(name):
    Path('results/score').mkdir(parents=True, exist_ok=True)
    with Path('results/score/{}.preds.txt'.format(name)).open('wb') as f:
        test_inpf = functools.partial(input_fn, fwords(name), ftags(name))
        golds_gen = generator_fn(fwords(name), ftags(name))
        preds_gen = estimator.predict(test_inpf)
        for golds, preds in zip(golds_gen, preds_gen):
            (words, tag) =
            f.write' '.join([tag, preds['labels'], b''.join(words)]) + b'\n')


for name in ['train', 'eval']:
    write_predictions(name)

定义了一个名为`write_predictions`的函数，用于将预测结果写入文件。先创建保存预测结果的目录`results/score`，然后使用`Path`打开文件`results/score/{}.preds.txt`进行写操作。在循环中，通过部分应用的方式创建输入函数`test_inpf`，并分别生成原始标签和预测结果的生成器。通过遍历这两个生成器，将数据写入文件中。

最后，对`['train', 'eval']`中的每个名称调用`write_predictions`函数，分别生成训练集和评估集的预测结果文件。

模型效果输出：

precision    recall  f1-score   support

         POS       0.91      0.87      0.89       400
         NEG       0.88      0.91      0.89       400

   micro avg       0.89      0.89      0.89       800
   macro avg       0.89      0.89      0.89       800
weighted avg       0.89      0.89      0.89       800

上一篇：0507打卡

下一篇：0510打卡

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯

0509打卡

0509打卡

中文情感分析

code environment

语料的准备

词向量的准备

模型：CNN

结构：

模型主要代码解析

数据预处理

这个函数定义了输入数据的函数。它接受words_path和tags_path作为输入参数，params用于指定其他参数，默认为None。

定义了数据样本的形状（shapes），类型（types）和默认值（defaults）。每个样本由一个长度可变的字符串列表和一个字符串标签组成。

如果shuffle_repeat为True，则对数据集进行洗牌和重复操作。使用shuffle方法对数据集进行洗牌，并设置缓冲区大小为params['buffer']。然后使用repeat方法对数据集进行重复操作，重复次数为params['epochs']。

返回处理后的数据集。

模型定义

定义了模型函数model_fn，它接收features，labels，mode和params作为输入参数。如果features是一个字典，则将其赋值为features['words']。

使用Path打开文件params['tags']，并遍历其中的行，获取每个标签的索引。计算得到标签的数量num_tags。

使用单词索引表vocab_words将输入的features转换预训练的词向量w2v，并创建一个tf.Variable变量w2v_var保存词向量。使用tf.nn.embedding_lookup根据单词ID查找对应的词向量，并得到嵌入矩阵embeddings。

使用tf.layers.dropout对嵌入矩阵embeddings进行丢弃操作，以减少过拟合。然后，使用tf.expand_dims在最后一维上添加一个维度。

使用全连接层tf.layers.dense将输出转换为具有num_tags个输出节点的向量。使用tf.argmax在axis=1上找出预测值索引，得到最终的预测标签索引pred_ids。

定义了评估指标metrics，包括准确率（'acc'），精确度（'precision'）和召回率（'recall'）。使用tf.metrics.accuracy、tf.metrics.precision和tf.metrics.recall`计算这些指标，并将其添加到TensorBoard的摘要中。

训练过程

使用Path打开文件'../../../Mutual-AI/model/chinese_sentiment_analysis/params.json'，以写式打开文件。然后使用json.dump以JSON格式写入文件中，缩进为4个空格，按键排序。

定义了一个名为fwords的函数，它接收一个字符串参数name，并返回使用name构造路径的结果。路径由DATA_DIR和{}.words.txt组成，其中{}将被替换为函数的参数name。

定义了一个名为ftags的，它接收一个字符串参数name，并返回使用name构造路径的结果。路径由{}将被替换为函数的参数name。

最后，对['train', 'eval']中的每个名称调用write_predictions函数，分别生成训练集和评估集的预测结果文件。

模型效果输出：

51CTO博客

这个函数定义了输入数据的函数。它接受`words_path`和`tags_path`作为输入参数，`params`用于指定其他参数，默认为`None`。

定义了数据样本的形状（`shapes`），类型（`types`）和默认值（`defaults`）。每个样本由一个长度可变的字符串列表和一个字符串标签组成。

如果`shuffle_repeat`为`True`，则对数据集进行洗牌和重复操作。使用`shuffle`方法对数据集进行洗牌，并设置缓冲区大小为`params['buffer']`。然后使用`repeat`方法对数据集进行重复操作，重复次数为`params['epochs']`。

定义了模型函数`model_fn`，它接收`features`，`labels`，`mode`和`params`作为输入参数。如果`features`是一个字典，则将其赋值为`features['words']`。

使用`Path`打开文件`params['tags']`，并遍历其中的行，获取每个标签的索引。计算得到标签的数量`num_tags`。

使用单词索引表`vocab_words`将输入的`features`转换预训练的词向量`w2v`，并创建一个`tf.Variable`变量`w2v_var`保存词向量。使用`tf.nn.embedding_lookup`根据单词ID查找对应的词向量，并得到嵌入矩阵`embeddings`。

使用`tf.layers.dropout`对嵌入矩阵`embeddings`进行丢弃操作，以减少过拟合。然后，使用`tf.expand_dims`在最后一维上添加一个维度。

使用全连接层`tf.layers.dense`将输出转换为具有`num_tags`个输出节点的向量。使用`tf.argmax`在`axis=1`上找出预测值索引，得到最终的预测标签索引`pred_ids`。

定义了评估指标`metrics，包括准确率（'acc'），精确度（'precision'）和召回率（'recall'）。使用`tf.metrics.accuracy`、`tf.metrics.precision`和`tf.metrics.recall`计算这些指标，并将其添加到TensorBoard的摘要中。

使用`Path`打开文件`'../../../Mutual-AI/model/chinese_sentiment_analysis/params.json'`，以写式打开文件。然后使用`json.dump`以JSON格式写入文件中，缩进为4个空格，按键排序。

定义了一个名为`fwords`的函数，它接收一个字符串参数`name`，并返回使用`name`构造路径的结果。路径由`DATA_DIR`和`{}.words.txt`组成，其中`{}`将被替换为函数的参数`name`。

定义了一个名为`ftags`的，它接收一个字符串参数`name`，并返回使用`name`构造路径的结果。路径由`{}`将被替换为函数的参数`name`。

最后，对`['train', 'eval']`中的每个名称调用`write_predictions`函数，分别生成训练集和评估集的预测结果文件。