Keras text sequences: text vectorization and tokenization (Part 2) (using pretrained word embeddings)


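六mo神剑 2022-07-18 15:10:43 Category: Neural Networks

© Copyright belongs to the author. This is an original work by 51CTO blog author 六mo神剑; contact the author for permission to repost, otherwise legal liability may be pursued.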


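Summary
- Convert raw text into a format that a neural network can process.
- Use the Embedding layer in a Keras model to learn task-specific token embeddings.
- Use pretrained word embeddings to get an extra performance boost on small natural-language-processing problems.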

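Listing 6-8: Processing the labels of the raw IMDB data
Listing 6-9: Tokenizing the text of the raw IMDB data
Listing 6-10: Parsing the GloVe word-embeddings file
Listing 6-11: Preparing the GloVe word-embeddings matrix
Listing 6-12: Model definition
Listing 6-14: Training and evaluation
Listing 6-15: Plotting the results

You can also train the same model without loading the pretrained word embeddings and without freezing the embedding layer. In that case, you learn a task-specific embedding of the input tokens, which is generally more powerful than pretrained word embeddings when lots of data is available. In this example, however, there are only 200 training samples. Let's try it.

Listing 6-16: Training the same model without pretrained word embeddings
Listing 6-17: Tokenizing the data of the test set
Listing 6-18: Evaluating the model on the test set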

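2. Using pretrained word embeddings
Sometimes you have so little training data available that you can't use your data alone to learn an appropriate task-specific embedding of your vocabulary. What do you do then?
Instead of learning word embeddings jointly with the problem you want to solve, you can load embedding vectors from a precomputed embedding space that you know is highly structured and has useful properties, that is, one that captures generic aspects of language structure. The rationale behind using pretrained word embeddings in natural language processing is much the same as for using pretrained convnets in image classification: you don't have enough data to learn truly powerful features on your own, but you expect the features you need to be fairly generic, such as common visual features or semantic features. In this case, it makes sense to reuse features learned on a different problem.
Such word embeddings are generally computed using word-occurrence statistics (observations about which words co-occur in sentences or documents), using a variety of techniques, some involving neural networks, others not. The idea of a dense, low-dimensional embedding space for words, computed in an unsupervised way, was first explored by Bengio et al. in the early 2000s, but it only started to take off in research and industry applications after the release of one of the most famous and successful word-embedding schemes: the word2vec algorithm, developed by Tomas Mikolov at Google in 2013. Word2vec dimensions capture specific semantic properties, such as gender.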

import keras
keras.__version__


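# In real-world word-embedding spaces, common examples of meaningful geometric transformations
# are "gender" vectors and "plural" vectors. For instance, by adding a "female" vector to the
# vector "king", we obtain the vector "queen". By adding a "plural" vector, we obtain "kings".
# Word-embedding spaces typically feature thousands of such interpretable and potentially
# useful vectors.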
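
To make the "gender vector" idea concrete, here is a minimal sketch with hand-written 2-D vectors. Everything in it (the `toy_vectors` values, the `nearest_word` helper) is hypothetical and chosen for illustration; real embedding spaces are learned, have far more dimensions, and only approximately satisfy such relations.

```python
import numpy as np

# Hypothetical 2-D toy vectors: axis 0 = "royalty", axis 1 = "gender".
# Real embeddings are learned from data, not written by hand.
toy_vectors = {
    "king":  np.array([1.0, 0.0]),
    "queen": np.array([1.0, 1.0]),
    "man":   np.array([0.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
}
female = toy_vectors["woman"] - toy_vectors["man"]  # the "gender" direction


def nearest_word(vector, vectors, exclude=()):
    """Return the word whose vector is closest (Euclidean distance) to `vector`."""
    candidates = [w for w in vectors if w not in exclude]
    return min(candidates, key=lambda w: np.linalg.norm(vectors[w] - vector))


queen_guess = nearest_word(toy_vectors["king"] + female, toy_vectors, exclude=("king",))
print(queen_guess)  # queen
```
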
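# Is there some ideal word-embedding space that would perfectly map human language and could be
# used for any natural-language-processing task? Possibly, but we have yet to compute anything
# of the sort. Also, there is no such thing as "human language": there are many different
# languages, and they aren't isomorphic, because a language is the reflection of a specific
# culture and a specific context. More pragmatically, what makes a good word-embedding space
# depends heavily on your task: the perfect word-embedding space for an English-language
# movie-review sentiment-analysis model may look different from the perfect embedding space for
# an English-language legal-document-classification model, because the importance of certain
# semantic relationships varies from task to task.
# It's thus reasonable to learn a new embedding space with every new task. Fortunately,
# backpropagation makes this easy, and Keras makes it even easier. It's about learning the
# weights of a layer: the Embedding layer.
# Listing 6-5: Instantiating an Embedding layer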


from keras.layers import Embedding

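# The Embedding layer takes at least two arguments:
# the number of possible tokens, here 1000 (1 + maximum word index),
# and the dimensionality of the embeddings, here 64.
#
# The Embedding layer takes as input a 2D tensor of integers, of shape (samples, sequence_length),
# where each entry is a sequence of integers. It can embed sequences of variable lengths: for
# instance, you could feed the Embedding layer instantiated below batches with shapes (32, 10)
# (a batch of 32 sequences of length 10) or (64, 15) (a batch of 64 sequences of length 15).
# All sequences in a batch must have the same length, though (because they need to be packed
# into a single tensor), so shorter sequences should be padded with zeros and longer sequences
# should be truncated.
# The layer returns a 3D floating-point tensor of shape (samples, sequence_length,
# embedding_dimensionality). Such a 3D tensor can then be processed by an RNN layer or a 1D
# convolution layer (both are introduced later).
# When you instantiate an Embedding layer, its weights (its internal dictionary of token
# vectors) are initially random, just as with any other layer. During training, these word
# vectors are gradually adjusted via backpropagation, structuring the space into something the
# downstream model can exploit. Once fully trained, the embedding space shows a lot of
# structure, a kind of structure specialized for the problem the model was trained to solve.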

embedding_layer = Embedding(1000, 64)
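
The shape behaviour described above can be reproduced without Keras. The following is a rough NumPy sketch of the lookup an embedding layer performs, not the Keras implementation itself; `embedding_table` and `batch` are made-up stand-ins for the layer's weight matrix and an input batch.

```python
import numpy as np

vocab_size, embedding_dim = 1000, 64   # same sizes as Embedding(1000, 64)
rng = np.random.default_rng(0)

# Stand-in for the layer's weight matrix: one random row per token index.
embedding_table = rng.normal(size=(vocab_size, embedding_dim)).astype("float32")

# A batch of 32 integer sequences of length 10, i.e. shape (samples, sequence_length).
batch = rng.integers(low=0, high=vocab_size, size=(32, 10))

# The lookup is plain integer indexing: every index is replaced by its row.
embedded = embedding_table[batch]
print(embedded.shape)  # (32, 10, 64)
```

The result has shape (samples, sequence_length, embedding_dimensionality), which matches the 3D tensor the comments above describe.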

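# Let's restrict the movie reviews to the top 10,000 most common words (as was done the first
# time this dataset was used) and cut the reviews down to only 20 words. The network will learn
# an 8-dimensional embedding for each of the 10,000 words, turn the input integer sequences
# (2D integer tensor) into embedded sequences (3D float tensor), flatten the tensor to 2D, and
# train a single Dense layer on top for classification.
# Listing 6-6: Loading the IMDB data for use with an Embedding layer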

from keras.datasets import imdb
from keras import preprocessing

# Number of words to consider as features
max_features = 10000
# Cut texts after this number of words
# (among the top max_features most common words)
maxlen = 20

# Load the data as lists of integers.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# This turns our lists of integers
# into a 2D integer tensor of shape `(samples, maxlen)`
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
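
What `pad_sequences` does to each review can be sketched in a few lines of plain Python. The helper `pad_pre` below is hypothetical and only mimics the function's default settings (`padding='pre'`, `truncating='pre'`, padding value 0); it is not the Keras code.

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic pad_sequences' defaults: keep the LAST maxlen items, left-pad with `value`."""
    kept = sequence[-maxlen:]                      # truncating='pre' drops items from the front
    return [value] * (maxlen - len(kept)) + kept   # padding='pre' adds the filler at the front


short_review = [5, 8, 13]
long_review = [1, 2, 3, 4, 5, 6, 7]
padded = [pad_pre(seq, maxlen=5) for seq in (short_review, long_review)]
print(padded)  # [[0, 0, 5, 8, 13], [3, 4, 5, 6, 7]]
```

Note that with these defaults a long review keeps its final tokens, not its opening ones.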

# Listing 6-7: Using an Embedding layer and classifier on the IMDB data

from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential()
# We specify the maximum input length to our Embedding layer
# so we can later flatten the embedded inputs.
model.add(Embedding(10000, 8, input_length=maxlen))
# After the Embedding layer,
# our activations have shape `(samples, maxlen, 8)`.

# We flatten the 3D tensor of embeddings
# into a 2D tensor of shape `(samples, maxlen * 8)`
model.add(Flatten())

# We add the classifier on top
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

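# You get to a validation accuracy of about 76%, which is pretty good considering that only
# 20 words of each review are used (with its default truncating='pre', pad_sequences keeps the
# last 20 words of a review, not the first 20). But note that merely flattening the embedded
# sequences and training a single Dense layer on top leads to a model that treats each word in
# the input sequence separately, without considering inter-word relationships and sentence
# structure (for example, this model would likely treat both "this movie is a bomb" and
# "this movie is the bomb" as negative reviews). It's much better to add recurrent layers or
# 1D convolutional layers on top of the embedded sequences to learn features that take each
# sequence as a whole into account. That is the focus of the next few sections.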
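
The Embedding, Flatten, Dense stack is small enough to check by hand. The arithmetic below is a sketch that recomputes the parameter counts reported by `model.summary()` for this model; it also shows why the classifier cannot capture word interactions, since the Dense layer holds just one weight per (position, embedding dimension) pair plus a bias.

```python
vocab_size, embedding_dim, maxlen = 10000, 8, 20

embedding_params = vocab_size * embedding_dim   # one 8-d vector per token index
flattened_size = maxlen * embedding_dim         # Flatten: (20, 8) -> 160 features
dense_params = flattened_size * 1 + 1           # 160 weights plus 1 bias

print(embedding_params, flattened_size, dense_params)   # 80000 160 161
print(embedding_params + dense_params)                  # 80161
```
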


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 161       
=================================================================
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 2s - loss: 0.6560 - acc: 0.6482 - val_loss: 0.5906 - val_acc: 0.7146
Epoch 2/10
20000/20000 [==============================] - 2s - loss: 0.5189 - acc: 0.7595 - val_loss: 0.5117 - val_acc: 0.7364
Epoch 3/10
20000/20000 [==============================] - 2s - loss: 0.4512 - acc: 0.7933 - val_loss: 0.4949 - val_acc: 0.7470
Epoch 4/10
20000/20000 [==============================] - 2s - loss: 0.4190 - acc: 0.8069 - val_loss: 0.4905 - val_acc: 0.7538
Epoch 5/10



import os

imdb_dir = '/home/ubuntu/data/aclImdb'
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

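for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            # The context manager closes the file even if reading fails.
            # encoding='utf-8' is an assumption about the review files.
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            # Negative reviews get label 0, positive reviews get label 1.
            labels.append(0 if label_type == 'neg' else 1)
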
2. Tokenize the data
Let's vectorize the texts we collected and prepare a training and validation split, using the concepts introduced earlier in this section.

Because pretrained word embeddings are meant to be particularly useful on problems where little training data is available (otherwise, task-specific embeddings are likely to outperform them), we add the following twist: we restrict the training data to its first 200 samples. So we will be learning to classify movie reviews after looking at just 200 examples.

# Listing 6-9: Tokenizing the text of the raw IMDB data

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 100  # We will cut reviews after 100 words
training_samples = 200  # We will be training on 200 samples
validation_samples = 10000  # We will be validating on 10,000 samples
max_words = 10000  # We will only consider the top 10,000 words in the dataset

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen)

labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
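
For intuition about what the tokenizer produces, here is a simplified pure-Python sketch of frequency-ranked indexing. The helpers `build_word_index` and `texts_to_sequences` are hypothetical; the real `Tokenizer` also strips punctuation, and the alphabetical tie-breaking used here is an assumption made only to keep the toy output deterministic.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is reserved)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))  # alphabetical ties: an assumption
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their indices, dropping words whose index is num_words or more."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]


toy_texts = ["the movie was great", "the movie was bad", "great great cast"]
toy_index = build_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index, num_words=4)
print(toy_index)      # {'great': 1, 'movie': 2, 'the': 3, 'was': 4, 'bad': 5, 'cast': 6}
print(toy_sequences)  # [[3, 2, 1], [3, 2], [1, 1]]
```

The full word index still lists every word seen (88,582 in the run shown here); only the conversion to sequences applies the `num_words` cutoff.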

# Split the data into a training set and a validation set.
# But first, shuffle the data, since we started from data
# where samples are ordered (all negative first, then all positive).

indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]
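
The key point of the shuffle above is that data and labels are reordered with the same permutation, so each review keeps its label. A small sketch on made-up arrays (the `toy_*` names and the fixed seed are assumptions, added for repeatability):

```python
import numpy as np

rng = np.random.default_rng(42)            # fixed seed (an assumption) for repeatability
toy_data = np.arange(10).reshape(5, 2)     # 5 samples, row i is [2i, 2i + 1]
toy_labels = np.array([0, 0, 0, 1, 1])     # ordered labels, like neg-then-pos

order = rng.permutation(len(toy_data))     # one permutation shared by both arrays
shuffled_data, shuffled_labels = toy_data[order], toy_labels[order]

train_n = 3
toy_x_train, toy_y_train = shuffled_data[:train_n], shuffled_labels[:train_n]
toy_x_val, toy_y_val = shuffled_data[train_n:], shuffled_labels[train_n:]

# Recover each row's original position and check it still carries its original label.
original_rows = shuffled_data[:, 0] // 2
print(np.array_equal(shuffled_labels, toy_labels[original_rows]))  # True
```
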
Found 88582 unique tokens.
Shape of data tensor: (25000, 100)
Shape of label tensor: (25000,)
3. Download the GloVe word embeddings
Head to https://nlp.stanford.edu/projects/glove/ (where you can learn more about the GloVe algorithm) and download the precomputed embeddings from 2014 English Wikipedia. It's an 822 MB zip file named glove.6B.zip, containing 100-dimensional embedding vectors for 400,000 words (or non-word tokens). Unzip it.

4. Preprocess the embeddings
Let's parse the unzipped file (a .txt file) to build an index that maps words (as strings) to their vector representation (as number vectors).

# Listing 6-10: Parsing the GloVe word-embeddings file

glove_dir = '/home/ubuntu/data/'

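embeddings_index = {}
# GloVe tokens include non-ASCII characters; encoding='utf-8' is an assumption
# that avoids relying on the platform's default encoding.
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]  # the first field is the token
        coefs = np.asarray(values[1:], dtype='float32')  # the rest is its vector
        embeddings_index[word] = coefs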

print('Found %s word vectors.' % len(embeddings_index))
Found 400000 word vectors.
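
The file layout that the parsing loop relies on is one token per line, followed by its space-separated coefficients. The sketch below runs the same loop on two made-up lines held in memory, so it can be tried without downloading anything; the values and the 3-dimensional vectors are invented for illustration.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout (the real file has 100 values per line).
fake_glove_file = io.StringIO(
    "the 0.1 -0.2 0.3\n"
    "movie 0.4 0.5 -0.6\n"
)

toy_embeddings_index = {}
for line in fake_glove_file:
    values = line.split()
    # The first field is the token, the remaining fields are the vector's coefficients.
    toy_embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

print(len(toy_embeddings_index), toy_embeddings_index["movie"].shape)  # 2 (3,)
```
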
Now let's build an embedding matrix that we can load into an Embedding layer. It must be a matrix of shape (max_words, embedding_dim), where each entry i contains the embedding_dim-dimensional vector for the word of index i in the reference word index (built during tokenization). Note that index 0 is not supposed to stand for any word or token; it's a placeholder.

# Listing 6-11: Preparing the GloVe word-embeddings matrix

embedding_dim = 100

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # Words not found in the embedding index keep their all-zeros row.
            embedding_matrix[i] = embedding_vector
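
A toy run of the same loop shows the three cases it handles: a word with a known vector, a word missing from the embedding index, and a word whose index falls outside the matrix. All names and values below are made up for illustration.

```python
import numpy as np

toy_max_words, toy_dim = 4, 3
toy_word_index = {"the": 1, "movie": 2, "zzyzx": 3, "rare": 4}   # index 0 is never assigned
toy_vectors = {
    "the": np.array([0.1, 0.2, 0.3]),
    "movie": np.array([0.4, 0.5, 0.6]),
    "rare": np.array([0.7, 0.8, 0.9]),
}

toy_matrix = np.zeros((toy_max_words, toy_dim))
for word, i in toy_word_index.items():
    vector = toy_vectors.get(word)
    if i < toy_max_words and vector is not None:
        toy_matrix[i] = vector

print(toy_matrix)
# Row 0 (placeholder) and row 3 ("zzyzx", no known vector) stay all zeros;
# "rare" has index 4, which is outside the matrix, so it is skipped.
```
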
5. Define a model
We will be using the same model architecture as before.

# Listing 6-12: Model definition

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_4 (Embedding)      (None, 100, 100)          1000000   
_________________________________________________________________
flatten_3 (Flatten)          (None, 10000)             0         
_________________________________________________________________
dense_4 (Dense)              (None, 32)                320032    
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 33        
=================================================================
Total params: 1,320,065
Trainable params: 1,320,065
Non-trainable params: 0
_________________________________________________________________
6. Load the GloVe embeddings in the model
The Embedding layer has a single weight matrix: a 2D float matrix where each entry i is the word vector meant to be associated with index i. Simple enough. Let's load the GloVe matrix we prepared into our Embedding layer, the first layer in our model.

# Listing 6-13: Loading pretrained word embeddings into the Embedding layer

model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False
Additionally, we freeze the embedding layer (we set its trainable attribute to False), following the same rationale as what you are already familiar with in the context of pre-trained convnet features: when parts of a model are pre-trained (like our Embedding layer), and parts are randomly initialized (like our classifier), the pre-trained parts should not be updated during training to avoid forgetting what they already know. The large gradient update triggered by the randomly initialized layers would be very disruptive to the already learned features.
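
The effect of freezing can be read off the parameter counts. The model summary shown earlier was printed before the layer was frozen, so it lists every weight as trainable. The arithmetic below is a sketch of the split one would expect afterwards, assuming only the Embedding layer is frozen.

```python
max_words, embedding_dim, maxlen = 10000, 100, 100

embedding_params = max_words * embedding_dim            # 1,000,000 weights loaded from GloVe
dense_32_params = (maxlen * embedding_dim) * 32 + 32    # 320,032
dense_1_params = 32 * 1 + 1                             # 33

frozen = embedding_params                               # the layer with trainable = False
trainable = dense_32_params + dense_1_params

print(trainable, frozen, trainable + frozen)            # 320065 1000000 1320065
```

Only about a quarter of the weights are left to fit on 200 samples, which is still a lot and is consistent with the fast overfitting seen in the training log.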

7. Train and evaluate
Let's compile our model and train it.

# Listing 6-14: Training and evaluation

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))
model.save_weights('pre_trained_glove_model.h5')
Train on 200 samples, validate on 10000 samples
Epoch 1/10
200/200 [==============================] - 1s - loss: 1.9075 - acc: 0.5050 - val_loss: 0.7027 - val_acc: 0.5102
Epoch 2/10
200/200 [==============================] - 0s - loss: 0.7329 - acc: 0.7100 - val_loss: 0.8200 - val_acc: 0.5000
Epoch 3/10
200/200 [==============================] - 0s - loss: 0.4876 - acc: 0.7400 - val_loss: 0.6917 - val_acc: 0.5616
Epoch 4/10
200/200 [==============================] - 0s - loss: 0.3640 - acc: 0.8400 - val_loss: 0.7005 - val_acc: 0.5557
Epoch 5/10
200/200 [==============================] - 0s - loss: 0.2673 - acc: 0.8950 - val_loss: 1.2560 - val_acc: 0.4999
Epoch 6/10
200/200 [==============================] - 0s - loss: 0.1936 - acc: 0.9400 - val_loss: 0.7294 - val_acc: 0.5704
Epoch 7/10
200/200 [==============================] - 0s - loss: 0.2455 - acc: 0.8800 - val_loss: 0.7187 - val_acc: 0.5659
Epoch 8/10
200/200 [==============================] - 0s - loss: 0.0591 - acc: 0.9950 - val_loss: 0.7393 - val_acc: 0.5723
Epoch 9/10
200/200 [==============================] - 0s - loss: 0.0399 - acc: 1.0000 - val_loss: 0.8691 - val_acc: 0.5522
Epoch 10/10
200/200 [==============================] - 0s - loss: 0.0283 - acc: 1.0000 - val_loss: 0.9322 - val_acc: 0.5413
Let's plot its performance over time (see Figures 6-5 and 6-6).

# Listing 6-15: Plotting the results

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

[Figures 6-5 and 6-6: training and validation accuracy and loss when using pretrained word embeddings]
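
The plots summarize the logged metrics, but the same numbers can be inspected directly. The sketch below copies the `val_acc` values from the 10-epoch training log of the GloVe model above and reports the best epoch and the spread; in a live session the list would come from `history.history['val_acc']`.

```python
# val_acc values copied from the 10-epoch training log of the GloVe model above.
logged_val_acc = [0.5102, 0.5000, 0.5616, 0.5557, 0.4999,
                  0.5704, 0.5659, 0.5723, 0.5522, 0.5413]

# Epochs are numbered from 1, list positions from 0.
best_epoch = max(range(len(logged_val_acc)), key=logged_val_acc.__getitem__) + 1
spread = max(logged_val_acc) - min(logged_val_acc)

print(best_epoch, logged_val_acc[best_epoch - 1])   # 8 0.5723
print(round(spread, 4))                             # 0.0724
```
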

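# The model quickly starts overfitting, which is unsurprising given the small number of
# training samples. Validation accuracy has high variance for the same reason, but it seems
# to reach the high 50s.
# Note that your results may differ: because there are so few training samples, performance
# depends heavily on exactly which 200 samples are chosen, and they are chosen at random. If
# this works poorly for you, try a different random set of 200 samples as an exercise (in real
# life, you don't get to choose your training data).
# You can also train the same model without loading the pretrained word embeddings and without
# freezing the embedding layer. In that case, you'll learn a task-specific embedding of the
# input tokens, which is generally more powerful than pretrained word embeddings when lots of
# data is available. In this example, however, there are only 200 training samples. Let's try
# it (see Figures 6-7 and 6-8).
# Listing 6-16: Training the same model without pretrained word embeddings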

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_5 (Embedding)      (None, 100, 100)          1000000   
_________________________________________________________________
flatten_4 (Flatten)          (None, 10000)             0         
_________________________________________________________________
dense_6 (Dense)              (None, 32)                320032    
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 33        
=================================================================
Total params: 1,320,065
Trainable params: 1,320,065
Non-trainable params: 0
_________________________________________________________________
Train on 200 samples, validate on 10000 samples
Epoch 1/10
200/200 [==============================] - 1s - loss: 0.6941 - acc: 0.4750 - val_loss: 0.6920 - val_acc: 0.5213
Epoch 2/10
200/200 [==============================] - 0s - loss: 0.5050 - acc: 0.9900 - val_loss: 0.6949 - val_acc: 0.5138
Epoch 3/10
200/200 [==============================] - 0s - loss: 0.2807 - acc: 1.0000 - val_loss: 0.7131 - val_acc: 0.5125


acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

[Figures 6-7 and 6-8: training and validation accuracy and loss without pretrained word embeddings]

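# Validation accuracy stalls in the low 50s. So in this case, pretrained word embeddings
# outperform jointly learned embeddings. If you increase the number of training samples, this
# will quickly stop being the case; try it as an exercise.
# Finally, let's evaluate the model on the test data. First, you need to tokenize the test data.
# Listing 6-17: Tokenizing the data of the test set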

test_dir = os.path.join(imdb_dir, 'test')

labels = []
texts = []

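for label_type in ['neg', 'pos']:
    dir_name = os.path.join(test_dir, label_type)
    for fname in sorted(os.listdir(dir_name)):
        if fname.endswith('.txt'):
            # The context manager closes the file even if reading fails.
            # encoding='utf-8' is an assumption about the review files.
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            # Negative reviews get label 0, positive reviews get label 1.
            labels.append(0 if label_type == 'neg' else 1)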

sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.asarray(labels)
And let's load and evaluate the first model:

# Listing 6-18: Evaluating the model on the test set

model.load_weights('pre_trained_glove_model.h5')
model.evaluate(x_test, y_test)

24736/25000 [============================>.] - ETA: 0s
[0.93747248332977295, 0.53659999999999997]
We get an appalling test accuracy of about 54% (0.5366 in the run above). Working with just a handful of training samples is hard!
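
The second number returned by `model.evaluate` is the accuracy metric requested at compile time. As a rough sketch of what that number means for a sigmoid output, the snippet below thresholds made-up probabilities at 0.5 and compares them with made-up labels; the arrays are invented for illustration and are not model outputs.

```python
import numpy as np

# Made-up sigmoid outputs and labels, for illustration only.
toy_probabilities = np.array([0.91, 0.40, 0.55, 0.08, 0.62])
toy_y_true = np.array([1, 1, 0, 0, 1])

toy_y_pred = (toy_probabilities > 0.5).astype(int)     # threshold at 0.5
toy_accuracy = float(np.mean(toy_y_pred == toy_y_true))

print(toy_y_pred.tolist(), toy_accuracy)   # [1, 0, 1, 0, 1] 0.6
```

On a balanced two-class test set, an accuracy near 0.5 is what random guessing would achieve, which puts the result above in perspective.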