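Study summary
(1) On the difficulty of the five assignments, see the CS224n assignment review written by a Stanford student. In short: Transformers became a focus of the course this year, taught by head TA John, a third-year PhD student, who originally wanted students to hand-write an encoder-decoder (dropped after students pushed back, haha). Assignment 5 is the hardest: you train a vanilla model and a pretrained model and compare the results. The first three assignments are the same as in previous years; assignments 4 and 5 are new in 2021.
(2) Assignment 1 is a simple exploration of the properties of word vectors. Assignment 2 derives the formulas for training word vectors (the most calculus-intensive assignment of the course). Assignment 3 is the only one that involves fairly traditional linguistic concepts and algorithms: dependency parsing. Assignment 4 builds a machine translation model, this time for Cherokee (a Native American language) and English.
(3) Assignment 5 is this year's headline addition, following the major trend in NLP: the math part explores properties of multi-head attention; the coding part reimplements some pretraining data-processing code (span corruption) and an attention variant.
(4) The Stanford NLP group has only 4 to 7 professors, while CMU has 30+; besides NLP, CMU has dedicated courses for machine translation, question answering, search engines, and so on. Stanford runs four quarters a year (10 or 11 weeks each), so many tasks (such as information extraction and dialogue systems) cannot be covered. Because of time limits and technology trends, the course also contains fewer and fewer linguistics-oriented concepts. Information theory, hidden Markov models, TF-IDF, word segmentation and other concepts from 《数学之美》 (The Beauty of Mathematics) are not covered in 224n and have become less relevant in the BERT era; the technology does iterate quickly.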
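I. Review of dependency parsing
Many problems can be recast as classification problems. A transition-based dependency parser turns the problem of predicting a tree structure into the problem of predicting a sequence of actions.
One approach:
Encoder: computes the hidden-layer vector representations of the words.
Decoder: computes the scores of all actions for the current state.
For assignment 3, Stanford provides the (manually annotated) data files for parsing.
The training set, train.conll, looks like this: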
1 In _ ADP IN _ 5 case _ _
2 an _ DET DT _ 5 det _ _
3 Oct. _ PROPN NNP _ 5 compound _ _
4 19 _ NUM CD _ 5 nummod _ _
5 review _ NOUN NN _ 45 nmod _ _
6 of _ ADP IN _ 9 case _ _
7 `` _ PUNCT `` _ 9 punct _ _
8 The _ DET DT _ 9 det _ _
9 Misanthrope _ NOUN NN _ 5 nmod _ _
10 '' _ PUNCT '' _ 9 punct _ _
........
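In the file, sentences are separated by blank lines. Each line describes one token and has 10 columns, the first of which is the token's index within its sentence (the dev and test sets use the same format):
- ID: word index, an integer starting at 1 for each new sentence; may be a range for multiword tokens.
- FORM: the word form or punctuation symbol.
- LEMMA: the lemma or stem of the word form.
- UPOSTAG: universal part-of-speech tag, drawn from a revised version of the Google universal POS tags.
- XPOSTAG: language-specific part-of-speech tag; underscore if not available.
- FEATS: list of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
- HEAD: head of the current token, which is either a value of ID or zero (0).
- DEPREL: Universal Stanford dependency relation to the HEAD (root iff HEAD = 0), or a defined language-specific subtype of one.
- DEPS: list of secondary dependencies (head-deprel pairs).
- MISC: any other annotation.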
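The column layout above can be made concrete with a small reader. This is only a sketch: the helper names `parse_conll_line` and `unlabeled_arcs` are hypothetical (they are not part of the assignment code), and it assumes that fields contain no internal whitespace, which holds for the sample lines shown earlier.

```python
from typing import Dict, List, Tuple

# Column names in the order listed above.
CONLL_FIELDS = ["ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
                "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_conll_line(line: str) -> Dict[str, str]:
    """Split one token line into a dict keyed by column name (hypothetical helper)."""
    values = line.split()
    if len(values) != len(CONLL_FIELDS):
        raise ValueError(f"expected {len(CONLL_FIELDS)} columns, got {len(values)}")
    return dict(zip(CONLL_FIELDS, values))


def unlabeled_arcs(lines: List[str]) -> List[Tuple[int, int]]:
    """Return (head_id, dependent_id) pairs; a head of 0 denotes ROOT."""
    tokens = [parse_conll_line(line) for line in lines]
    return [(int(tok["HEAD"]), int(tok["ID"])) for tok in tokens]


# The first three token lines of the train.conll excerpt above.
sample = [
    "1 In _ ADP IN _ 5 case _ _",
    "2 an _ DET DT _ 5 det _ _",
    "3 Oct. _ PROPN NNP _ 5 compound _ _",
]
first = parse_conll_line(sample[0])
arcs = unlabeled_arcs(sample)
print(first["FORM"], first["XPOSTAG"], first["DEPREL"])
print(arcs)
```

Only the HEAD and ID columns are needed for the unlabeled arcs that UAS evaluates; DEPREL would matter for a labeled score.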
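Examples:
(1) The XPOSTAG column (column 4 when counting from 0) holds part-of-speech tags, for example:
NNP: noun, proper, singular
VBZ: verb, present tense, 3rd person singular
(2) The DEPREL column (column 7 when counting from 0) holds dependency relations, for example:
nsubj: nominal subject
dobj: direct object
punct: punctuation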
Another file, en-cw.txt, contains the pretrained word vectors.
II. CS224n assignment requirements
Neural Transition-Based Dependency Parsing (44 points)
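Assignment: a neural, transition-based dependency parser.
Goal: maximize the UAS (Unlabeled Attachment Score).
First read the assignment's README carefully and make sure all the dependencies listed in local_env.yml are installed: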
# 1. Activate your old environment:
conda activate cs224n
# 2. Install docopt
conda install docopt
# 3. Install pytorch, torchvision, and tqdm
conda install pytorch torchvision -c pytorch
conda install -c anaconda tqdm
If you would rather create a new virtual environment:
# 1. Create an environment with dependencies specified in local_env.yml
# (note that this can take some time depending on your laptop):
conda env create -f local_env.yml
# 2. Activate the new environment:
conda activate cs224n_a3
# To deactivate an active environment, use
conda deactivate
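A dependency parser analyzes the structure of a sentence in terms of dependency relations (see the previous lecture). There are transition-based, graph-based, and feature-based parsers, among others. This assignment asks for a transition-based one.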
At every step it maintains a partial parse, which is represented as follows:
- A stack of words that are currently being processed.
- A buffer of words yet to be processed.
- A list of dependencies predicted by the parser
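Initially, the stack contains only ROOT, the list of dependencies is empty, and the buffer contains all the words of the sentence in order. At each step the parser applies one transition to the partial parse, and it repeats this until the buffer is empty and the stack has size 1 (only ROOT remains):
(1) SHIFT: removes the first word from the buffer and pushes it onto the stack.
(2) LEFT-ARC: marks the second item on the stack as a dependent of the top item and removes the second item from the stack (the two subtrees on top of the stack are merged by a left arc).
(3) RIGHT-ARC: marks the top item on the stack as a dependent of the second item and removes the top item from the stack (the two subtrees are merged by a right arc).
That is, at each step the parser acts as a classifier: it scores the three transitions, picks the most probable one, and applies it.
Figure source: Che Wanxiang, 《基于预训练模型的自然语言处理》 (Natural Language Processing Based on Pre-trained Models)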
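Note: in the starter file parser_transitions.py, the function unidirectional_predict contains return [("RA" if pp.stack[1] is "right" else "LA") if len(pp.buffer) == 0 else "S". Change is to ==, because is compares object identity rather than string contents; otherwise Python 3.8+ emits the following warning: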
SyntaxWarning: "is" with a literal. Did you mean "=="?
return [("RA" if pp.stack[1] is "right" else "LA") if len(pp.buffer) == 0 else "S"
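The difference between the two operators can be seen in a short sketch. The variable names are illustrative only, and the result of the identity check depends on the interpreter's string handling, so it is printed rather than relied upon.

```python
# Build the same text in two different ways so that the two objects
# are not necessarily the same object in memory.
literal = "right"
built = "".join(["ri", "ght"])    # assembled at run time

same_value = (literal == built)   # compares the characters
same_object = (literal is built)  # compares object identity

print("equal contents:", same_value)
print("same object:", same_object)
```

Equality of contents is what the parser logic needs, which is why == is the appropriate operator in unidirectional_predict.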
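III. The questions
Question 1
(1) (4 points) Sentence: I parsed this sentence correctly.
Task: for each step, give the transition that was applied and the new dependency that was added (if any). The first three steps are given:
Answer:
The resulting dependency tree is: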
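The answer table and tree were images in the original post and are not reproduced here. As a stand-in, the sketch below replays one transition sequence for this sentence and prints the configuration after every step. The transition list assumes the usual analysis of the sentence (parsed as the root, with I, sentence and correctly as its dependents, and this attached to sentence); the function name `trace_parse` is hypothetical.

```python
def trace_parse(sentence, transitions):
    """Apply S / LA / RA transitions and record the state after each step."""
    stack, buffer, deps = ["ROOT"], list(sentence), []
    rows = []
    for t in transitions:
        new_dep = None
        if t == "S":
            stack.append(buffer.pop(0))
        elif t == "LA":
            # head = top of stack, dependent = second item (removed)
            new_dep = (stack[-1], stack.pop(-2))
        elif t == "RA":
            # head = second item, dependent = top of stack (removed)
            new_dep = (stack[-2], stack.pop(-1))
        else:
            raise ValueError(f"unknown transition: {t}")
        if new_dep is not None:
            deps.append(new_dep)
        rows.append((t, list(stack), list(buffer), new_dep))
    return rows, deps


sentence = ["I", "parsed", "this", "sentence", "correctly"]
# Assumed transition sequence for the analysis described above.
transitions = ["S", "S", "LA", "S", "S", "LA", "RA", "S", "RA", "RA"]

rows, deps = trace_parse(sentence, transitions)
for t, stack, buffer, new_dep in rows:
    print(f"{t:<3} stack={stack} buffer={buffer} new={new_dep}")
```

Each printed row corresponds to one row of the answer table: the transition, the stack, the buffer, and the dependency added at that step as (head, dependent).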
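Question 2
(2) (2 points) How many steps does it take to parse a sentence of n words? Explain briefly in 1-2 sentences.
Answer: 2n steps. Each word needs two transitions before it leaves the stack: a SHIFT to push it onto the stack and an arc transition (LEFT-ARC or RIGHT-ARC) to remove it as a dependent. Each step of the parse performs exactly one of these transitions for exactly one word, so n words take 2n steps.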
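Question 3
(3) (6 points) Implement the constructor __init__ and the parse_step function of the PartialParse class in parser_transitions.py, i.e. the transition mechanics of the parser. Run python parser_transitions.py part_c to test them.
1) First, the constructor __init__ of the PartialParse class in parser_transitions.py:
Here dependencies is a list of tuples, each tuple representing one dependency as (head, dependent).
Notes:
(1) The root node is the ROOT token.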
(2)If you need to use the sentence object to initialize anything, make sure to not directly reference the sentence object. That is, remember to NOT modify the sentence object.
class PartialParse(object):
    def __init__(self, sentence):
        """Initializes this partial parse.
        @param sentence (list of str): The sentence to be parsed as a list of words.
                                        Your code should not modify the sentence.
        """
        # The sentence being parsed is kept for bookkeeping purposes. Do NOT alter it in your code.
        self.sentence = sentence

        ### YOUR CODE HERE (3 Lines)
        ### Your code should initialize the following fields:
        ###     self.stack: The current stack represented as a list with the top of the stack as the
        ###                 last element of the list.
        ###     self.buffer: The current buffer represented as a list with the first item on the
        ###                  buffer as the first item of the list
        ###     self.dependencies: The list of dependencies produced so far. Represented as a list of
        ###             tuples where each tuple is of the form (head, dependent).
        ###             Order for this list doesn't matter.
        ###
        ### Note: The root token should be represented with the string "ROOT"
        ### Note: If you need to use the sentence object to initialize anything, make sure to not directly
        ###       reference the sentence object. That is, remember to NOT modify the sentence object.
        self.stack = ['ROOT']
        self.buffer = sentence.copy()  # shallow copy, so the sentence itself is never modified
        self.dependencies = []
        ### END YOUR CODE
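2) Next, the parse_step function of the PartialParse class, in the same parser_transitions.py file:
Here stack.pop(-2) removes the second-to-last element of stack (which admittedly bends the definition of a stack, haha). For example: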
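stack = [1]
stack.append(2)
stack.append(3)
print(stack)
# the last element of the stack list is 3
print("Last element of the stack:", stack[-1])
# pop the second-to-last element; pop returns the removed value
# here the second-to-last element is 2
print(stack.pop(-2))
# print the contents of the stack again
print(stack)
The output is:
[1, 2, 3]
Last element of the stack: 3
2
[1, 3]
Back to the solution for this question: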
    def parse_step(self, transition):
        """Performs a single parse step by applying the given transition to this partial parse
        @param transition (str): A string that equals "S", "LA", or "RA" representing the shift,
                                 left-arc, and right-arc transitions. You can assume the provided
                                 transition is a legal transition.
        """
        ### YOUR CODE HERE (~7-12 Lines)
        ### TODO:
        ###     Implement a single parsing step, i.e. the logic for the following as
        ###     described in the pdf handout:
        ###         1. Shift
        ###         2. Left Arc
        ###         3. Right Arc
        if transition == 'S':
            # self.stack.append(self.buffer.pop(0))  # equivalent to the two lines below
            self.stack.append(self.buffer[0])
            self.buffer.pop(0)
        elif transition == 'LA':
            # self.dependencies.append((self.stack[-1], self.stack.pop(-2)))
            # equivalent to the two lines below: the first element (head) points to the second (dependent)
            self.dependencies.append((self.stack[-1], self.stack[-2]))
            self.stack.pop(-2)
        elif transition == 'RA':
            # self.dependencies.append((self.stack[-2], self.stack.pop(-1)))
            self.dependencies.append((self.stack[-2], self.stack[-1]))
            self.stack.pop(-1)
        ### END YOUR CODE
3) The PartialParse class can be tested on a sentence:
# the sentence to parse:
sentence = ["parse", "this", "sentence"]
# pass in the list of transitions to apply:
dependencies = PartialParse(sentence).parse(["S", "S", "S", "LA", "RA", "RA"])
# sort the resulting dependencies:
dependencies = tuple(sorted(dependencies))
# the dependencies we expect from a correct parse:
expected = (('ROOT', 'parse'), ('parse', 'sentence'), ('sentence', 'this'))
After sorting, the dependencies produced by PartialParse match the expected value exactly. As before, dependencies is a list of tuples, each tuple representing one dependency as (head, dependent).
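Question 4: parsing multiple sentences
The DummyModel class:
The sentence is first placed in the buffer. The predict method of DummyModel then generates the transitions: while the buffer still has elements it predicts SHIFT, moving the words onto the stack one at a time; once the buffer is empty, all words are on the stack and the arcs begin. If stack[1], the first word of the sentence (stack[0] is ROOT), is right, DummyModel predicts RA; otherwise (the left case) it predicts LA.
The minibatch_parse function can be tested first: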
# the list of sentences to parse:
sentences = [["right", "arcs", "only"],
             ["right", "arcs", "only", "again"],
             ["left", "arcs", "only"],
             ["left", "arcs", "only", "again"]]
# batched parsing:
# DummyModel() supplies the transitions to apply; 2 is the batch size
deps = minibatch_parse(sentences, DummyModel(), 2)
# expected dependencies (after sorting):
# deps[0]: (('ROOT', 'right'), ('arcs', 'only'), ('right', 'arcs'))
# deps[1]: (('ROOT', 'right'), ('arcs', 'only'), ('only', 'again'), ('right', 'arcs'))
# deps[2]: (('only', 'ROOT'), ('only', 'arcs'), ('only', 'left'))
# deps[3]: (('again', 'ROOT'), ('again', 'arcs'), ('again', 'left'), ('again', 'only'))
The deps obtained by parsing the list of sentences are the same as the expected dependencies listed in the comments above.
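(4) (8 points) The classifier could predict the next transition for one partial parse at a time, but it is much more efficient to predict the next transition for many partial parses at once, i.e. to parse in minibatches.
Implement the minibatch_parse function in parser_transitions.py; it can be tested with python parser_transitions.py part_d.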
def minibatch_parse(sentences, model, batch_size):
    """Parses a list of sentences in minibatches using a model.
    @param sentences (list of list of str): A list of sentences to be parsed
                                            (each sentence is a list of words and each word is of type string)
    @param model (ParserModel): The model that makes parsing decisions. It is assumed to have a function
                                model.predict(partial_parses) that takes in a list of PartialParses as input and
                                returns a list of transitions predicted for each parse. That is, after calling
                                    transitions = model.predict(partial_parses)
                                transitions[i] will be the next transition to apply to partial_parses[i].
    @param batch_size (int): The number of PartialParses to include in each minibatch
    @return dependencies (list of dependency lists): each element is the list of dependencies for the
                                                     corresponding sentence, in the same order as sentences
    """
    dependencies = []

    ### YOUR CODE HERE (~8-10 Lines)
    ### TODO:
    ###     Implement the minibatch parse algorithm. Note that the pseudocode for this algorithm is given in the pdf handout.
    ###
    ###     Note: A shallow copy (as denoted in the PDF) can be made with the "=" sign in python, e.g.
    ###                 unfinished_parses = partial_parses[:].
    ###             Here `unfinished_parses` is a shallow copy of `partial_parses`.
    ###             In Python, a shallow copied list like `unfinished_parses` does not contain new instances
    ###             of the object stored in `partial_parses`. Rather both lists refer to the same objects.
    ###             In our case, `partial_parses` contains a list of partial parses. `unfinished_parses`
    ###             contains references to the same objects. Thus, you should NOT use the `del` operator
    ###             to remove objects from the `unfinished_parses` list. This will free the underlying memory that
    ###             is being accessed by `partial_parses` and may cause your code to crash.
    partial_parses = [PartialParse(sentence) for sentence in sentences]
    unfinished_parses = partial_parses[:]  # shallow copy
    while len(unfinished_parses) > 0:
        # take the first batch_size parses from unfinished_parses
        minibatch_partial_parses = unfinished_parses[:batch_size]
        # the model predicts the next transition for each partial parse in the minibatch
        minibatch_transitions = model.predict(minibatch_partial_parses)
        # apply the predicted transition to each partial parse in the minibatch
        for transition, partial_parse in zip(minibatch_transitions, minibatch_partial_parses):
            partial_parse.parse_step(transition)
        # drop the finished parses (empty buffer and a stack of size 1) from the unfinished list
        unfinished_parses = [
            partial_parse for partial_parse in unfinished_parses
            if not (len(partial_parse.buffer) == 0 and len(partial_parse.stack) == 1)
        ]
    for partial_parse in partial_parses:
        dependencies.append(partial_parse.dependencies)
    ### END YOUR CODE

    return dependencies
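Since the snippets above depend on the rest of the starter file, the following condensed, self-contained sketch may be easier to run on its own. It is not the starter code: the classes are trimmed re-implementations, `is_finished` is a hypothetical helper, and the DummyModel here is a stand-in that follows the behaviour described earlier (shift until the buffer is empty, then arcs in one direction).

```python
class PartialParse:
    """Trimmed stand-in for the assignment's PartialParse class."""

    def __init__(self, sentence):
        self.sentence = sentence
        self.stack = ["ROOT"]
        self.buffer = list(sentence)      # copy, so the sentence is not modified
        self.dependencies = []

    def parse_step(self, transition):
        if transition == "S":
            self.stack.append(self.buffer.pop(0))
        elif transition == "LA":
            self.dependencies.append((self.stack[-1], self.stack.pop(-2)))
        elif transition == "RA":
            self.dependencies.append((self.stack[-2], self.stack.pop(-1)))

    def is_finished(self):                # hypothetical helper
        return len(self.buffer) == 0 and len(self.stack) == 1


class DummyModel:
    """Stand-in model: SHIFT until the buffer is empty, then only RA or only LA."""

    def predict(self, partial_parses):
        return [("RA" if pp.stack[1] == "right" else "LA") if len(pp.buffer) == 0 else "S"
                for pp in partial_parses]


def minibatch_parse(sentences, model, batch_size):
    partial_parses = [PartialParse(s) for s in sentences]
    unfinished = partial_parses[:]        # shallow copy: same objects, new list
    while unfinished:
        batch = unfinished[:batch_size]
        for transition, pp in zip(model.predict(batch), batch):
            pp.parse_step(transition)
        unfinished = [pp for pp in unfinished if not pp.is_finished()]
    return [pp.dependencies for pp in partial_parses]


sentences = [["right", "arcs", "only"],
             ["right", "arcs", "only", "again"],
             ["left", "arcs", "only"],
             ["left", "arcs", "only", "again"]]
deps = [tuple(sorted(d)) for d in minibatch_parse(sentences, DummyModel(), 2)]
for d in deps:
    print(d)
```

Because unfinished is a shallow copy, the parses are advanced in place and the final dependencies can be read from partial_parses in the original sentence order, even though sentences finish at different times.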
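Question 5
(5) (12 points) Model training.
The model extracts a feature vector that represents the current state. We use the feature set proposed in the original neural dependency parsing paper, A Fast and Accurate Dependency Parser using Neural Networks.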
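The functions that extract these features are already implemented in utils/parser_utils.py. The feature vector consists of a list of tokens (for example, the last word on the stack, the first word in the buffer, and so on). They can be represented as a list of integers $\mathbf{w} = [w_1, w_2, \dots, w_m]$, where $m$ is the number of features and each $0 \le w_i < |V|$ is the index of a token in the vocabulary ($|V|$ is the vocabulary size). The network then looks up the embedding of each word and concatenates them into a single input vector:
$\mathbf{x} = [\mathbf{E}_{w_1}, \dots, \mathbf{E}_{w_m}] \in \mathbb{R}^{dm}$
where $\mathbf{E} \in \mathbb{R}^{|V| \times d}$ is the embedding matrix, each row $\mathbf{E}_w$ of which is the embedding of word $w$ (consistent with the row lookup in embedding_lookup).
The network:
$\mathbf{h} = \mathrm{ReLU}(\mathbf{x}\mathbf{W} + \mathbf{b}_1)$
$\mathbf{l} = \mathbf{h}\mathbf{U} + \mathbf{b}_2$
$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{l})$
The model is trained to minimize the cross-entropy loss, and UAS is used as the evaluation metric:
$J(\theta) = CE(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{i=1}^{3} y_i \log \hat{y}_i$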
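Requirements: parser_model.py contains a scaffold for this network; complete its __init__, embedding_lookup, and forward functions, then complete the train_for_epoch and train functions in run.py. Finally, run python run.py to train the model and compute its predictions on the test set (the Penn Treebank, annotated with Universal Dependencies).
Note: this assignment asks you to implement the linear layer and the embedding layer yourself, so do not use torch.nn.Linear or torch.nn.Embedding directly.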
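5.1 The parser_model.py file
1) The embedding layer function:
Explanation: see the official documentation for torch.index_select, whose parameters are index_select(input, dim, index):
- dim: 0 selects rows; 1 selects columns.
- index: a 1-D tensor of indices; for example, with dim 0, tensor([0, 2]) selects rows 0 and 2.
All the word vectors are stacked into one embedding matrix, and embedding_lookup maps indices to embedding vectors. It plays the same role as one_hot_lookup in the earlier RNN example, where the string hello is represented as the input x_data = [1, 0, 2, 2, 3].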
# prepare the data
idx2char = ['e', 'h', 'l', 'o']
# the input is the string hello
x_data = [1, 0, 2, 2, 3]
y_data = [3, 1, 2, 3, 2]
one_hot_lookup = [[1, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]]
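To make the analogy concrete, the sketch below performs the lookup in plain Python: each index selects one row of the table, and the selected rows are concatenated into one flat vector, which mirrors the (n_features * embed_size) layout that embedding_lookup returns for one example. The names `x_vectors`, `x_flat` and `decoded` are illustrative, and plain lists are used instead of tensors as a simplification.

```python
idx2char = ['e', 'h', 'l', 'o']
x_data = [1, 0, 2, 2, 3]            # indices spelling "hello"
one_hot_lookup = [[1, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]]

# Lookup: row i of the table is the vector for index i.
x_vectors = [one_hot_lookup[i] for i in x_data]

# Concatenate the per-index vectors into a single flat input vector.
x_flat = [value for vector in x_vectors for value in vector]

# Map the indices back to characters to check the encoding.
decoded = "".join(idx2char[i] for i in x_data)

print(decoded)
print(x_vectors)
print(len(x_flat))
```

Replacing the one-hot table with a learned matrix of shape (vocabulary size, embedding size) gives exactly the row selection that torch.index_select performs along dim 0.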
    def embedding_lookup(self, w):
        """ Utilize `w` to select embeddings from embedding matrix `self.embeddings`
        @param w (Tensor): input tensor of word indices (batch_size, n_features)
        @return x (Tensor): tensor of embeddings for words represented in w
                            (batch_size, n_features * embed_size)
        """
        ### YOUR CODE HERE (~1-4 Lines)
        ### TODO:
        ###     1) For each index `i` in `w`, select `i`th vector from self.embeddings
        ###     2) Reshape the tensor using `view` function if necessary
        ###
        ### Note: All embedding vectors are stacked and stored as a matrix. The model receives
        ###       a list of indices representing a sequence of words, then it calls this lookup
        ###       function to map indices to sequence of embeddings.
        ###
        ###       This problem aims to test your understanding of embedding lookup,
        ###       so DO NOT use any high level API like nn.Embedding
        ###       (we are asking you to implement that!). Pay attention to tensor shapes
        ###       and reshape if necessary. Make sure you know each tensor's shape before you run the code!
        ###
        ### Pytorch has some useful APIs for you, and you can use either one
        ### in this problem (except nn.Embedding). These docs might be helpful:
        ###     Index select: https://pytorch.org/docs/stable/torch.html#torch.index_select
        ###     Gather: https://pytorch.org/docs/stable/torch.html#torch.gather
        ###     View: https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view
        ###     Flatten: https://pytorch.org/docs/stable/generated/torch.flatten.html
        # select one row per index, then regroup the rows by example
        x = torch.index_select(self.embeddings, 0, w.flatten()).reshape(w.shape[0], -1)
        ### END YOUR CODE
        return x
2) The forward function:
Note that PyTorch's cross-entropy loss already includes the softmax, so forward should not apply softmax itself.
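The reason a separate softmax is unnecessary can be checked numerically. The sketch below uses only the standard library and is not the PyTorch implementation: `log_softmax` and `cross_entropy_from_logits` are hypothetical helpers showing that taking the cross-entropy directly from the logits gives the same value as applying softmax first and then taking the negative log of the target probability.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def cross_entropy_from_logits(logits, target):
    """Cross-entropy for one example, computed directly from the logits."""
    return -log_softmax(logits)[target]


logits = [2.0, 0.5, -1.0]   # illustrative scores for S, LA, RA
target = 0                  # index of the correct transition

loss = cross_entropy_from_logits(logits, target)

# The same loss computed in two explicit steps: softmax, then -log.
exps = [math.exp(v) for v in logits]
probs = [e / sum(exps) for e in exps]
loss_two_step = -math.log(probs[target])

print(loss, loss_two_step)
```

Applying softmax inside forward and then passing the result to a loss that applies it again would normalize twice and give a different loss value.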
    def forward(self, w):
        """ Run the model forward.
            Note that we will not apply the softmax function here because it is included in the loss function nn.CrossEntropyLoss
            PyTorch Notes:
                - Every nn.Module object (PyTorch model) has a `forward` function.
                - When you apply your nn.Module to an input tensor `w` this function is applied to the tensor.
                    For example, if you created an instance of your ParserModel and applied it to some `w` as follows,
                    the `forward` function would be called on `w` and the result would be stored in the `output` variable:
                        model = ParserModel()
                        output = model(w) # this calls the forward function
                - For more details checkout: https://pytorch.org/docs/stable/nn.html#torch.nn.Module.forward
        @param w (Tensor): input tensor of tokens (batch_size, n_features)
        @return logits (Tensor): tensor of predictions (output after applying the layers of the network)
                                 without applying softmax (batch_size, n_classes)
        """
        ### YOUR CODE HERE (~3-5 lines)
        ### TODO:
        ###     Complete the forward computation as described in write-up. In addition, include a dropout layer
        ###     as declared in `__init__` after ReLU function.
        ###
        ### Please see the following docs for support:
        ###     Matrix product: https://pytorch.org/docs/stable/torch.html#torch.matmul
        ###     ReLU: https://pytorch.org/docs/stable/nn.html?highlight=relu#torch.nn.functional.relu
        x = self.embedding_lookup(w)  # (batch_size, n_features * embed_size)
        h = F.relu(torch.matmul(x, self.embed_to_hidden_weight) + self.embed_to_hidden_bias)
        if self.training:
            h = self.dropout(h)
        logits = torch.matmul(h, self.hidden_to_logits_weight) + self.hidden_to_logits_bias
        ### END YOUR CODE
        return logits
5.2 The run.py file
This file contains the train_for_epoch and train functions.
The train function:
optimizer = optim.Adam(parser.model.parameters())
loss_func = nn.CrossEntropyLoss()
In the train function above, every epoch of the for loop runs train_for_epoch once, which returns the dev UAS. Note that the dropout layer is only active during training; it is not applied at test time.
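The train/test difference can be illustrated with a plain-Python sketch of inverted dropout. This is an illustrative helper, not the PyTorch API: the function name `dropout` and its `training` and `rng` parameters are assumptions made for the example.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout on a list of floats (illustrative helper)."""
    if not training or p == 0.0:
        return list(values)           # eval mode: values pass through unchanged
    scale = 1.0 / (1.0 - p)           # rescale kept units so the expected value is unchanged
    return [0.0 if rng.random() < p else v * scale for v in values]


hidden = [0.5, 1.0, 1.5, 2.0]
rng = random.Random(0)                # fixed seed so the run is repeatable

train_out = dropout(hidden, p=0.5, training=True, rng=rng)
eval_out = dropout(hidden, p=0.5, training=False, rng=rng)

print("train:", train_out)
print("eval: ", eval_out)
```

In the assignment code the same switch is made by calling parser.model.train() before the training loop and parser.model.eval() before evaluation.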
The train_for_epoch function:
It runs the forward pass, computes the loss, and backpropagates; just follow the TODO hints:
def train_for_epoch(parser, train_data, dev_data, optimizer, loss_func, batch_size):
    """ Train the neural dependency parser for single epoch.
    @return dev_UAS (float): Unlabeled Attachment Score (UAS) for dev data
    """
    # Places model in "train" mode, i.e. apply dropout layer
    parser.model.train()
    n_minibatches = math.ceil(len(train_data) / batch_size)
    loss_meter = AverageMeter()

    with tqdm(total=(n_minibatches)) as prog:
        for i, (train_x, train_y) in enumerate(minibatches(train_data, batch_size)):
            # clear the gradients
            optimizer.zero_grad()
            # store loss for this batch here
            loss = 0.
            train_x = torch.from_numpy(train_x).long()
            train_y = torch.from_numpy(train_y.nonzero()[1]).long()

            ### YOUR CODE HERE (~4-10 lines)
            ### TODO:
            ###      1) Run train_x forward through model to produce `logits`
            ###      2) Use the `loss_func` parameter to apply the PyTorch CrossEntropyLoss function.
            ###         This will take `logits` and `train_y` as inputs. It will output the CrossEntropyLoss
            ###         between softmax(`logits`) and `train_y`. Remember that softmax(`logits`)
            ###         are the predictions (y^ from the PDF).
            ###      3) Backprop losses
            ###      4) Take step with the optimizer
            ### Please see the following docs for support:
            ###     Optimizer Step: https://pytorch.org/docs/stable/optim.html#optimizer-step
            logits = parser.model(train_x)  # calling the module runs forward
            loss = loss_func(logits, train_y)
            loss.backward()
            optimizer.step()
            ### END YOUR CODE
            prog.update(1)
            loss_meter.update(loss.item())

    print("Average Train Loss: {}".format(loss_meter.avg))

    print("Evaluating on dev set",)
    parser.model.eval()  # Places model in "eval" mode, i.e. don't apply dropout layer
    dev_UAS, _ = parser.parse(dev_data)
    print("- dev UAS: {:.2f}".format(dev_UAS * 100.0))
    return dev_UAS
Final results of the run:
dev UAS: 88.60 (dev set)
test UAS: 89.08
Here UAS stands for Unlabeled Attachment Score.
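UAS is the fraction of tokens whose predicted head matches the gold head, ignoring the relation label. The sketch below computes it for a toy example; the function name `unlabeled_attachment_score` and the data layout (one list of head indices per sentence, 0 for ROOT) are assumptions made for illustration and are not taken from the assignment's parser.parse.

```python
def unlabeled_attachment_score(gold_heads, pred_heads):
    """Fraction of tokens whose predicted head equals the gold head."""
    correct = 0
    total = 0
    for gold, pred in zip(gold_heads, pred_heads):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences differ in length")
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return correct / total


# Heads for "I parsed this sentence correctly" (token positions start at 1, 0 is ROOT).
gold = [[2, 0, 4, 2, 2]]
# A hypothetical prediction that attaches "this" to "parsed" instead of "sentence".
pred = [[2, 0, 2, 2, 2]]

uas = unlabeled_attachment_score(gold, pred)
print("UAS: {:.2f}".format(uas * 100.0))
```

With one wrong head out of five tokens, the score is 80.00, reported on the same 0-100 scale as the dev and test numbers above.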
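A neural dependency parser
The neural model is a fully connected network followed by softmax classification (dependency-parsing features ==> embedded word vectors ==> ReLU(xW + b1) ==> Dropout ==> pred = h_drop U + b2 ==> softmax_cross_entropy_with_logits).
1. The load_and_preprocess_data function returns the parser, the embedding matrix, the training set (after feature extraction), the dev set, and the test set:
parser, embeddings, train_examples, dev_set, test_set = load_and_preprocess_data(debug)
2. Create a model instance, model = ParserModel(config, embeddings), where ParserModel inherits from Model. The model is initialized with config and embeddings and calls the build method of the parent class Model; the subclass ParserModel overrides add_placeholders(), add_prediction_op(), add_loss_op(self.pred), and add_training_op(self.loss). (These method names, like session and saver in step 3, appear to come from a TensorFlow implementation of the same model rather than from the PyTorch code above.)
3. Train the model:
model.fit(session, saver, parser, train_examples, dev_set)
4. Parse the test set:
UAS, dependencies = parser.parse(test_set)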
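IV. Experiment and results
The progress bar here comes from the tqdm package. Each bar shows the elapsed time, the estimated remaining time, and the iteration rate, in the form [elapsed<remaining, rate].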
SyntaxWarning: "is" with a literal. Did you mean "=="?
return [("RA" if pp.stack[1] is "right" else "LA") if len(pp.buffer) == 0 else "S"
================================================================================
INITIALIZING
================================================================================
Loading data...
took 2.58 seconds
Building parser...
took 1.66 seconds
Loading pretrained embeddings...
took 8.80 seconds
Vectorizing data...
took 1.95 seconds
Preprocessing training data...
took 65.94 seconds
took 0.18 seconds
================================================================================
TRAINING
================================================================================
Epoch 1 out of 10
100%|██████████| 1848/1848 [02:07<00:00, 14.54it/s]
Average Train Loss: 0.1781648173605725
Evaluating on dev set
1445850it [00:00, 28400918.10it/s]
- dev UAS: 84.76
New best dev UAS! Saving model.
Epoch 2 out of 10
100%|██████████| 1848/1848 [02:21<00:00, 13.09it/s]
Average Train Loss: 0.11059159756480873
Evaluating on dev set
1445850it [00:00, 21319509.36it/s]
- dev UAS: 86.57
New best dev UAS! Saving model.
Epoch 3 out of 10
100%|██████████| 1848/1848 [02:29<00:00, 12.35it/s]
Average Train Loss: 0.09602350440255297
Evaluating on dev set
1445850it [00:00, 21010828.57it/s]
- dev UAS: 87.23
New best dev UAS! Saving model.
Epoch 4 out of 10
100%|██████████| 1848/1848 [02:24<00:00, 12.78it/s]
Average Train Loss: 0.08655059076765012
Evaluating on dev set
1445850it [00:00, 18410020.64it/s]
- dev UAS: 88.03
New best dev UAS! Saving model.
Epoch 5 out of 10
100%|██████████| 1848/1848 [02:29<00:00, 12.32it/s]
Average Train Loss: 0.07943204295664251
Evaluating on dev set
1445850it [00:00, 24345468.35it/s]
- dev UAS: 88.25
New best dev UAS! Saving model.
Epoch 6 out of 10
100%|██████████| 1848/1848 [02:28<00:00, 12.42it/s]
Average Train Loss: 0.07376304407124266
Evaluating on dev set
1445850it [00:00, 23765859.77it/s]
- dev UAS: 88.06
Epoch 7 out of 10
100%|██████████| 1848/1848 [02:11<00:00, 14.08it/s]
Average Train Loss: 0.06907538355638583
Evaluating on dev set
1445850it [00:00, 16358657.93it/s]
- dev UAS: 88.15
Epoch 8 out of 10
100%|██████████| 1848/1848 [02:12<00:00, 13.92it/s]
Average Train Loss: 0.06480039135468277
Evaluating on dev set
1445850it [00:00, 20698658.75it/s]
- dev UAS: 88.45
New best dev UAS! Saving model.
Epoch 9 out of 10
100%|██████████| 1848/1848 [02:31<00:00, 12.22it/s]
Average Train Loss: 0.061141976250085606
Evaluating on dev set
1445850it [00:00, 22635715.12it/s]
- dev UAS: 88.41
Epoch 10 out of 10
100%|██████████| 1848/1848 [02:18<00:00, 13.36it/s]
Average Train Loss: 0.05778654704870277
Evaluating on dev set
1445850it [00:00, 30163164.76it/s]
- dev UAS: 88.60
New best dev UAS! Saving model.
================================================================================
TESTING
================================================================================
Restoring the best model weights found on the dev set
Final evaluation on test set
2919736it [00:00, 31476695.98it/s]
- test UAS: 89.08
Reference
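(1) 详解Transition-based Dependency parser基于转移的依存句法解析器 (A detailed explanation of the transition-based dependency parser)
(2) 斯坦福大学CS224N课程作业 (Stanford CS224N course assignments)
(3) CS224N Lecture5:依存句法分析 (CS224N Lecture 5: dependency parsing)
(4) 段智华大佬的assignment3作业笔记 (Duan Zhihua's assignment 3 notes)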