5.8 Summary
• Words can be grouped into classes, such as nouns, verbs, adjectives, and adverbs. These classes are known as lexical categories or parts-of-speech. Parts-of-speech are assigned short labels, or tags, such as NN and VB.
• The process of automatically assigning parts-of-speech to words in text is called part-of-speech tagging, POS tagging, or just tagging.
• Automatic tagging is an important step in the NLP pipeline, and is useful in a variety of situations, including predicting the behavior of previously unseen words, analyzing word usage in corpora, and text-to-speech systems.
• Some linguistic corpora, such as the Brown Corpus, have been POS tagged.
• A variety of tagging methods are possible, e.g., default tagger, regular expression tagger, unigram tagger, and n-gram taggers. These can be combined using a technique known as backoff (a sketch of such a combination follows this list).
• Taggers can be trained and evaluated using tagged corpora.
• Backoff is a method for combining models: when a more specialized model (such as a bigram tagger) cannot assign a tag in a given context, we back off to a more general model (such as a unigram tagger).
• Part-of-speech tagging is an important, early example of a sequence classification task in NLP: a classification decision at any one point in the sequence makes use of words and tags in the local context.
• A dictionary is used to map between arbitrary types of information, such as a string and a number: freq['cat'] = 12. We create dictionaries using the brace notation: pos = {}, pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}.
• N-gram taggers can be defined for large values of n, but once n is larger than 3, we usually encounter the sparse data problem; even with a large quantity of training data, we see only a tiny fraction of possible contexts.
• Transformation-based tagging involves learning a series of repair rules of the form “change tag s to tag t in context c,” where each rule fixes mistakes and possibly introduces a (smaller) number of errors (a sketch of this approach also follows the list).
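The points above about tagger combination, training, and evaluation can be pulled together in a short sketch, following the pattern used earlier in this chapter. The news category, the 90/10 train/test split, and the 'NN' default tag are illustrative choices, not requirements.

    import nltk
    from nltk.corpus import brown

    # Tagged sentences from the news category of the Brown Corpus
    tagged_sents = brown.tagged_sents(categories='news')

    # Illustrative 90/10 split into training and held-out evaluation data
    size = int(len(tagged_sents) * 0.9)
    train_sents, test_sents = tagged_sents[:size], tagged_sents[size:]

    # Backoff chain: bigram tagger -> unigram tagger -> default tag 'NN'
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)

    # Score the combined tagger on the held-out sentences
    # (recent NLTK versions also offer t2.accuracy(test_sents))
    print(t2.evaluate(test_sents))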
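Transformation-based tagging is available in NLTK as the Brill tagger. The sketch below assumes the train_sents, test_sents, and unigram tagger t1 from the previous listing, and uses the ready-made fntbl37 rule-template set; capping the number of rules at ten is only to keep the output readable.

    from nltk.tag.brill import fntbl37
    from nltk.tag.brill_trainer import BrillTaggerTrainer

    # Learn a small set of "change tag s to tag t in context c" repair rules
    # on top of the unigram tagger's initial guesses
    trainer = BrillTaggerTrainer(t1, fntbl37(), trace=0)
    brill_tagger = trainer.train(train_sents, max_rules=10)

    print(brill_tagger.evaluate(test_sents))
    for rule in brill_tagger.rules():
        print(rule)   # each learned rule repairs some of the initial tagger's mistakes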