Answer the following questions:

  (1) How can we build a system that extracts structured data from unstructured text?

  (2) What robust methods are there for identifying the entities and relations described in a text?

  (3) Which corpora are appropriate for this work, and how can we use them to train and evaluate models?

1 Information Extraction

Information comes in many "shapes" and "sizes". One important form is structured data: a regular and predictable organization of entities and relations. For example, we might be interested in the relation between companies and locations, which can be stored in a relational database.

But if we try to obtain similar information from free text, things get trickier. How do we recover a table of entities and relations from a passage of prose?

Once the data is in tabular form, we can query it with powerful tools such as SQL. This approach to getting meaning out of text is called "information extraction".
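To make the contrast concrete, here is a minimal sketch (with made-up organization and location pairs) of storing extracted tuples and querying them with SQL via Python's sqlite3 module:

```python
import sqlite3

# Hypothetical (organization, location) pairs, as an IE system might extract them.
pairs = [("Omnicom", "New York"), ("BBDO South", "Atlanta"), ("Georgia-Pacific", "Atlanta")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE locations (org TEXT, loc TEXT)")
conn.executemany("INSERT INTO locations VALUES (?, ?)", pairs)

# Which organizations are located in Atlanta?
rows = conn.execute("SELECT org FROM locations WHERE loc = 'Atlanta'").fetchall()
print([org for (org,) in rows])  # ['BBDO South', 'Georgia-Pacific']
```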

Information extraction has many applications, including business intelligence, resume harvesting, media analysis, sentiment detection, patent search, and email scanning. A particularly important area of current research is extracting structured data from the electronic scientific literature, especially in biology and medicine.

# Information Extraction Architecture

[figure: pipeline architecture for an information extraction system]

To perform the first three tasks in this pipeline (sentence segmentation, tokenization and part-of-speech tagging), we can define a simple preprocessing function:

import nltk, re, pprint

def ie_preprocess(document):
    sentences = nltk.sent_tokenize(document)                      # sentence segmentation
    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # tokenization
    sentences = [nltk.pos_tag(sent) for sent in sentences]        # part-of-speech tagging
    return sentences
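As a corpus-free illustration of the same pipeline shape, the sketch below fakes sentence segmentation and tokenization with naive regular expressions; a real system should use nltk.sent_tokenize and nltk.word_tokenize as above, and the tagging step is omitted here because it needs a trained tagger:

```python
import re

def naive_preprocess(document):
    # Naive stand-ins for sentence segmentation and tokenization;
    # POS tagging is left out because it requires a trained model.
    sentences = re.split(r'(?<=[.!?])\s+', document.strip())
    return [re.findall(r"\w+|[^\w\s]", sent) for sent in sentences]

doc = "The little dog barked. It ran away!"
print(naive_preprocess(doc))
# [['The', 'little', 'dog', 'barked', '.'], ['It', 'ran', 'away', '!']]
```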

2 Chunking

The basic technique we will use for entity recognition is chunking.

[figure: segmentation and chunking of an example sentence]

The smaller boxes show word-level tokenization and part-of-speech tagging, while the large boxes show higher-level chunking.

In this section we explore chunking in more depth, beginning with the definition and representation of chunks. We will look at regular-expression and n-gram approaches to chunking, and develop and evaluate chunkers using the CoNLL-2000 chunking corpus.

# Noun Phrase Chunking

NP-chunking (noun phrase chunking) searches for chunks corresponding to individual noun phrases.

One of the most useful sources of information for NP-chunking is part-of-speech tags. This is one of the motivations for performing part-of-speech tagging in an information extraction system.

To create an NP-chunker, we first define a chunk grammar specifying how sentences should be chunked. In this example we define a simple grammar with a single regular-expression rule.

This rule says that an NP chunk consists of an optional determiner, followed by any number of adjectives, and then a noun. Using this grammar we create a chunk parser and test it on our example sentence. The result is a tree, which we can print or display graphically.

sentence = [("the","DT"),("little","JJ"),("yellow","JJ"),("dog","NN"),("barked","VBD"),("at","IN"),("the","DT"),("cat","NN")]
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
result.draw()


[figure: chunk tree for the example sentence, as drawn by result.draw()]
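To see what a tag pattern like <DT>?<JJ>*<NN> is doing, here is a rough sketch of the idea RegexpParser uses internally (a simplification of NLTK's actual implementation): encode the sentence's tag sequence as a string of <TAG> units and match an ordinary regular expression against it:

```python
import re

sentence = [("the","DT"), ("little","JJ"), ("yellow","JJ"), ("dog","NN"),
            ("barked","VBD"), ("at","IN"), ("the","DT"), ("cat","NN")]

# Encode the tag sequence as one string of <TAG> units and match the NP
# pattern against it; counting '<' characters maps string offsets back to
# token positions.
tag_string = "".join("<%s>" % tag for (_, tag) in sentence)
np_pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

spans = []
for m in np_pattern.finditer(tag_string):
    start = tag_string[:m.start()].count("<")
    end = start + m.group().count("<")
    spans.append([w for (w, _) in sentence[start:end]])
print(spans)  # [['the', 'little', 'yellow', 'dog'], ['the', 'cat']]
```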


# Tag Patterns

Tag patterns can be explored interactively using the graphical interface nltk.app.chunkparser().

# Chunking with Regular Expressions

grammar = r"""
   NP: {<DT|PP\$>?<JJ>*<NN>}    # chunk an optional determiner or possessive pronoun, adjectives and a noun
       {<NNP>+}                 # chunk sequences of one or more proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel","NNP"),("let","VBD"),("down", "RP"),("her","PP$"),("long","JJ"),("golden","JJ"),("hair","NN")]
print(cp.parse(sentence))
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))
nouns = [("money","NN"),("market","NN"),("fund","NN")]
grammar = "NP: {<NN><NN>}"  # a rule matching two consecutive nouns, applied to text with three consecutive nouns, chunks only the first two
cp = nltk.RegexpParser(grammar)
print(cp.parse(nouns))
(S (NP money/NN market/NN) fund/NN)

# Exploring Text Corpora

Chunkers can be used to extract phrases matching a particular sequence of part-of-speech tags from a tagged corpus.

cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
brown = nltk.corpus.brown
for sent in brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'CHUNK': 
            print(subtree)
(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
......
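The same search can be sketched without the Brown corpus; the tiny tagged sentences below are invented for illustration:

```python
# A corpus-free sketch of the same search: scan tagged sentences for the
# pattern verb + TO + verb.
tagged_sents = [
    [("They","PRP"), ("wanted","VBD"), ("to","TO"), ("wait","VB"), (".",".")],
    [("She","PRP"), ("left","VBD"), ("early","RB"), (".",".")],
]

matches = []
for sent in tagged_sents:
    for i in range(len(sent) - 2):
        (w1, t1), (w2, t2), (w3, t3) = sent[i:i+3]
        if t1.startswith("V") and t2 == "TO" and t3.startswith("V"):
            matches.append((w1, w2, w3))
print(matches)  # [('wanted', 'to', 'wait')]
```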

# Chinking

Sometimes it is easier to define what we want to exclude from a chunk. A chink is a sequence of tokens that is not included in a chunk.

Chinking is the process of removing a sequence of tokens from a chunk.

grammar = r"""
NP:
   {<.*>+}
   }<VBD|IN>+{"""
sentence = [("the","DT"),("little","JJ"),("yellow","JJ"),("dog","NN"),("barked","VBD"),("at","IN"),("the","DT"),("cat","NN")]
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
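Chinking can also be spelled out in plain Python (a sketch, not NLTK's implementation): start with one chunk covering the whole sentence, then split it wherever a chinked tag occurs:

```python
# Chink everything tagged VBD or IN out of one big chunk, splitting it.
sentence = [("the","DT"), ("little","JJ"), ("yellow","JJ"), ("dog","NN"),
            ("barked","VBD"), ("at","IN"), ("the","DT"), ("cat","NN")]

chunks, current = [], []
for word, tag in sentence:
    if tag in ("VBD", "IN"):      # the chink: remove these tokens from any chunk
        if current:
            chunks.append(current)
            current = []
    else:
        current.append(word)
if current:
    chunks.append(current)
print(chunks)  # [['the', 'little', 'yellow', 'dog'], ['the', 'cat']]
```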

# Representing Chunks: Tags vs Trees

As intermediate representations between tagging and full parsing, chunk structures can be represented using either tags or trees. The most widespread representation is IOB tags.

In this scheme, each token is labeled with one of three special chunk tags: I (inside), O (outside) or B (begin).

A token is tagged B if it begins a chunk. Subsequent tokens within that chunk are tagged I, and all other tokens are tagged O.

The B and I tags are suffixed with the chunk type, e.g. B-NP, I-NP.

NLTK uses trees as its internal representation of chunks, but provides methods for converting between trees and the IOB format.
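The tree-to-IOB direction can be sketched by hand; the snippet below mimics what nltk.chunk.tree2conlltags produces, using nested lists as a stand-in for NLTK's Tree objects:

```python
# Chunks are (label, tokens) pairs; tokens outside any chunk are plain
# (word, pos) tuples.
chunked = [("NP", [("the","DT"), ("little","JJ"), ("dog","NN")]),
           ("barked","VBD"), ("at","IN"),
           ("NP", [("the","DT"), ("cat","NN")])]

iob = []
for item in chunked:
    if isinstance(item[1], list):            # a chunk: emit B- then I- tags
        label, tokens = item
        for j, (word, pos) in enumerate(tokens):
            prefix = "B-" if j == 0 else "I-"
            iob.append((word, pos, prefix + label))
    else:                                    # outside any chunk
        word, pos = item
        iob.append((word, pos, "O"))
print(iob)
```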

3 Developing and Evaluating Chunkers

How do we evaluate a chunker?

# Reading IOB Format and the CoNLL-2000 Chunking Corpus

The CoNLL-2000 chunking corpus contains 270,000 words of Wall Street Journal text, divided into "train" and "test" portions, annotated with part-of-speech tags and chunk tags in IOB format.

from nltk.corpus import conll2000
print(conll2000.chunked_sents('train.txt')[99])
(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)

It contains three chunk types: NP chunks, VP chunks, and PP chunks.


print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])  # select the NP chunks only

# Simple Evaluation and Baselines

cp = nltk.RegexpParser("")     # baseline that creates no chunks at all
test_sents = conll2000.chunked_sents('test.txt',chunk_types=['NP'])
print(cp.evaluate(test_sents))  # evaluation results
ChunkParse score:
    IOB Accuracy:  43.4%
    Precision:      0.0%
    Recall:         0.0%
    F-Measure:      0.0%
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)     # naive regular-expression chunker: any tag beginning with C, D, J, N or P
test_sents = conll2000.chunked_sents('test.txt')
print(cp.evaluate(test_sents))  # evaluation results
ChunkParse score:
    IOB Accuracy:  62.5%
    Precision:     70.6%
    Recall:        38.5%
    F-Measure:     49.8%
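The scores above are computed over chunk spans; a minimal sketch of the arithmetic, with invented (start, end, label) spans:

```python
# Precision, recall and F-measure over sets of (start, end, label) chunk
# spans; the gold and guessed spans here are illustrative only.
gold = {(0, 4, "NP"), (6, 8, "NP"), (9, 10, "NP")}
guessed = {(0, 4, "NP"), (6, 8, "NP"), (8, 10, "NP")}

tp = len(gold & guessed)                 # chunks found in both sets
precision = tp / len(guessed)            # fraction of guessed chunks that are correct
recall = tp / len(gold)                  # fraction of gold chunks that were found
f_measure = 2 * precision * recall / (precision + recall)
print(precision, recall, f_measure)
```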

Noun phrase chunking with a unigram tagger

# use the training corpus to find the chunk tag (I, O or B) most likely for each part-of-speech tag
# a unigram tagger can be used to build a chunker: instead of determining the correct POS tag for each word, it tries to determine the correct chunk tag, given each word's POS tag
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)] for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        # attach the IOB chunk tag predicted for each POS tag
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag) in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)  # convert back to a chunk tree
# train and test on the CoNLL-2000 chunking corpus
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
unigram_chunker = UnigramChunker(train_sents)
print(unigram_chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.8%
    F-Measure:     83.2%
postags = sorted(set(pos for sent in train_sents for (word,pos) in sent.leaves()))
print(unigram_chunker.tagger.tag(postags))
[(u'#', u'B-NP'), (u'$', u'B-NP'), (u"''", u'O'), (u'(', u'O'), (u')', u'O'), (u',', u'O'), (u'.', u'O'), (u':', u'O'), (u'CC', u'O'), (u'CD', u'I-NP'), (u'DT', u'B-NP'), (u'EX', u'B-NP'), (u'FW', u'I-NP'), (u'IN', u'O'), (u'JJ', u'I-NP'), (u'JJR', u'B-NP'), (u'JJS', u'I-NP'), (u'MD', u'O'), (u'NN', u'I-NP'), (u'NNP', u'I-NP'), (u'NNPS', u'I-NP'), (u'NNS', u'I-NP'), (u'PDT', u'B-NP'), (u'POS', u'B-NP'), (u'PRP', u'B-NP'), (u'PRP$', u'B-NP'), (u'RB', u'O'), (u'RBR', u'O'), (u'RBS', u'B-NP'), (u'RP', u'O'), (u'SYM', u'O'), (u'TO', u'O'), (u'UH', u'O'), (u'VB', u'O'), (u'VBD', u'O'), (u'VBG', u'O'), (u'VBN', u'O'), (u'VBP', u'O'), (u'VBZ', u'O'), (u'WDT', u'B-NP'), (u'WP', u'B-NP'), (u'WP$', u'B-NP'), (u'WRB', u'O'), (u'``', u'O')]
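The table above is exactly the kind of mapping a unigram tagger learns: the most frequent chunk tag for each POS tag. Spelled out on toy data:

```python
from collections import Counter, defaultdict

# For each POS tag, pick the chunk tag it most often carries in the
# training data (toy training pairs below).
train = [("DT","B-NP"), ("JJ","I-NP"), ("NN","I-NP"),
         ("VBD","O"), ("DT","B-NP"), ("NN","I-NP"), ("IN","O")]

counts = defaultdict(Counter)
for pos, chunktag in train:
    counts[pos][chunktag] += 1
table = {pos: c.most_common(1)[0][0] for pos, c in counts.items()}
print(table)  # {'DT': 'B-NP', 'JJ': 'I-NP', 'NN': 'I-NP', 'VBD': 'O', 'IN': 'O'}
```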
# use the training corpus to find the chunk tag (I, O or B) most likely for each part-of-speech tag
# a bigram tagger can be used the same way, except that it conditions the chunk tag on both the current and the previous POS tag
class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)] for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        # attach the IOB chunk tag predicted for each POS tag
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag) in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)  # convert back to a chunk tree
bigram_chunker = BigramChunker(train_sents)
print(bigram_chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  93.3%
    Precision:     82.3%
    Recall:        86.8%
    F-Measure:     84.5%
class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word,tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train(train_set, algorithm='megam', trace=0)  # maximum entropy; 'megam' requires the external megam binary
    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w,t),c) for (w,t,c) in nltk.chunk.tree2conlltags(sent)] for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)
    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w,t,c) for ((w,t),c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)

def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    return {"pos": pos}  # only the POS tag of the current token
chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.7%
    F-Measure:     83.2%
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "prevpos": prevpos}  # model interactions between adjacent tags
chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  93.7%
    Precision:     82.1%
    Recall:        87.2%
    F-Measure:     84.5%

def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "word": word, "prevpos": prevpos}   # add the word itself as a feature
chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  94.2%
    Precision:     83.2%
    Recall:        88.3%
    F-Measure:     85.7%
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    if i == len(sentence)-1:
        nextword, nextpos = "<END>", "<END>"
    else:
        nextword, nextpos = sentence[i+1]
    return {"pos": pos,
            "word": word,
            "prevpos": prevpos,
            "nextpos": nextpos,
            "prevpos+pos": "%s+%s" % (prevpos, pos),
            "pos+nextpos": "%s+%s" % (pos, nextpos),
            "tags-since-dt": tags_since_dt(sentence, i)}  # lookahead features, paired features and complex contextual features
def tags_since_dt(sentence, i):
    tags = set()
    for word, pos in sentence[:i]:
        if pos == "DT":
            tags = set()
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))
chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  96.0%
    Precision:     88.8%
    Recall:        91.1%
    F-Measure:     89.9%
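To see what the tags-since-dt feature contributes, here it is applied to a concrete sentence: the set of POS tags seen since the most recent determiner, joined into one string.

```python
def tags_since_dt(sentence, i):
    # Collect the POS tags seen since the most recent determiner (DT),
    # resetting the set every time a DT is encountered.
    tags = set()
    for word, pos in sentence[:i]:
        if pos == "DT":
            tags = set()
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

sent = [("the","DT"), ("little","JJ"), ("yellow","JJ"), ("dog","NN"), ("barked","VBD")]
print(tags_since_dt(sent, 4))  # 'JJ+NN'
```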
4 Building Nested Structure with Cascaded Chunkers

grammar = r"""
   NP: {<DT|JJ|NN.*>+}          # chunk sequences of DT, JJ, NN
   PP: {<IN><NP>}               # chunk prepositions followed by NP
   VP: {<VB.*><NP|PP|CLAUSE>+$} # chunk verbs and their arguments
   CLAUSE: {<NP><VP>}           # chunk NP, VP
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Mary","NN"), ("saw","VBD"),("the","DT"),("cat","NN"),("sit","VB"),("on","IN"),("the","DT"),("mat","NN")]
print(cp.parse(sentence))
(S
  (NP Mary/NN)
  saw/VBD   # the VP containing "saw" was not recognized
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))
cp = nltk.RegexpParser(grammar, loop=2)  # loop over the patterns a second time
print(cp.parse(sentence))
(S
  (CLAUSE
    (NP Mary/NN)
    (VP
      saw/VBD
      (CLAUSE
        (NP the/DT cat/NN)
        (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))
5 Named Entity Recognition

NLTK provides a classifier-based named entity recognizer, nltk.ne_chunk:

sent = nltk.corpus.treebank.tagged_sents()[22]
print(nltk.ne_chunk(sent, binary=True))    # with binary=True, named entities are tagged simply as NE
(S
  The/DT
  (NE U.S./NNP)
  is/VBZ
  one/CD
  of/IN
print(nltk.ne_chunk(sent))    # otherwise categories such as PERSON, ORGANIZATION and GPE are distinguished
(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  ......
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ,/,
  a/DT
  professor/NN
  of/IN
  pathlogy/NN
  at/IN
  the/DT
  (ORGANIZATION University/NNP)
  of/IN
  (PERSON Vermont/NNP College/NNP)
  of/IN
  (GPE Medicine/NNP)
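nltk.ne_chunk relies on a trained classifier; purely as an illustration of the kind of per-token labeling it produces, here is a toy gazetteer lookup (the entity lists are invented and far too small for real use):

```python
# Toy gazetteer-based NE labeling in an IOB-like (word, pos, label) form.
# This only illustrates the output format; it is not how ne_chunk works.
GAZETTEER = {"U.S.": "GPE", "Vermont": "GPE", "Brooke": "PERSON"}

def toy_ne_tag(tagged_sent):
    return [(word, pos, GAZETTEER.get(word, "O")) for word, pos in tagged_sent]

sent = [("The","DT"), ("U.S.","NNP"), ("is","VBZ"), ("one","CD")]
print(toy_ne_tag(sent))
```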
6 Relation Extraction

Once named entities have been identified, we can look for relations between them. Here we search for (ORG, 'in', LOC) triples, using a filler pattern that excludes gerund uses of "in":

IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.relextract.rtuple(rel))
[ORG: u'WHYY'] u'in' [LOC: u'Philadelphia']
[ORG: u'McGlashan & Sarrail'] u'firm in' [LOC: u'San Mateo']
[ORG: u'Freedom Forum'] u'in' [LOC: u'Arlington']
[ORG: u'Brookings Institution'] u', the research group in' [LOC: u'Washington']
[ORG: u'Idealab'] u', a self-described business incubator based in' [LOC: u'Los Angeles']
[ORG: u'Open Text'] u', based in' [LOC: u'Waterloo']
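The behavior of the filler pattern can be checked in isolation; the second test string below is invented to show the negative lookahead rejecting a gerund context ("in supervising"), which is usually not a location relation:

```python
import re

# The filler pattern from above: match a standalone "in", but reject it
# when followed by a gerund ("...ing"), as in "succeeding in supervising".
IN = re.compile(r'.*\bin\b(?!\b.+ing)')

fillers = [", the research group in", "succeeding in supervising the"]
print([bool(IN.match(f)) for f in fillers])  # [True, False]
```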