Chapter 11: Managing Linguistic Data
- 11.1 Corpus Structure: A Case Study
- Main Design Features
- Fundamental Data Types
- 11.2 The Life Cycle of a Corpus
- Three Corpus Creation Scenarios
- Quality Control
- Curation vs. Evolution
- 11.3 Acquiring Data
- Obtaining Data from the Web
- Obtaining Data from Word Processor Files
- Obtaining Data from Spreadsheets and Databases
- Converting Data Formats
- Deciding Which Layers of Annotation to Include
- Standards and Tools
- Special Considerations When Working with Endangered Languages
- 11.4 Working with XML
- Using XML for Linguistic Structures
- The ElementTree Interface
- Using ElementTree for Accessing Toolbox Data
- Formatting Entries
- 11.5 Working with Toolbox Data
- Adding a Field to Each Entry
- Validating a Toolbox Lexicon
- 11.6 Describing Language Resources Using OLAC Metadata
- What Is Metadata?
- OLAC: Open Language Archives Community
- 11.7 Summary
import nltk, re
- How do we design a new language resource, and ensure that its coverage, balance, and documentation support a wide range of uses?
- When existing data is in the wrong format for some analysis tool, how can we convert it to a suitable format?
- What is a good way to document the existence of a resource we have created, so that others can easily find it?
11.1 Corpus Structure: A Case Study
phonetic = nltk.corpus.timit.phones('dr1-fvmh0/sa1')
phonetic[:10]
['h#', 'sh', 'iy', 'hv', 'ae', 'dcl', 'y', 'ix', 'dcl', 'd']
nltk.corpus.timit.word_times('dr1-fvmh0/sa1')
[('she', 7812, 10610),
('had', 10610, 14496),
('your', 14496, 15791),
('dark', 15791, 20720),
('suit', 20720, 25647),
('in', 25647, 26906),
('greasy', 26906, 32668),
('wash', 32668, 37890),
('water', 38531, 42417),
('all', 43091, 46052),
('year', 46052, 50522)]
timitdict = nltk.corpus.timit.transcription_dict()
timitdict['greasy'] + timitdict['wash'] + timitdict['water']
['g', 'r', 'iy1', 's', 'iy', 'w', 'ao1', 'sh', 'w', 'ao1', 't', 'axr']
phonetic[17:30]
['g', 'r', 'iy', 's', 'iy', 'w', 'aa', 'sh', 'epi', 'w', 'aa', 'dx', 'ax']
nltk.corpus.timit.spkrinfo('dr1-fvmh0')
SpeakerInfo(id='VMH0', sex='F', dr='1', use='TRN', recdate='03/11/86', birthdate='01/08/60', ht='5\'05"', race='WHT', edu='BS', comments='BEST NEW ENGLAND ACCENT SO FAR')
Main Design Features
- The corpus contains two layers of annotation, phonetic and orthographic. In general, a text or speech corpus may be annotated at many different linguistic levels, including morphological, syntactic, and discourse levels.
- The corpus is balanced across multiple dimensions of variation, for coverage of dialect regions and diphones.
- There is a sharp division between the original linguistic event, captured as an audio recording, and the annotations of that event.
- The corpus has a hierarchical structure. With 4 files per sentence and 10 sentences for each of 500 speakers, there are 20,000 files. These are organized into a tree structure whose top level splits into training and testing sets, for developing and evaluating statistical models (see the sketch below).
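This hierarchy is easy to explore programmatically. The sketch below uses utteranceids() from NLTK's TIMIT corpus reader on the sample data shipped with NLTK; the grouping by speaker is our own illustration, not part of the original text.

ids = nltk.corpus.timit.utteranceids()
# An id such as 'dr1-fvmh0/sa1' encodes dialect region (dr1),
# speaker sex and identifier (fvmh0), and sentence type/number (sa1).
speakers = nltk.Index((utt.split('/')[0], utt) for utt in ids)
print(len(speakers), "speakers,", len(ids), "utterances")
print(sorted(speakers)[:3])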
Fundamental Data Types
The TIMIT corpus contains only two fundamental data types: lexicons and texts.
11.2 The Life Cycle of a Corpus
Three Corpus Creation Scenarios
- In one type of corpus, the design unfolds in the course of the creator's explorations. This is the pattern typical of traditional "field linguistics," in which material from elicitation sessions is analyzed as it is gathered, and tomorrow's elicitation is often based on questions that arise in analyzing today's.
- Another corpus creation scenario is typical of experimental research, where a body of carefully designed material is collected from a range of human subjects, then analyzed to evaluate a hypothesis or develop a technology.
- Finally, there are "reference corpora" for particular languages, such as the American National Corpus (ANC) and the British National Corpus (BNC). Here the goal is to produce a comprehensive record of a language's varieties of forms, styles, and uses.
Quality Control
- The Kappa coefficient κ measures agreement between two people making category judgments, corrected for expected chance agreement.
For example, suppose an item is to be annotated and four coding options are equally likely. Then two people coding at random would be expected to agree 25% of the time. An agreement of 25% is therefore assigned κ = 0, and better levels of agreement are scaled accordingly. For an agreement of 50%, we get κ = 0.333, since 50 is one third of the way from 25 to 100.
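The scaling just described is one line of arithmetic. Below is a sketch of the chance-correction formula itself; for real annotation tasks, nltk.metrics.agreement provides a full implementation.

def kappa(observed, expected):
    # Chance-corrected agreement: where observed agreement falls
    # between chance level (kappa = 0) and perfect agreement (kappa = 1).
    return (observed - expected) / (1 - expected)

print(kappa(0.50, 0.25))  # 0.333..., the example above
print(kappa(0.25, 0.25))  # 0.0: no better than chance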
- We can also measure the agreement between two independent segmentations of language input, e.g., for tokenization, sentence segmentation, or named entity recognition.
Windowdiff is a simple algorithm for evaluating the agreement of two segmentations: it slides a window over the data and awards partial credit for near misses.
s1 = "00000010000000001000000"
s2 = "00000001000000010000000"
s3 = "00010000000000000001000"
nltk.windowdiff(s1, s1, 3)  # window size 3
0.0
nltk.windowdiff(s1, s2, 3)
0.19047619047619047
nltk.windowdiff(s2, s3, 3)
0.5714285714285714
Windowdiff moves a sliding window across a pair of strings. At each position it counts the number of boundaries each string has inside the window, takes the difference, and accumulates these differences. The window size can be increased or decreased to make the measure more or less sensitive.
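To make the computation concrete, here is a minimal re-implementation sketch of the unweighted measure; in practice, nltk.windowdiff is the canonical version.

def windowdiff_sketch(seg1, seg2, k, boundary="1"):
    # Count window positions where the two segmentations disagree on
    # how many boundaries the window contains, then normalize.
    if len(seg1) != len(seg2):
        raise ValueError("segmentations must have equal length")
    positions = len(seg1) - k + 1
    wd = sum(seg1[i:i+k].count(boundary) != seg2[i:i+k].count(boundary)
             for i in range(positions))
    return wd / positions

print(windowdiff_sketch(s1, s2, 3))  # agrees with nltk.windowdiff(s1, s2, 3) above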
Curation vs. Evolution
11.3 Acquiring Data
Obtaining Data from the Web
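A minimal sketch of pulling raw text over HTTP, assuming a stable plain-text resource (the Project Gutenberg URL below is only an example):

from urllib import request
url = "https://www.gutenberg.org/files/2554/2554-0.txt"  # example URL
raw = request.urlopen(url).read().decode("utf8")
print(len(raw))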
Obtaining Data from Word Processor Files
legal_pos = set(['n', 'v.t.', 'v.i.', 'adj', 'det'])
pattern = re.compile(r"'font-size:11.0pt'>([a-z.]+)<")
document = open("dict.htm", encoding="windows-1252").read()
used_pos = set(re.findall(pattern, document))
illegal_pos = used_pos.difference(legal_pos)
illegal_pos
set()
Example 11-1. Converting HTML created by Microsoft Word into CSV
from bs4 import BeautifulSoup
def lexical_data(html_file, encoding="utf-8"):
    SEP = '_ENTRY'
    html = open(html_file, encoding=encoding).read()
    html = re.sub(r'<p', SEP + '<p', html)
    # Name a parser explicitly so the output doesn't depend on which
    # parser BeautifulSoup happens to find on the system.
    text = BeautifulSoup(html, "html.parser").get_text()
    text = ' '.join(text.split())
    for entry in text.split(SEP):
        if entry.count(' ') > 2:
            yield entry.split(' ', 3)
import csv
writer = csv.writer(open("dict1.csv", "w", encoding="utf-8"))
writer.writerows(lexical_data("dict.htm", encoding="windows-1252"))
Obtaining Data from Spreadsheets and Databases
import csv
lexicon = csv.reader(open('dict.csv', encoding="utf-8"))
pairs = [(lexeme, defn) for (lexeme, _, _, defn) in lexicon]
lexemes, defns = zip(*pairs)
defn_words = set(w for defn in defns for w in defn.split())
sorted(defn_words.difference(lexemes))
['...',
'a',
'and',
'body',
'by',
'cease',
'condition',
'down',
'each',
'foot',
'lifting',
'mind',
'of',
'progress',
'setting',
'sleep',
'to']
Converting Data Formats
idx = nltk.Index((defn_word, lexeme)
                 for (lexeme, defn) in pairs
                 for defn_word in nltk.word_tokenize(defn)
                 if len(defn_word) > 3)
with open("dict.idx", "w", encoding="utf-8") as idx_file:
    for word in sorted(idx):
        idx_words = ', '.join(idx[word])
        idx_line = "{}: {}".format(word, idx_words)
        print(idx_line, file=idx_file)
Deciding Which Layers of Annotation to Include
- Word tokenization: The orthographic form of text does not unambiguously identify its tokens. A tokenized and normalized version, in addition to the conventional orthographic version, may be a very convenient resource.
- Sentence segmentation: As we saw in Chapter 3, sentence segmentation can be harder than it seems. Some corpora therefore use explicit annotations to mark it.
- Paragraph segmentation: Paragraphs and other structural elements (headings, chapters, etc.) may be explicitly annotated.
- Part-of-speech: The syntactic category of each word in a document (see the sketch after this list).
- Syntactic structure: A tree structure showing the constituent structure of a sentence.
- Shallow semantics: Named entity and coreference annotations, and semantic role labels.
- Dialogue and discourse: Dialogue act tags and rhetorical structure.
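As a quick illustration, a treebank-style corpus exposes several of these layers through different reader methods. The sketch below uses the Penn Treebank sample included with NLTK; the file id 'wsj_0001.mrg' is from that sample.

print(nltk.corpus.treebank.words('wsj_0001.mrg')[:8])         # tokenized text
print(nltk.corpus.treebank.tagged_words('wsj_0001.mrg')[:8])  # part-of-speech layer
print(nltk.corpus.treebank.parsed_sents('wsj_0001.mrg')[0])   # syntactic structure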
Standards and Tools
Special Considerations When Working with Endangered Languages
mappings = [('ph', 'f'), ('ght', 't'), ('^kn', 'n'), ('qu', 'kw'),
            ('[aeiou]+', 'a'), (r'(.)\1', r'\1')]

def signature(word):
    for patt, repl in mappings:
        word = re.sub(patt, repl, word)
    pieces = re.findall('[^aeiou]+', word)
    return ''.join(char for piece in pieces for char in sorted(piece))[:8]
signature('illefent')
'lfnt'
signature('ebsekwieous')
'bskws'
signature('nuculerr')
'nclr'
signatures = nltk.Index((signature(w), w) for w in nltk.corpus.words.words())
signatures[signature('nuculerr')]
['anicular',
'inocular',
'nucellar',
'nuclear',
'unicolor',
'uniocular',
'unocular']
def rank(word, wordlist):
    ranked = sorted((nltk.edit_distance(word, w), w) for w in wordlist)
    return [word for (_, word) in ranked]

def fuzzy_spell(word):
    sig = signature(word)
    if sig in signatures:
        return rank(word, signatures[sig])
    else:
        return []
fuzzy_spell('illefent')
['olefiant', 'elephant', 'oliphant', 'elephanta']
fuzzy_spell('ebsekwieous')
['obsequious']
fuzzy_spell('nucular')
['anicular',
'inocular',
'nucellar',
'nuclear',
'unocular',
'uniocular',
'unicolor']
11.4 Working with XML
Using XML for Linguistic Structures
With the markup stripped away, a dictionary entry for "whale" runs its fields together ("whale noun any of the larger cetacean mammals ..."). With explicit XML markup, each field is clearly identified:

<entry>
  <headword>whale</headword>
  <pos>noun</pos>
  <gloss>any of the larger cetacean mammals having a streamlined
    body and breathing through a blowhole on the head</gloss>
</entry>

Nested elements let us represent multiple senses, each linked to a WordNet synset:

<entry>
  <headword>whale</headword>
  <pos>noun</pos>
  <sense>
    <gloss>any of the larger cetacean mammals having a streamlined
      body and breathing through a blowhole on the head</gloss>
    <synset>whale.n.02</synset>
  </sense>
  <sense>
    <gloss>a very large person; impressive in size or qualities</gloss>
    <synset>giant.n.04</synset>
  </sense>
</entry>

The Role of XML
The ElementTree Interface
merchant_file = nltk.data.find('corpora/shakespeare/merchant.xml')
raw = open(merchant_file).read()
print(raw[:163])
<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="shakes.css"?>
<!-- <!DOCTYPE PLAY SYSTEM "play.dtd"> -->
<PLAY>
<TITLE>The Merchant of Venice</TITLE>
print(raw[1789:2006])
<TITLE>ACT I</TITLE>
<SCENE><TITLE>SCENE I. Venice. A street.</TITLE>
<STAGEDIR>Enter ANTONIO, SALARINO, and SALANIO</STAGEDIR>
<SPEECH>
<SPEAKER>ANTONIO</SPEAKER>
<LINE>In sooth, I know not why I am so sad:</LINE>
from xml.etree.ElementTree import ElementTree
merchant = ElementTree().parse(merchant_file)
merchant
<Element 'PLAY' at 0x000000000C3BDE08>
merchant[0]
<Element 'TITLE' at 0x000000000B171CC8>
merchant[0].text
'The Merchant of Venice'
list(merchant)  # Element objects are iterable; getchildren() was removed in Python 3.9
[<Element 'TITLE' at 0x000000000B171CC8>,
<Element 'PERSONAE' at 0x000000000C580548>,
<Element 'SCNDESCR' at 0x000000000C58E908>,
<Element 'PLAYSUBT' at 0x000000000C58E958>,
<Element 'ACT' at 0x000000000C58E9A8>,
<Element 'ACT' at 0x000000000C5C9BD8>,
<Element 'ACT' at 0x000000000C5FA3B8>,
<Element 'ACT' at 0x000000000C626958>,
<Element 'ACT' at 0x000000000C6493B8>]
merchant[-2][0].text
'ACT IV'
merchant[-2][1]
<Element 'SCENE' at 0x000000000C6269F8>
merchant[-2][1][0].text
'SCENE I. Venice. A court of justice.'
merchant[-2][1][54]
<Element 'SPEECH' at 0x000000000C632AE8>
merchant[-2][1][54][0]
<Element 'SPEAKER' at 0x000000000C632B38>
merchant[-2][1][54][0].text
'PORTIA'
merchant[-2][1][54][1]
<Element 'LINE' at 0x000000000C632B88>
merchant[-2][1][54][1].text
"The quality of mercy is not strain'd,"
for i, act in enumerate(merchant.findall('ACT')):
    for j, scene in enumerate(act.findall('SCENE')):
        for k, speech in enumerate(scene.findall('SPEECH')):
            for line in speech.findall('LINE'):
                if 'music' in str(line.text):
                    print("Act %d Scene %d Speech %d: %s" % (i+1, j+1, k+1, line.text))
Act 3 Scene 2 Speech 9: Let music sound while he doth make his choice;
Act 3 Scene 2 Speech 9: Fading in music: that the comparison
Act 3 Scene 2 Speech 9: And what is music then? Then music is
Act 5 Scene 1 Speech 23: And bring your music forth into the air.
Act 5 Scene 1 Speech 23: Here will we sit and let the sounds of music
Act 5 Scene 1 Speech 23: And draw her home with music.
Act 5 Scene 1 Speech 24: I am never merry when I hear sweet music.
Act 5 Scene 1 Speech 25: Or any air of music touch their ears,
Act 5 Scene 1 Speech 25: By the sweet power of music: therefore the poet
Act 5 Scene 1 Speech 25: But music for the time doth change his nature.
Act 5 Scene 1 Speech 25: The man that hath no music in himself,
Act 5 Scene 1 Speech 25: Let no such man be trusted. Mark the music.
Act 5 Scene 1 Speech 29: It is your music, madam, of the house.
Act 5 Scene 1 Speech 32: No better a musician than the wren.
from collections import Counter
speaker_seq = [s.text for s in merchant.findall('ACT/SCENE/SPEECH/SPEAKER')]
speaker_freq = Counter(speaker_seq)
top5 = speaker_freq.most_common(5)
top5
[('PORTIA', 117),
('SHYLOCK', 79),
('BASSANIO', 73),
('GRATIANO', 48),
('ANTONIO', 47)]
from collections import defaultdict
abbreviate = defaultdict(lambda: 'OTH')
for speaker, _ in top5:
    abbreviate[speaker] = speaker[:4]
speaker_seq2 = [abbreviate[speaker] for speaker in speaker_seq]
cfd = nltk.ConditionalFreqDist(nltk.bigrams(speaker_seq2))
cfd.tabulate()
ANTO BASS GRAT OTH PORT SHYL
ANTO 0 11 4 11 9 12
BASS 10 0 11 10 26 16
GRAT 6 8 0 19 9 5
OTH 8 16 18 153 52 25
PORT 7 23 13 53 0 21
SHYL 15 15 2 26 21 0
Using ElementTree for Accessing Toolbox Data
from nltk.corpus import toolbox
lexicon = toolbox.xml('rotokas.dic')
lexicon[3][0]
<Element 'lx' at 0x000000000C671048>
lexicon[3][0].tag
'lx'
lexicon[3][0].text
'kaa'
[lexeme.text.lower() for lexeme in lexicon.findall('record/lx')][:5]
['kaa', 'kaa', 'kaa', 'kaakaaro', 'kaakaaviko']
import sys
from nltk.util import elementtree_indent
from xml.etree.ElementTree import ElementTree
elementtree_indent(lexicon)
tree = ElementTree(lexicon[3])
tree.write(sys.stdout, encoding='unicode')
<record>
<lx>kaa</lx>
<ps>N</ps>
<pt>MASC</pt>
<cl>isi</cl>
<ge>cooking banana</ge>
<tkp>banana bilong kukim</tkp>
<pt>itoo</pt>
<sf>FLORA</sf>
<dt>12/Aug/2005</dt>
<ex>Taeavi iria kaa isi kovopaueva kaparapasia.</ex>
<xp>Taeavi i bin planim gaden banana bilong kukim tasol long paia.</xp>
<xe>Taeavi planted banana in order to cook it.</xe>
</record>
Formatting Entries
html = "<table>\n"
for entry in lexicon[70:80]:
    lx = entry.findtext('lx')
    ps = entry.findtext('ps')
    ge = entry.findtext('ge')
    html += " <tr><td>%s</td><td>%s</td><td>%s</td></tr>\n" % (lx, ps, ge)
html += "</table>"
print(html)
<table>
<tr><td>kakae</td><td>???</td><td>small</td></tr>
<tr><td>kakae</td><td>CLASS</td><td>child</td></tr>
<tr><td>kakaevira</td><td>ADV</td><td>small-like</td></tr>
<tr><td>kakapikoa</td><td>???</td><td>small</td></tr>
<tr><td>kakapikoto</td><td>N</td><td>newborn baby</td></tr>
<tr><td>kakapu</td><td>V</td><td>place in sling for purpose of carrying</td></tr>
<tr><td>kakapua</td><td>N</td><td>sling for lifting</td></tr>
<tr><td>kakara</td><td>N</td><td>arm band</td></tr>
<tr><td>Kakarapaia</td><td>N</td><td>village name</td></tr>
<tr><td>kakarau</td><td>N</td><td>frog</td></tr>
</table>
11.5 Working with Toolbox Data
from nltk.corpus import toolbox
lexicon = toolbox.xml('rotokas.dic')
sum(len(entry) for entry in lexicon) / len(lexicon)
13.635955056179775
Adding a Field to Each Entry
from xml.etree.ElementTree import SubElement
def cv(s):
    s = s.lower()
    s = re.sub(r'[^a-z]', r'_', s)
    s = re.sub(r'[aeiou]', r'V', s)
    s = re.sub(r'[^V_]', r'C', s)
    return s

def add_cv_field(entry):
    for field in entry:
        if field.tag == 'lx':
            cv_field = SubElement(entry, 'cv')
            cv_field.text = cv(field.text)
lexicon = toolbox.xml('rotokas.dic')
add_cv_field(lexicon[53])
print(nltk.toolbox.to_sfm_string(lexicon[53]))
\lx kaeviro
\ps V
\pt A
\ge lift off
\ge take off
\tkp go antap
\sc MOTION
\vx 1
\nt used to describe action of plane
\dt 03/Jun/2005
\ex Pita kaeviroroe kepa kekesia oa vuripierevo kiuvu.
\xp Pita i go antap na lukim haus win i bagarapim.
\xe Peter went to look at the house that the wind destroyed.
\cv CVVCVCV
Validating a Toolbox Lexicon
from collections import Counter
field_sequences = Counter(':'.join(field.tag for field in entry) for entry in lexicon)
field_sequences.most_common()[:5]
[('lx:ps:pt:ge:tkp:dt:ex:xp:xe', 41),
('lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe', 37),
('lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe:ex:xp:xe', 27),
('lx:ps:pt:ge:tkp:nt:dt:ex:xp:xe', 20),
('lx:ps:pt:ge:tkp:nt:dt:ex:xp:xe:ex:xp:xe', 17)]
Example 11-3. Validating Toolbox entries using a context-free grammar
grammar = nltk.CFG.fromstring('''
S -> Head PS Glosses Comment Date Sem_Field Examples
Head -> Lexeme Root
Lexeme -> "lx"
Root -> "rt" |
PS -> "ps"
Glosses -> Gloss Glosses |
Gloss -> "ge" | "tkp" | "eng"
Date -> "dt"
Sem_Field -> "sf"
Examples -> Example Ex_Pidgin Ex_English Examples |
Example -> "ex"
Ex_Pidgin -> "xp"
Ex_English -> "xe"
Comment -> "cmt" | "nt" |
''')
def validate_lexicon(grammar, lexicon, ignored_tags):
    rd_parser = nltk.RecursiveDescentParser(grammar)
    for entry in lexicon:
        marker_list = [field.tag for field in entry if field.tag not in ignored_tags]
        if list(rd_parser.parse(marker_list)):
            print("+", ':'.join(marker_list))
        else:
            print("-", ':'.join(marker_list))
lexicon = toolbox.xml('rotokas.dic')[10:20]
ignored_tags = ['arg', 'dcsv', 'pt', 'vx']
validate_lexicon(grammar, lexicon, ignored_tags)
- lx:ps:ge:tkp:sf:nt:dt:ex:xp:xe:ex:xp:xe:ex:xp:xe
- lx:rt:ps:ge:tkp:nt:dt:ex:xp:xe:ex:xp:xe
- lx:ps:ge:tkp:nt:dt:ex:xp:xe:ex:xp:xe
- lx:ps:ge:tkp:nt:sf:dt
- lx:ps:ge:tkp:dt:cmt:ex:xp:xe:ex:xp:xe
- lx:ps:ge:ge:ge:tkp:cmt:dt:ex:xp:xe
- lx:rt:ps:ge:ge:tkp:dt
- lx:rt:ps:ge:eng:eng:eng:ge:tkp:tkp:dt:cmt:ex:xp:xe:ex:xp:xe:ex:xp:xe:ex:xp:xe:ex:xp:xe
- lx:rt:ps:ge:tkp:dt:ex:xp:xe
- lx:ps:ge:ge:tkp:dt:ex:xp:xe:ex:xp:xe
Chunking a Toolbox Lexicon
grammar = r"""
lexfunc: {<lf>(<lv><ln|le>*)*}
example: {<rf|xv><xn|xe>*}
sense: {<sn><ps><pn|gv|dv|gn|gp|dn|rn|ge|de|re>*<example>*<lexfunc>*}
record: {<lx><hm><sense>+<dt>}
"""
#coding=utf-8
from xml.etree.ElementTree import ElementTree
from nltk.toolbox import ToolboxData
db = ToolboxData()
db.open(nltk.data.find('corpora/toolbox/iu_mien_samp.db'))
# lexicon = db.parse(grammar, encoding='utf8')
tree = ElementTree(lexicon)
# with open("iu_mien_samp.xml", "wb") as output:
# tree.write(output)
11.6 Describing Language Resources Using OLAC Metadata
What Is Metadata?
OLAC: Open Language Archives Community
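An OLAC record is a Dublin Core record with refinements for language resources. The sketch below is illustrative only: the dc: elements are standard Dublin Core, while the olac:code attributes follow the community's convention for identifying subject language and resource type (the namespace URIs and code values shown are indicative, not copied from a real record).

<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title>A grammar of Kayardild</dc:title>
  <dc:creator>Evans, Nicholas</dc:creator>
  <dc:subject xsi:type="olac:language" olac:code="gyd"/>
  <dc:type xsi:type="olac:linguistic-type" olac:code="language_description"/>
</olac:olac>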
11.7 Summary
- The fundamental data types in most corpora are annotated texts and lexicons. Texts have a temporal structure, whereas lexicons have a record structure.
- The life cycle of a corpus includes data collection, annotation, quality control, and publication. The life cycle continues after publication, as the corpus is modified and enriched in the course of research.
- Corpus development involves a balance between capturing a representative sample of language usage and capturing enough material from any one source or genre to be useful; multiplying out the dimensions of variability is usually not feasible because of resource limitations.
- XML provides a useful format for the storage and interchange of linguistic data, but it provides no shortcuts for solving pervasive data modeling problems.
- The Toolbox format is widely used in language documentation projects; we can write programs to support the curation of Toolbox files and to convert them to XML.
- The Open Language Archives Community (OLAC) provides an infrastructure for documenting and discovering language resources.