Commonly Used Datasets in Information Extraction

说文科技, 2022-01-12 16:18:09. Blog category: nlp

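© Copyright belongs to the author: this is an original work by 说文科技, an author on the 51CTO blog. Please contact the author for permission to repost; otherwise legal liability will be pursued.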

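Preface

This article describes the datasets commonly used in NLP information extraction and comments on some key points of each.
The quality of the DocRED dataset is very poor, so please get to know it well before starting on this task!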

1. Document level

ACE05

Collected from various sources, including news articles and online forums.
Provides entity, relation, and event annotations for a collection of documents from a variety of domains.
Lacks coreference annotations.

2. Sentence level

2.1 NYT

Sampled from New York Times news articles and annotated by distant supervision.

WebNLG: originally created for the Natural Language Generation task and adopted by some authors as a relation extraction dataset.

2.1 ADE

Contains medical descriptions of adverse effects of drug use.

2.2 SciERC

Collected from 500 AI paper abstracts; originally used for scientific knowledge graph construction.

2.3 GENIA

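Provides entity tags and coreferences for 1,999 abstracts from the biomedical research literature, with a substantial portion of entities (24%) overlapping some other entity.
What does "some other entity" mean here? Most likely it refers to nested or overlapping mentions: an entity whose span shares tokens with the span of another annotated entity.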
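To make the 24% figure more concrete, here is a minimal sketch that counts how many entity spans overlap at least one other span. The spans are made up, the helper names overlaps and overlapping_fraction are hypothetical, and end offsets are assumed to be exclusive; this is not GENIA's own tooling.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive (assumed)


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two half-open spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def overlapping_fraction(spans: List[Span]) -> float:
    """Fraction of spans that overlap at least one other span in the list."""
    if not spans:
        return 0.0
    count = 0
    for i, a in enumerate(spans):
        if any(overlaps(a, b) for j, b in enumerate(spans) if i != j):
            count += 1
    return count / len(spans)


# Made-up example: the second span is nested inside the first one.
example_spans = [(0, 5), (2, 4), (7, 9), (10, 12)]
print(overlapping_fraction(example_spans))
```

With this reading, a nested mention and the mention that contains it both count as "overlapping some other entity".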

2.4 `WLPC`

Provides entity, relation, and event annotations for 622 wet lab protocols.

2.5 `DocRED`

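This section looks at the DocRED dataset in detail, since most document-level information extraction experiments are likely to be built around it.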

2.5.1 Dataset location

2.5.2 Dataset format

The content of the dataset (taking only the train_annotated.json file as an example) looks like this:

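[Figure: the top-level fields of one sample in train_annotated.json]

In other words, every sample consists of the four parts shown above: title, sents, vertexSet, and labels.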

vertexSet is a list with one item per entity. Each item is itself a list of mention dictionaries, and each dictionary looks like this:

{
    "name": "Zest Airways, Inc.",
    "type": "ORG",
    "sent_id": 0,
    "pos": [
        0,
        4
    ]
},

name: the surface text of the mention

type: the type of the mention

sent_id: the index of the sentence in which the mention occurs

pos: the start position and end position of the mention within that sentence

Note that vertexSet is a list and that each item describes one entity through its mentions. The index of an item in this list starts at 0.
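As a small sketch of how these fields fit together, the snippet below recovers the tokens of a mention from sent_id and pos. The mention dictionary is the one shown above; the tokenized sentence is made up for illustration, the helper name mention_tokens is hypothetical, and pos is assumed to hold token offsets with an exclusive end.

```python
from typing import Dict, List

# Mention dictionary copied from the example above.
mention = {"name": "Zest Airways, Inc.", "type": "ORG", "sent_id": 0, "pos": [0, 4]}

# Hypothetical tokenized sentences, for illustration only.
sents: List[List[str]] = [["Zest", "Airways", ",", "Inc.", "is", "an", "airline", "."]]


def mention_tokens(mention: Dict, sents: List[List[str]]) -> List[str]:
    """Return the tokens covered by a mention, assuming pos = [start, end)."""
    start, end = mention["pos"]
    return sents[mention["sent_id"]][start:end]


print(mention_tokens(mention, sents))
```

Under these assumptions, pos [0, 4] covers the four tokens that make up the name, with the comma counted as its own token.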

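labels
The labels attached to this document.
[Figure: one entry of labels, linking head entity 0 and tail entity 2 with relation P159]
This label says that the relation between entity 0 (the head) and entity 2 (the tail) is P159. Both entities can be looked up in vertexSet by index; in this example the result is: Zest Airways, Inc. ----P159----> Pasay city. According to the relation table, P159 means headquarters location.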
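The lookup described here can be sketched as follows. Several things are assumptions rather than content from the article: the label keys h, t, and r, the layout of vertexSet as one inner list of mentions per entity, the placeholder for entity 1, and the helper name resolve. Only the two entity names and the meaning of P159 come from the text.

```python
from typing import Dict, List, Tuple

# Toy vertexSet: one inner list of mentions per entity (assumed layout).
# Only the names of entities 0 and 2 come from the article; entity 1 is a placeholder.
vertex_set: List[List[Dict]] = [
    [{"name": "Zest Airways, Inc."}],
    [{"name": "<placeholder entity>"}],
    [{"name": "Pasay city"}],
]

# Assumed label layout: h = head entity index, t = tail entity index, r = relation id.
labels = [{"h": 0, "t": 2, "r": "P159"}]

# Relation id to readable name; only the entry mentioned in the article.
rel_names = {"P159": "headquarters location"}


def resolve(label: Dict, vertex_set: List[List[Dict]],
            rel_names: Dict[str, str]) -> Tuple[str, str, str]:
    """Turn one label into (head name, relation name, tail name)."""
    head = vertex_set[label["h"]][0]["name"]  # first mention of the head entity
    tail = vertex_set[label["t"]][0]["name"]  # first mention of the tail entity
    relation = rel_names.get(label["r"], label["r"])  # fall back to the raw id
    return head, relation, tail


for label in labels:
    print(resolve(label, vertex_set, rel_names))
```

Using the first mention of each entity as its display name is a simplification; an entity may have several mentions with different surface forms.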