Category Names and Definitions

茗君 (Major_S), 2021-08-05 11:12:45

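© Copyright belongs to the author. This is an original work by 51CTO blog author 茗君 (Major_S); contact the author for permission before reposting, otherwise legal action may be taken.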

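Disease-related:

1. Disease name (Disease), e.g. type 1 diabetes.
2. Cause (Reason): the etiology, risk factors, and mechanisms of a disease. For example, in "diabetes is caused by insulin resistance", "insulin resistance" is the cause.
3. Clinical manifestation (Symptom): symptoms and signs, i.e. what the patient presents directly and what a physician determines through physical examination, e.g. "dizziness" or "blood in the stool".
4. Test method (Test): laboratory tests, imaging examinations, auxiliary tests, and other items with diagnostic or differential value for a disease, e.g. triglycerides.
5. Test value (Test_Value): the concrete value of an indicator, including negative/positive, present/absent, increased/decreased, and high/low, e.g. ">11.3 mmol/L".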

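Treatment-related:

6. Drug name (Drug): both routine drugs and chemotherapy drugs, e.g. insulin.
7. Frequency (Frequency): how often a drug is taken or how often a symptom occurs, e.g. twice a day.
8. Dosage (Amount), e.g. 500mg/d.
9. Administration method (Method), e.g. morning and evening, before or after meals, oral, intravenous injection, inhalation.
10. Non-drug treatment (Treatment): non-pharmacological treatment carried out in a hospital setting, including radiotherapy and traditional Chinese medicine methods such as tuina, massage, acupuncture, and physiotherapy. Diet, exercise, and nutrition are excluded.
11. Surgery (Operation): names of surgical procedures, e.g. metabolic surgery.
12. Adverse reaction (SideEff): adverse reactions that occur after taking a drug.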

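General entities:

13. Body part (Anatomy): anatomical sites and biological tissue, e.g. the parts and organs of the human body, islet cells.
14. Degree (Level): the severity of a condition, the degree of relief after treatment, and so on.
15. Duration (Duration): how long a symptom lasts or how long a drug is taken, e.g. "one week" in "dizzy for one week".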

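Entity relation category names

1. Test method -> Disease (Test_Disease)
2. Clinical manifestation -> Disease (Symptom_Disease)
3. Non-drug treatment -> Disease (Treatment_Disease)
4. Drug name -> Disease (Drug_Disease)
5. Body part -> Disease (Anatomy_Disease)
6. Frequency -> Drug name (Frequency_Drug)
7. Duration -> Drug name (Duration_Drug)
8. Dosage -> Drug name (Amount_Drug)
9. Administration method -> Drug name (Method_Drug)
10. Adverse reaction -> Drug name (SideEff-Drug)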
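Each relation name pairs a source entity type with a target entity type, so the list above can be written down as a small lookup table. The following is only a sketch: `RELATION_SCHEMA` and `is_valid_relation` are hypothetical names introduced here for illustration and are not part of the dataset or its tooling.

```python
# Hypothetical lookup table built from the relation list above:
# relation name -> (source entity type, target entity type).
RELATION_SCHEMA = {
    'Test_Disease': ('Test', 'Disease'),
    'Symptom_Disease': ('Symptom', 'Disease'),
    'Treatment_Disease': ('Treatment', 'Disease'),
    'Drug_Disease': ('Drug', 'Disease'),
    'Anatomy_Disease': ('Anatomy', 'Disease'),
    'Frequency_Drug': ('Frequency', 'Drug'),
    'Duration_Drug': ('Duration', 'Drug'),
    'Amount_Drug': ('Amount', 'Drug'),
    'Method_Drug': ('Method', 'Drug'),
    'SideEff-Drug': ('SideEff', 'Drug'),  # hyphenated, as in the list above
}


def is_valid_relation(relation, source_type, target_type):
    """Return True if the two entity types match the schema for this relation."""
    return RELATION_SCHEMA.get(relation) == (source_type, target_type)
```

A check such as `is_valid_relation('Drug_Disease', 'Drug', 'Disease')` could be used to flag annotations whose argument types do not match the relation name.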

import os
import numpy as np
import pandas as pd
import codecs
import re
from collections import namedtuple, Counter, defaultdict
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

1. Reading the annotated data

data_folder = r'./dataset/ann'
# Collect every annotation file (*.ann) in the data folder
ann_files = [os.path.join(data_folder, file) for file in os.listdir(data_folder) if file.endswith('.ann')]
ann_files[:10]


Entity = namedtuple('Entity', ['tag', 'context'])
Relation = namedtuple('Relation', ['relation', 'entity1', 'entity2'])

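def get_tag_item(tag_item):
    # Entity line layout implied by the parsing below: "<id>\t<tag> <offsets>\t<text>"
    temp = tag_item.split('\t')
    tag_index = temp[0]
    # Normalize the entity text: drop spaces and lowercase
    tag_context = temp[-1].replace(' ', '').lower()
    tag = temp[1].split(' ')[0]
    return tag_index, tag, tag_context

def get_relation_item(relation_item):
    # Relation line layout implied by the parsing below: "<id>\t<relation> Arg1:<id> Arg2:<id>"
    # Take field 1 rather than the last field so a trailing tab does not break parsing
    relation_detail = relation_item.split('\t')[1]
    temp = relation_detail.split(' ')
    relation = temp[0]
    # Keep only the entity id after the "Arg1:" / "Arg2:" prefix
    arg1 = temp[1].split(':')[1]
    arg2 = temp[2].split(':')[1]
    return relation, arg1, arg2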

def read_ann_file(path):
    with codecs.open(path, 'r', encoding='utf-8') as f:
        context = f.read()
        lines = [line for line in context.split('\n') if line]
        
    # Extract entities
    tag_items = [line for line in lines if line[0] == 'T']
    tag_dict = {}
    tags = []
    for tag_item in tag_items:
        tag_index, tag, tag_context = get_tag_item(tag_item)
        tag_dict[tag_index] = tag_context
        tags.append(Entity(tag=tag, context=tag_context))
        
    # Extract entity relation pairs
    relation_items = [line for line in lines if line[0] == 'R']
    relations = []
    for relation_item in relation_items:
        relation, arg1, arg2 = get_relation_item(relation_item)
        entity1 = tag_dict.get(arg1, '')
        entity2 = tag_dict.get(arg2, '')
        if entity1 and entity2:
            relations.append(Relation(relation=relation, entity1=entity1, entity2=entity2))
    return tags, relations
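
To make the expected input concrete, here is a small sketch that parses a few lines in the tab-separated layout the functions above assume (`ID<TAB>tag start end<TAB>text` for entities, `ID<TAB>relation Arg1:ID Arg2:ID` for relations). The sample lines, their offsets, and the `parse_lines` helper are made up for illustration; they are not taken from the dataset.

```python
from collections import namedtuple

Entity = namedtuple('Entity', ['tag', 'context'])
Relation = namedtuple('Relation', ['relation', 'entity1', 'entity2'])

# Made-up sample in the layout assumed above (offsets are illustrative only).
SAMPLE_LINES = [
    'T1\tDisease 0 3\t糖尿病',
    'T2\tDrug 10 13\t胰岛素',
    'T3\tFrequency 14 18\t一天两次',
    'R1\tDrug_Disease Arg1:T2 Arg2:T1',
    'R2\tFrequency_Drug Arg1:T3 Arg2:T2',
]


def parse_lines(lines):
    """Parse entity (T) and relation (R) lines in the same spirit as read_ann_file."""
    tag_dict, tags, relations = {}, [], []
    # First pass: entities, so relation arguments can be resolved afterwards.
    for line in lines:
        if line.startswith('T'):
            index, detail, text = line.split('\t')
            text = text.replace(' ', '').lower()
            tag_dict[index] = text
            tags.append(Entity(tag=detail.split(' ')[0], context=text))
    # Second pass: relations, replacing entity ids with entity text.
    for line in lines:
        if line.startswith('R'):
            relation, arg1, arg2 = line.split('\t')[1].split(' ')
            entity1 = tag_dict.get(arg1.split(':')[1], '')
            entity2 = tag_dict.get(arg2.split(':')[1], '')
            if entity1 and entity2:
                relations.append(Relation(relation, entity1, entity2))
    return tags, relations


sample_tags, sample_relations = parse_lines(SAMPLE_LINES)
print(sample_tags[0])       # Entity(tag='Disease', context='糖尿病')
print(sample_relations[0])  # Relation(relation='Drug_Disease', entity1='胰岛素', entity2='糖尿病')
```

Note that relations store the entity text rather than the entity id, so two different mentions with the same text become indistinguishable once parsed.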

total_tags = []
total_relations = []
for ann_file in ann_files:
    tags, relations = read_ann_file(ann_file)
    total_tags.extend(tags)
    total_relations.extend(relations)
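
Once `total_tags` and `total_relations` are collected, a quick frequency count is a natural sanity check. The sketch below uses `Counter` (already imported above) on a small made-up list standing in for `total_tags`; the `count_tags` helper and the sample data are hypothetical and do not reflect the actual dataset.

```python
from collections import Counter, namedtuple

Entity = namedtuple('Entity', ['tag', 'context'])

# Made-up stand-in for total_tags; real counts come from the .ann files.
example_tags = [
    Entity('Disease', '糖尿病'),
    Entity('Drug', '胰岛素'),
    Entity('Disease', '糖尿病'),
    Entity('Symptom', '头晕'),
]


def count_tags(tags):
    """Count how many entities carry each tag."""
    return Counter(entity.tag for entity in tags)


tag_counts = count_tags(example_tags)
print(tag_counts.most_common())
```

The same pattern applies to relations by counting `relation.relation` over `total_relations`.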