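0 Project background
Information extraction aims to pull structured information out of unstructured natural-language text. This series of projects discusses how to build a résumé information extraction task both well and quickly.
In the preceding projects, we first used the Taskflow API provided by PaddleNLP to batch-extract basic résumé information, then worked out the path from converting the raw dataset into the UIE data format through to fine-tuning.
As the fourth article in the series, this one evaluates the fine-tuned résumé text extraction model and then uses the Taskflow API to quickly deploy a Streamlit-based application on AI Studio.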
1 Environment setup
# After the first upgrade, restart the kernel for the new version to take effect
!pip install --upgrade paddlenlp
# Install dependencies
!pip install python-docx
!pip install pypinyin
!pip install LAC
import datetime
import os
import cv2
import shutil
import numpy as np
import pandas as pd
import re
import json
from tqdm import tqdm
from docx import Document
from docx.shared import Inches
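2 Model evaluation
# Extract the checkpoints archive saved during training
!tar -xvf /home/aistudio/data/data40148/192080.tar
The training logs can be loaded to review the training run; they are stored in the PaddleNLP/applications/information_extraction/text/checkpoint/model_best/runs directory.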
%cd PaddleNLP/applications/information_extraction/text/
/home/aistudio/PaddleNLP/applications/information_extraction/text
!python evaluate.py \
--model_path ./checkpoint/model_best/checkpoint-6000 \
--test_path ./data/dev.txt \
--batch_size 16 \
--max_seq_len 1024
[2023-01-28 13:05:55,376] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint/model_best/checkpoint-6000'.
[2023-01-28 13:05:55,399] [ INFO] - loading configuration file ./checkpoint/model_best/checkpoint-6000/config.json
[2023-01-28 13:05:55,400] [ INFO] - Model config ErnieConfig {
"architectures": [
"UIE"
],
"attention_probs_dropout_prob": 0.1,
"dtype": "float32",
"enable_recompute": false,
"fuse": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 2048,
"model_type": "ernie",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"paddlenlp_version": null,
"pool_act": "tanh",
"task_id": 0,
"task_type_vocab_size": 3,
"type_vocab_size": 4,
"use_task_id": true,
"vocab_size": 40000
}
W0128 13:05:56.973994 21484 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0128 13:05:56.976924 21484 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-01-28 13:05:57,965] [ INFO] - All model checkpoint weights were used when initializing UIE.
[2023-01-28 13:05:57,965] [ INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint/model_best/checkpoint-6000.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-01-28 13:13:49,110] [ INFO] - -----------------------------
[2023-01-28 13:13:49,110] [ INFO] - Class Name: all_classes
[2023-01-28 13:13:49,110] [ INFO] - Evaluation Precision: 0.96656 | Recall: 0.96950 | F1: 0.96803
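As a sanity check on the evaluation log, the reported F1 can be recomputed from the reported precision and recall. The sketch below assumes the usual definition of F1 as the harmonic mean of the two; the helper name f1_score is ours for illustration and is not taken from evaluate.py.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the evaluation log.
precision, recall = 0.96656, 0.96950
f1 = f1_score(precision, recall)
print(f"F1: {f1:.5f}")  # F1: 0.96803
```

The recomputed value agrees with the logged F1 to five decimal places, which suggests the three numbers in the log are mutually consistent.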
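3 Model prediction
We can use paddlenlp.Taskflow to load the custom model. The task_path argument specifies the directory holding the model weights, which must contain the trained weights file model_state.pdparams.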
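Because the weights file has to be present in that directory, it can be convenient to check for it before loading the model. The following is only a minimal sketch: the helper missing_checkpoint_files is a hypothetical name introduced here, and the only required file it checks is the one named in the text (model_state.pdparams).

```python
import os


def missing_checkpoint_files(task_path, required=("model_state.pdparams",)):
    """Return the names in `required` that are not files inside task_path."""
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(task_path, name))
    ]


# Checkpoint directory used in this project.
task_path = "./checkpoint/model_best/checkpoint-6000"
missing = missing_checkpoint_files(task_path)
if missing:
    print(f"{task_path} is missing: {', '.join(missing)}")
else:
    print(f"{task_path} contains the expected weights file")
```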
from pprint import pprint
from paddlenlp import Taskflow
# Entity types to extract
schema = ['姓名', '出生年月', '电话', '性别', '项目名称', '项目责任', '项目时间', '籍贯', '政治面貌', '落户市县', '毕业院校', '学位', '毕业时间', '工作时间', '工作内容', '职务', '工作单位']
def get_paragraphs_text(path):
    """Return (table text, body text) of a .docx file, joined with ';'."""
    document = Document(path)
    # Some résumés are laid out as tables, so extract table cells as well as body text
    col_keys = []  # column names
    col_values = []  # column values
    index_num = 0
    # Table extraction needs de-duplication: skip a cell whose text repeats the previous cell's
    fore_str = ""
    cell_text = ""
    for table in document.tables:
        for row_index, row in enumerate(table.rows):
            for col_index, cell in enumerate(row.cells):
                if fore_str != cell.text:
                    if index_num % 2 == 0:
                        col_keys.append(cell.text)
                    else:
                        col_values.append(cell.text)
                    fore_str = cell.text
                    index_num += 1
                    # Join with ';' rather than a newline
                    cell_text += cell.text + ';'
    # Extract the body text
    paragraphs_text = ""
    for paragraph in document.paragraphs:
        # Concatenate the paragraphs, again joining with ';' rather than a newline
        paragraphs_text += paragraph.text + ";"
    # Turn remaining newlines into ';' and strip spaces and tabs
    cell_text = cell_text.replace('\n', ';').replace(' ', '').replace('\t', '')
    paragraphs_text = paragraphs_text.replace('\n', ';').replace(' ', '').replace('\t', '')
    return cell_text, paragraphs_text
# Test on a résumé file picked at random from the web; alternatively, hold out a test set up front and test on that.
cell_text, paragraphs_text = get_paragraphs_text('/home/aistudio/040f6b50-f0d8-d00a-4d3f-3a067ae125ac.docx')
text_content = cell_text + paragraphs_text
# Set the extraction targets and the path to the custom model weights
my_ie = Taskflow("information_extraction", schema=schema, task_path='./checkpoint/model_best/checkpoint-6000')
pprint(my_ie(text_content))
[2023-01-28 13:16:51,188] [ INFO] - loading configuration file ./checkpoint/model_best/checkpoint-6000/config.json
[2023-01-28 13:16:51,193] [ INFO] - Model config ErnieConfig {
"architectures": [
"UIE"
],
"attention_probs_dropout_prob": 0.1,
"dtype": "float32",
"enable_recompute": false,
"fuse": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 2048,
"model_type": "ernie",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"paddlenlp_version": null,
"pool_act": "tanh",
"task_id": 0,
"task_type_vocab_size": 3,
"type_vocab_size": 4,
"use_task_id": true,
"vocab_size": 40000
}
W0128 13:16:51.722162 20402 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0128 13:16:51.725221 20402 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-01-28 13:16:52,604] [ INFO] - All model checkpoint weights were used when initializing UIE.
[2023-01-28 13:16:52,607] [ INFO] - All the weights of UIE were initialized from the model checkpoint at ./checkpoint/model_best/checkpoint-6000.
If your task is similar to the task the model of the checkpoint was trained on, you can already use UIE for predictions without further training.
[2023-01-28 13:16:52,610] [ INFO] - Converting to the inference model cost a little time.
[2023-01-28 13:17:01,772] [ INFO] - The inference model save in the path:./checkpoint/model_best/checkpoint-6000/static/inference
[2023-01-28 13:17:03,133] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint/model_best/checkpoint-6000'.
[{'出生年月': [{'end': 19,
'probability': 0.399177956089261,
'start': 11,
'text': '1995年10月'}],
'姓名': [{'end': 6,
'probability': 0.9966658831773216,
'start': 3,
'text': '白明明'}],
'学位': [{'end': 1750,
'probability': 0.9845688712125877,
'start': 1748,
'text': '学士'}],
'工作单位': [{'end': 1006,
'probability': 0.4594565730820932,
'start': 997,
'text': '杭州法哲贸易有限公'}],
'毕业时间': [{'end': 1845,
'probability': 0.9165512587924916,
'start': 1838,
'text': '2014.06'}],
'电话': [{'end': 164,
'probability': 0.9193561849170742,
'start': 153,
'text': '13800111000'}],
'籍贯': [{'end': 35,
'probability': 0.9961209590816047,
'start': 29,
'text': '广东省广州市'}]}]
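The extracted spans come with very different probabilities (for example about 0.40 for 出生年月 and 0.46 for 工作单位, versus above 0.99 for 姓名). One possible post-processing step, sketched below, is to drop spans under a chosen cut-off. The function filter_by_probability and the 0.5 threshold are illustrative assumptions, not part of PaddleNLP; the sample data is copied from the output above.

```python
def filter_by_probability(results, threshold=0.5):
    """Keep only spans whose 'probability' is at least `threshold`.

    `results` is a list with one dict per input text, mapping each schema
    label to a list of span dicts. Labels left with no spans are dropped.
    """
    filtered = []
    for doc in results:
        kept = {}
        for label, spans in doc.items():
            good = [span for span in spans if span["probability"] >= threshold]
            if good:
                kept[label] = good
        filtered.append(kept)
    return filtered


# Three entries copied from the prediction output.
sample = [{
    "出生年月": [{"end": 19, "probability": 0.399177956089261,
              "start": 11, "text": "1995年10月"}],
    "姓名": [{"end": 6, "probability": 0.9966658831773216,
            "start": 3, "text": "白明明"}],
    "工作单位": [{"end": 1006, "probability": 0.4594565730820932,
              "start": 997, "text": "杭州法哲贸易有限公"}],
}]

confident = filter_by_probability(sample, threshold=0.5)
print(sorted(confident[0]))  # ['姓名']
```

A cut-off like this is a trade-off: it removes the truncated company name, but it also removes the birth date, which looks plausible despite its low score. The threshold would need tuning on held-out résumés.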
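4 Online deployment
Next, we deploy the model online with the application creation tool provided by AI Studio.
Note: the paddlenlp version in the deployment environment differs from the one used in this project, so the deployment script has to force an upgrade: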
os.system("pip install --upgrade paddlenlp")
See the file untitled.streamlit.py for the deployment script. The result looks like this:
[Figure: the deployed application]
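The deployment script itself is not reproduced here. As a rough illustration of one piece such an app might need, the sketch below flattens Taskflow-style output into one row per extracted span, a shape that is easy to show in a table. The function to_rows and its column names are hypothetical and are not taken from untitled.streamlit.py.

```python
def to_rows(results):
    """Flatten Taskflow-style output into one dict per extracted span."""
    rows = []
    for doc_index, doc in enumerate(results):
        for label, spans in doc.items():
            for span in spans:
                rows.append({
                    "doc": doc_index,
                    "field": label,
                    "text": span["text"],
                    "probability": round(span["probability"], 4),
                })
    return rows


# Two entries copied from the prediction output in section 3.
sample = [{
    "姓名": [{"end": 6, "probability": 0.9966658831773216,
            "start": 3, "text": "白明明"}],
    "籍贯": [{"end": 35, "probability": 0.9961209590816047,
            "start": 29, "text": "广东省广州市"}],
}]

rows = to_rows(sample)
for row in rows:
    print(row)
```

In a Streamlit script, a list of rows like this could then be handed to a table widget such as st.table for display.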
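5 Summary
In this project, we evaluated, tested, and deployed the fine-tuned résumé text extraction model.
In follow-up projects, we will start exploring fine-tuning of an extraction model for résumés in image format, and further improve the online deployment of the résumé extraction application.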