Page 347 - Journal of Software (《软件学报》), 2024, Issue 6
软件学报 ISSN 1000-9825, CODEN RUXUEW E-mail: jos@iscas.ac.cn
Journal of Software, 2024, 35(6): 2923−2935 [doi: 10.13328/j.cnki.jos.006927] http://www.jos.org.cn
© Institute of Software, Chinese Academy of Sciences. All rights reserved. Tel: +86-10-62562563
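Nested Entity Recognition Approach in Chinese Medical Text (中文医疗文本中的嵌套实体识别方法)*
YAN Jing-Hui (闫璟辉)¹, ZONG Cheng-Qing (宗成庆)¹,², XU Jin-An (徐金安)¹
¹(School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100091, China)
²(National Laboratory of Pattern Recognition (Institute of Automation, Chinese Academy of Sciences), Beijing 100190, China)
Corresponding author: ZONG Cheng-Qing (宗成庆), E-mail: cqzong@nlpr.ia.ac.cn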
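Abstract (translated from the Chinese): Entity recognition is a key technology for information extraction. Compared with
ordinary text, entity recognition in Chinese medical text often has to handle a large number of nested entities. Earlier
methods tend to ignore the entity nesting rules specific to medical text and apply sequence labeling directly. To address
this, a Chinese entity recognition method that incorporates entity nesting rules is proposed. During training, the method
recasts entity recognition as a joint task of entity boundary recognition and boundary head-tail relation recognition.
During decoding, it filters the decoded results with entity nesting rules summarized from real medical text, so that the
recognition results conform to the way inner and outer entities combine in actual text. Experiments on a public
medical-text entity recognition dataset show that the proposed method significantly outperforms existing methods on
nested entities and improves overall accuracy by 0.5% over the state-of-the-art method.
Key words: entity recognition; Chinese text; medical field; nested entity recognition; boundary recognition
Chinese Library Classification (CLC) number: TP18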
Citation format (Chinese): 闫璟辉, 宗成庆, 徐金安. 中文医疗文本中的嵌套实体识别方法. 软件学报, 2024, 35(6): 2923–2935.
http://www.jos.org.cn/1000-9825/6927.htm
Citation format (English): Yan JH, Zong CQ, Xu JA. Nested Entity Recognition Approach in Chinese Medical Text. Ruan Jian Xue Bao/Journal
of Software, 2024, 35(6): 2923–2935 (in Chinese). http://www.jos.org.cn/1000-9825/6927.htm
Nested Entity Recognition Approach in Chinese Medical Text
YAN Jing-Hui¹, ZONG Cheng-Qing¹,², XU Jin-An¹
¹(School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100091, China)
²(National Laboratory of Pattern Recognition (Institute of Automation, Chinese Academy of Sciences), Beijing 100190, China)
Abstract: Entity recognition is a key technology for information extraction. Compared with ordinary text, entity recognition in Chinese
medical text often faces a large number of nested entities. Previous entity recognition methods often ignore the entity nesting
rules unique to medical text and directly apply sequence labeling. Therefore, a Chinese entity recognition method that
incorporates entity nesting rules is proposed. In training, the method transforms entity recognition into a joint task of entity
boundary recognition and boundary head-tail relation recognition; in decoding, it filters the results using the entity
nesting rules summarized from actual medical text. In this way, the recognition results follow the patterns by which
inner and outer entities combine in actual text. Experiments on a public medical-text entity recognition dataset show
that the proposed method significantly outperforms existing methods on nested entities and improves overall accuracy
by 0.5% over the state-of-the-art method.
Key words: entity recognition; Chinese text; medical field; nested entity recognition; boundary detection
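The decoding-time filtering summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the span format, the entity type names, the rule table ALLOWED_NESTING, and the function filter_by_nesting_rules are all hypothetical, and the actual nesting rules summarized from medical text are not given in this part of the paper. The sketch also assumes that, when a nested pair violates the rules, the inner entity is the one discarded.

```python
from typing import List, Set, Tuple

# A candidate entity as (start, end, type), with `end` exclusive.
# This format is an assumption made for illustration.
Span = Tuple[int, int, str]

# Hypothetical rule table: (outer type, inner type) pairs that may nest.
# Placeholder values only; the paper's real rules are not reproduced here.
ALLOWED_NESTING: Set[Tuple[str, str]] = {
    ("disease", "body"),
    ("symptom", "body"),
}


def contains(outer: Span, inner: Span) -> bool:
    """Return True if `inner` lies within `outer` and the two extents differ."""
    same_extent = (outer[0], outer[1]) == (inner[0], inner[1])
    return outer[0] <= inner[0] and inner[1] <= outer[1] and not same_extent


def filter_by_nesting_rules(spans: List[Span]) -> List[Span]:
    """Keep a span only if every span enclosing it forms an allowed pair."""
    kept = []
    for inner in spans:
        outers = [o for o in spans if contains(o, inner)]
        # A span with no enclosing span is always kept (all([]) is True).
        if all((o[2], inner[2]) in ALLOWED_NESTING for o in outers):
            kept.append(inner)
    return kept
```

Under these assumed rules, a "body" span inside a "disease" span is kept, while a "drug" span inside the same "disease" span is filtered out, because ("disease", "drug") is not in the table.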
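1 Introduction
Research on information extraction from medical text is of great theoretical significance and practical value; in its many
applications in the medical field, such as smart healthcare [1], 大数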
* Received: 2022-09-30; Revised: 2022-11-03; Accepted: 2023-03-03; Published online by jos: 2023-08-23
CNKI first published online: 2023-08-28