
Journal of Software (软件学报), ISSN 1000-9825, CODEN RUXUEW    E-mail: jos@iscas.ac.cn
Journal of Software, 2024, 35(6): 2923−2935 [doi: 10.13328/j.cnki.jos.006927]    http://www.jos.org.cn
© Institute of Software, Chinese Academy of Sciences. All rights reserved.    Tel: +86-10-62562563



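Nested Entity Recognition Approach in Chinese Medical Text (中文医疗文本中的嵌套实体识别方法)*

YAN Jing-Hui (闫璟辉)¹, ZONG Cheng-Qing (宗成庆)¹,², XU Jin-An (徐金安)¹

¹ (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100091, China)
² (National Laboratory of Pattern Recognition (Institute of Automation, Chinese Academy of Sciences), Beijing 100190, China)
Corresponding author: ZONG Cheng-Qing, E-mail: cqzong@nlpr.ia.ac.cn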

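Abstract: Entity recognition is a key technology for information extraction. Compared with general text, Chinese medical text contains a large number of nested entities. Previous methods often ignore the entity nesting rules specific to medical text and directly apply sequence labeling. To address this, a Chinese entity recognition method that incorporates entity nesting rules is proposed. During training, the method casts entity recognition as a joint task of entity boundary recognition and recognition of the head-tail relations between boundaries. During decoding, it filters the decoded results with entity nesting rules summarized from real medical text, so that the recognized entities conform to the way inner and outer entities combine in actual text. Experiments on a public medical text entity recognition dataset show that the proposed method significantly outperforms existing methods on nested entities and improves overall accuracy by 0.5% over the state-of-the-art method.
Keywords: entity recognition; Chinese text; medical field; nested entity recognition; boundary recognition
CLC number: TP18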

Citation format (Chinese): 闫璟辉, 宗成庆, 徐金安. 中文医疗文本中的嵌套实体识别方法. 软件学报, 2024, 35(6): 2923–2935. http://www.jos.org.cn/1000-9825/6927.htm
Citation format (English): Yan JH, Zong CQ, Xu JA. Nested Entity Recognition Approach in Chinese Medical Text. Ruan Jian Xue Bao/Journal of Software, 2024, 35(6): 2923–2935 (in Chinese). http://www.jos.org.cn/1000-9825/6927.htm

                 Nested Entity Recognition Approach in Chinese Medical Text
YAN Jing-Hui¹, ZONG Cheng-Qing¹,², XU Jin-An¹
¹ (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100091, China)
² (National Laboratory of Pattern Recognition (Institute of Automation, Chinese Academy of Sciences), Beijing 100190, China)
Abstract: Entity recognition is a key technology for information extraction. Compared with ordinary text, entity recognition in Chinese medical text often faces a large number of nested entities. Previous entity recognition methods often ignore the entity nesting rules unique to medical text and directly use sequence labeling methods. Therefore, a Chinese entity recognition method that incorporates entity nesting rules is proposed. In the training process, this method transforms the entity recognition task into a joint training task of entity boundary recognition and boundary head-tail relationship recognition; in the decoding process, it filters the results with entity nesting rules summarized from actual medical text. In this way, the recognition results are in line with the composition patterns of nested combinations of inner and outer entities in actual text. Experiments on a public medical text entity recognition dataset show that the proposed method is significantly superior to existing methods in nested entity recognition performance, and that overall accuracy is increased by 0.5% compared with the state-of-the-art methods.
                 Key words:  entity recognition; Chinese text; medical field; nested entity recognition; boundary detection
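
The decoding step summarized in the abstract (pair predicted boundaries through their head-tail relation, then filter nested combinations with rules) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the entity types, the `ALLOWED_NESTING` rule set, the score threshold, and the greedy keep-highest-score strategy are all hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int    # index of the first character (inclusive)
    end: int      # index of the last character (inclusive)
    label: str    # predicted entity type
    score: float  # head-tail link score for this boundary pair


# Hypothetical rules: (outer type, inner type) pairs that may nest.
ALLOWED_NESTING = {("symptom", "body"), ("disease", "body")}


def contains(outer: Span, inner: Span) -> bool:
    """True if `inner` lies strictly inside the extent of `outer`."""
    same = (outer.start, outer.end) == (inner.start, inner.end)
    return outer.start <= inner.start and inner.end <= outer.end and not same


def build_candidates(starts, ends, link_scores, threshold=0.5):
    """Pair start/end boundaries whose head-tail link score passes the threshold."""
    spans = []
    for s in starts:
        for e in ends:
            if e < s:
                continue  # an entity cannot end before it starts
            label, score = link_scores.get((s, e), (None, 0.0))
            if label is not None and score >= threshold:
                spans.append(Span(s, e, label, score))
    return spans


def filter_by_nesting_rules(spans, allowed=ALLOWED_NESTING):
    """Greedily keep spans by descending score, rejecting disallowed nestings."""
    kept = []
    for span in sorted(spans, key=lambda sp: -sp.score):
        ok = True
        for k in kept:
            if contains(k, span):
                ok = (k.label, span.label) in allowed
            elif contains(span, k):
                ok = (span.label, k.label) in allowed
            if not ok:
                break
        if ok:
            kept.append(span)
    return sorted(kept, key=lambda sp: (sp.start, sp.end))


# Toy input for a six-character sentence (hypothetical model outputs).
starts, ends = [0, 4], [3, 5]
link_scores = {
    (0, 3): ("body", 0.9),
    (0, 5): ("symptom", 0.8),
    (4, 5): ("disease", 0.6),  # nested in the symptom span, not allowed by the rules
}
result = filter_by_nesting_rules(build_candidates(starts, ends, link_scores))
```

In this toy input the `body` span nested inside the `symptom` span survives because that pair is in the rule set, while the `disease` span inside the `symptom` span is discarded. Partially overlapping spans are not handled in this sketch.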

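1   Introduction

Research on information extraction from medical text is of great theoretical significance and practical value. In many applications in the medical field, such as smart healthcare [1], big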


*    Received: 2022-09-30; Revised: 2022-11-03; Accepted: 2023-03-03; JOS published online: 2023-08-23
     CNKI first published online: 2023-08-28