
软件学报 ISSN 1000-9825, CODEN RUXUEW                                        E-mail: jos@iscas.ac.cn
                 2025,36(4):1620−1636 [doi: 10.13328/j.cnki.jos.007219] [CSTR: 32375.14.jos.007219]  http://www.jos.org.cn
                 ©中国科学院软件研究所版权所有.                                                          Tel: +86-10-62562563



面向低资源关系抽取的自训练方法*

郁俊杰¹, 王星², 陈文亮¹, 张民¹

¹(苏州大学 计算机科学与技术学院, 江苏 苏州 215006)
²(腾讯 AI Lab, 广东 深圳 518000)
通信作者: 陈文亮, E-mail: wlchen@suda.edu.cn

摘 要: 自训练是缓解标注数据不足问题的常见方法, 其通常做法是利用教师模型去获取高置信度的自动标注数据作为可靠数据. 然而在低资源场景关系抽取任务上, 该方法不仅存在教师模型泛化能力差的问题, 而且受到关系抽取任务中易混淆关系类别的影响, 导致难以从自动标注数据中有效地识别出可靠数据, 同时产生大量难以利用的低置信度噪音数据. 因此, 提出一种有效利用低置信度数据的自训练方法 ST-LRE (self-training approach for low-resource relation extraction). 该方法一方面基于复述增强的预测方法来加强教师模型筛选可靠数据的能力; 另一方面, 基于部分标注模式从低置信度数据中提炼出可利用的模糊数据. 基于模糊数据的候选类别集合, 提出了基于负标签集合的负向训练方法. 最后, 为了支持可靠数据和模糊数据的融合训练, 提出一种支持正负向训练的联合方法. 在两个广泛使用的关系抽取数据集 SemEval2010 Task-8 和 Re-TACRED 的低资源场景上进行实验, ST-LRE 方法取得显著且一致的提升.
关键词: 自然语言处理; 信息抽取; 关系抽取; 低资源; 自训练
中图法分类号: TP18

中文引用格式: 郁俊杰, 王星, 陈文亮, 张民. 面向低资源关系抽取的自训练方法. 软件学报, 2025, 36(4): 1620–1636. http://www.jos.org.cn/1000-9825/7219.htm
英文引用格式: Yu JJ, Wang X, Chen WL, Zhang M. Self-training Approach for Low-resource Relation Extraction. Ruan Jian Xue Bao/Journal of Software, 2025, 36(4): 1620–1636 (in Chinese). http://www.jos.org.cn/1000-9825/7219.htm

Self-training Approach for Low-resource Relation Extraction

YU Jun-Jie¹, WANG Xing², CHEN Wen-Liang¹, ZHANG Min¹

¹(School of Computer Science and Technology, Soochow University, Suzhou 215006, China)
²(Tencent AI Lab, Shenzhen 518000, China)
Abstract: Self-training, a common strategy for tackling the scarcity of annotated data, typically acquires high-confidence auto-annotated data generated by a teacher model as reliable data. However, in low-resource scenarios of relation extraction (RE) tasks, this approach is hindered by the limited generalization capacity of the teacher model and by the easily confusable relation categories in RE. Consequently, it becomes challenging to identify reliable data among the automatically labeled data, and a large amount of hard-to-exploit, low-confidence noisy data is generated. Therefore, this study proposes a self-training approach for low-resource relation extraction (ST-LRE). On the one hand, the approach strengthens the teacher model's selection of reliable data through paraphrase-enhanced prediction; on the other hand, it extracts usable ambiguous data from the low-confidence data based on a partial-labeling scheme. Considering the candidate categories of the ambiguous data, this study further proposes a negative training method based on sets of negative labels. Finally, a unified method capable of both positive and negative training is proposed for the integrated training of reliable data and ambiguous data. In experiments on the low-resource scenarios of two widely used RE datasets, SemEval2010 Task-8 and Re-TACRED, ST-LRE consistently achieves significant improvements.
                 Key words:  natural language processing; information extraction; relation extraction; low-resource; self-training
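The negative training idea summarized in the abstract (for ambiguous data, penalizing the categories known to be wrong rather than rewarding one positive label) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the softmax classifier setup, and the exact loss form -log(1 - p_k) averaged over a negative label set are assumptions.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def negative_training_loss(logits, negative_label_sets, eps=1e-8):
    """Negative training: for each example, push DOWN the predicted
    probability of every category in its negative label set (the
    complement of the candidate set of an ambiguous example).

    logits: (batch, num_classes) array of classifier scores.
    negative_label_sets: per-example lists of class indices known to be wrong.
    """
    probs = softmax(np.asarray(logits, dtype=float))
    per_example = []
    for p, neg in zip(probs, negative_label_sets):
        # -log(1 - p_k): small when p_k is near 0, large when the model
        # assigns mass to a label that cannot be correct.
        per_example.append(np.mean([-np.log(1.0 - p[k] + eps) for k in neg]))
    return float(np.mean(per_example))
```

Under this sketch, reliable data would still be trained with the usual positive cross-entropy, while ambiguous data contributes only the negative term, so the two can be summed into one joint objective.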


* 基金项目: 国家自然科学基金 (62376177, 61936010)
  收稿时间: 2023-10-10; 修改时间: 2024-01-18; 采用时间: 2024-04-19; jos 在线出版时间: 2024-07-03
  CNKI 网络首发时间: 2024-07-11