Journal of Software (Ruan Jian Xue Bao) ISSN 1000-9825, CODEN RUXUEW, E-mail: jos@iscas.ac.cn
2025,36(4):1620−1636 [doi: 10.13328/j.cnki.jos.007219] [CSTR: 32375.14.jos.007219] http://www.jos.org.cn
© Copyright by Institute of Software, Chinese Academy of Sciences. Tel: +86-10-62562563
Self-training Approach for Low-resource Relation Extraction*
YU Jun-Jie 1, WANG Xing 2, CHEN Wen-Liang 1, ZHANG Min 1
1 (School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China)
2 (Tencent AI Lab, Shenzhen, Guangdong 518000, China)
Corresponding author: CHEN Wen-Liang, E-mail: wlchen@suda.edu.cn
Abstract: Self-training is a common approach to alleviating the shortage of annotated data: a teacher model is used to collect high-confidence automatically labeled data as reliable data. In low-resource relation extraction, however, this approach not only suffers from the poor generalization ability of the teacher model but is also affected by the easily confused relation categories of the task, which makes it difficult to effectively identify reliable data among the automatically labeled data and produces a large amount of low-confidence noisy data that is hard to exploit. This study therefore proposes ST-LRE (self-training approach for low-resource relation extraction), a self-training method that makes effective use of low-confidence data. On the one hand, a paraphrase-enhanced prediction method strengthens the teacher model's ability to select reliable data; on the other hand, a partial-annotation scheme distills usable ambiguous data from the low-confidence data. Based on the candidate label sets of the ambiguous data, a negative training method built on sets of negative labels is proposed. Finally, to support the joint training of reliable data and ambiguous data, a unified method that accommodates both positive and negative training is proposed. Experiments in low-resource settings of two widely used relation extraction datasets, SemEval2010 Task-8 and Re-TACRED, show that ST-LRE achieves significant and consistent improvements.
Key words: natural language processing; information extraction; relation extraction; low-resource; self-training
CLC number: TP18
Citation format (in Chinese): 郁俊杰, 王星, 陈文亮, 张民. 面向低资源关系抽取的自训练方法. 软件学报, 2025, 36(4): 1620–1636. http://www.jos.org.cn/1000-9825/7219.htm
Citation format (in English): Yu JJ, Wang X, Chen WL, Zhang M. Self-training Approach for Low-resource Relation Extraction. Ruan Jian Xue Bao/Journal of Software, 2025, 36(4): 1620–1636 (in Chinese). http://www.jos.org.cn/1000-9825/7219.htm
Self-training Approach for Low-resource Relation Extraction
YU Jun-Jie 1, WANG Xing 2, CHEN Wen-Liang 1, ZHANG Min 1
1 (School of Computer Science and Technology, Soochow University, Suzhou 215006, China)
2 (Tencent AI Lab, Shenzhen 518000, China)
Abstract: Self-training, a common strategy for tackling the scarcity of annotated data, typically treats high-confidence automatically annotated data produced by a teacher model as reliable data. However, in low-resource scenarios of relation extraction (RE), this approach suffers not only from the limited generalization ability of the teacher model but also from the easily confused relation categories of the task. Consequently, it becomes difficult to effectively identify reliable data among the automatically labeled data, and a large amount of low-confidence noisy data that is hard to exploit is generated. Therefore, this study proposes a self-training approach for low-resource relation extraction (ST-LRE). On the one hand, the approach strengthens the teacher model's ability to select reliable data with a paraphrase-enhanced prediction method; on the other hand, it distills usable ambiguous data from the low-confidence data based on a partial-annotation scheme. Considering the candidate label sets of the ambiguous data, this study proposes a negative training method based on sets of negative labels. Finally, a unified method supporting both positive and negative training is proposed for the joint training of reliable data and ambiguous data. In the experiments, ST-LRE achieves significant and consistent improvements in low-resource scenarios of two widely used RE datasets, SemEval2010 Task-8 and Re-TACRED.
Key words: natural language processing; information extraction; relation extraction; low-resource; self-training
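To make the abstract's central mechanism concrete before the main text, the following is a minimal sketch of a negative-training objective over a set of negative labels, written in PyTorch-style Python. The function name, tensor shapes, and the specific -log(1 - p) form are illustrative assumptions for exposition, not the paper's exact formulation: for an ambiguous instance whose candidate label set is known, every relation outside that set is treated as a negative label and the classifier is pushed to assign it low probability, while reliable data keeps the usual positive cross-entropy loss; the joint scheme described in the abstract combines the two.

import torch

def negative_training_loss(logits, negative_label_mask):
    # logits: (batch, num_relations) raw classifier scores.
    # negative_label_mask: (batch, num_relations), 1 for relations OUTSIDE the
    #   candidate label set of an ambiguous instance (assumed negative labels).
    mask = negative_label_mask.float()
    probs = torch.softmax(logits, dim=-1)
    # For each negative label k, minimize -log(1 - p_k), moving probability
    # mass away from relations the instance is assumed not to express.
    neg_log = -torch.log((1.0 - probs).clamp(min=1e-7))
    num_neg = mask.sum(dim=-1).clamp(min=1.0)
    return ((neg_log * mask).sum(dim=-1) / num_neg).mean()

# Usage sketch: reliable data keeps standard cross-entropy on its single label;
# ambiguous data contributes this negative loss over its non-candidate labels.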
* Funding: National Natural Science Foundation of China (62376177, 61936010)
Received: 2023-10-10; Revised: 2024-01-18; Accepted: 2024-04-19; Published online (JoS): 2024-07-03
Published online first on CNKI: 2024-07-11