Journal of Software (《软件学报》), 2021, Issue 12

Journal of Software (软件学报) ISSN 1000-9825, CODEN RUXUEW                    E-mail: jos@iscas.ac.cn
         Journal of Software, 2021, 32(12): 3782−3801 [doi: 10.13328/j.cnki.jos.006119]   http://www.jos.org.cn
         © Institute of Software, Chinese Academy of Sciences. All rights reserved.       Tel: +86-10-62562563


         Corpus Construction for Chinese Zero Anaphora from Discourse Perspective (篇章视角的汉语零指代语料库构建)∗

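         KONG Fang (孔芳)¹,², GE Hai-Zhu (葛海柱)¹, ZHOU Guo-Dong (周国栋)¹,²
         1 (Laboratory for Natural Language Processing, School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006)
         2 (Jiangsu Key Laboratory of Computer Information Processing Technology, Suzhou, Jiangsu 215006)
         Corresponding author: ZHOU Guo-Dong, E-mail: gdzhou@suda.edu.cn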

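         Abstract: Zero anaphora is pervasive in Chinese and plays an important role in many natural language processing
         tasks, including Chinese-English machine translation, text summarization, and reading comprehension; it has become an
         active research topic in the field. This paper proposes a representation scheme for Chinese zero anaphora from the
         discourse perspective, designed to serve discourse analysis. First, each elementary discourse unit (EDU) is examined to
         decide whether it contains a zero element. Second, zero elements are divided into two types, core and modifier,
         according to the role they play within the EDU. Third, the discourse rhetorical structure tree of a paragraph is taken
         as the basic unit for examining anaphoric relations: according to the relative position of the antecedent and the zero
         element, relations are classified as intra-EDU or inter-EDU, and inter-EDU relations are further divided into four
         categories (entity, event, combination, and other) according to the nature of the antecedent. Finally, based on this
         scheme, the 325 documents shared by the Chinese Treebank (CTB), the connective-driven Chinese Discourse Treebank
         (CDTB), and OntoNotes were annotated to build a Chinese zero anaphora corpus that serves discourse analysis. On the
         one hand, systematic checks are used to show that the proposed scheme is sound and effective and that the resulting
         corpus is of high quality. On the other hand, a complete baseline platform for Chinese zero anaphora resolution is
         built, which verifies from a computational standpoint that the corpus provides the support needed for
         discourse-level research on Chinese zero anaphora.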
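         Key words: zero anaphora; corpus construction; discourse analysis; elementary discourse unit; zero element
         Chinese Library Classification (CLC) number: TP18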
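
The annotation scheme summarized in the abstract can be pictured as a small record type. The following is only a minimal sketch under stated assumptions: the class and field names (`ZeroAnaphoraAnnotation`, `edu_id`, `antecedent_edu_id`, and so on) and the use of integer EDU identifiers are hypothetical and are not taken from the corpus's actual file format. It encodes the three distinctions the abstract names: the role of the zero element (core or modifier), the position of the antecedent (intra-EDU or inter-EDU), and, for inter-EDU relations only, the antecedent category (entity, event, combination, other).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ZeroElementRole(Enum):
    """Role of the zero element inside its EDU (two types in the abstract)."""
    CORE = "core"
    MODIFIER = "modifier"


class RelationScope(Enum):
    """Position of the antecedent relative to the zero element."""
    INTRA_EDU = "intra-EDU"
    INTER_EDU = "inter-EDU"


class AntecedentCategory(Enum):
    """Four-way split that the abstract applies to inter-EDU relations."""
    ENTITY = "entity"
    EVENT = "event"
    COMBINATION = "combination"
    OTHER = "other"


@dataclass(frozen=True)
class ZeroAnaphoraAnnotation:
    # Hypothetical fields: integer EDU ids are an assumption of this sketch.
    edu_id: int                                  # EDU containing the zero element
    role: ZeroElementRole
    antecedent_edu_id: Optional[int] = None      # EDU containing the antecedent, if annotated
    category: Optional[AntecedentCategory] = None

    def __post_init__(self) -> None:
        # The category is only defined for inter-EDU relations.
        if self.category is not None and self.scope is not RelationScope.INTER_EDU:
            raise ValueError("antecedent category applies only to inter-EDU relations")

    @property
    def scope(self) -> Optional[RelationScope]:
        """Derive intra-/inter-EDU from the two EDU ids; None if no antecedent is recorded."""
        if self.antecedent_edu_id is None:
            return None
        if self.antecedent_edu_id == self.edu_id:
            return RelationScope.INTRA_EDU
        return RelationScope.INTER_EDU
```

Deriving `scope` from the two EDU identifiers, rather than storing it, keeps a record from disagreeing with itself. Whether the corpus stores the scope explicitly is not stated in the abstract.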

         Citation format (Chinese): 孔芳,葛海柱,周国栋.篇章视角的汉语零指代语料库构建.软件学报,2021,32(12):3782−3801.
         http://www.jos.org.cn/1000-9825/6119.htm
         Citation format (English): Kong F, Ge HZ, Zhou GD. Corpus construction for Chinese zero anaphora from discourse perspective. Ruan Jian
         Xue Bao/Journal of Software, 2021,32(12):3782−3801 (in Chinese). http://www.jos.org.cn/1000-9825/6119.htm
         Corpus Construction for Chinese Zero Anaphora from Discourse Perspective

         KONG Fang¹,², GE Hai-Zhu¹, ZHOU Guo-Dong¹,²
         1 (Laboratory for Natural Language Processing, School of Computer Science and Technology, Soochow University, Suzhou 215006, China)
         2 (Jiangsu Key Laboratory of Computer Information Processing Technology, Suzhou 215006, China)
         Abstract: As a common phenomenon in Chinese, zero anaphora plays an important role in many natural language processing tasks, such
         as machine translation, text summarization and machine reading comprehension. Currently, it has become a research hotspot in the field of
         natural language processing. Towards better discourse analysis, this study proposes a representation architecture for Chinese zero
         anaphora from the discourse perspective. Firstly, the elementary discourse unit is taken as the investigation object to determine whether it
         contains zero elements. Secondly, according to the roles of zero elements in the elementary discourse unit, the zero elements are divided
         into two categories: the core type and the modifier type. Thirdly, the discourse rhetorical tree of the paragraph is used as the basic unit to
         evaluate the Chinese zero coreferential relationship. According to the positional relationship between the antecedent and the zero element,
         the coreferential relationship is classified into two types, i.e., Intra-EDU and Inter-EDU. After that, for Inter-EDU type, the coreferential

            ∗  Foundation item: National Natural Science Foundation of China (61876118, 61751206); A Project Funded by the Priority Academic
         Program Development of Jiangsu Higher Education Institutions (PAPD)
              Received 2020-05-15; Revised 2020-06-22; Accepted 2020-07-17