Journal of Software (《软件学报》), 2021, Issue 7

Niu Chang-An, et al.: Automatic code comment generation model based on pointer-generator network                                                      2143


                 3 (College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)
                 Abstract:    Code comments play an important role in software quality assurance: they improve the readability of source code and make it easier to understand, reuse, and maintain. However, for various reasons, developers sometimes do not add the necessary comments, so that developers waste a great deal of time understanding the source code, which greatly reduces the efficiency of software maintenance. In recent years, much work has used machine learning to automatically generate comments for source code. These methods extract information such as the code sequence and structure and then use a sequence-to-sequence (seq2seq) neural model to generate the corresponding comments, achieving sound results. However, Hybrid-DeepCom, the state-of-the-art code comment generation model, is still deficient in two respects. First, it may break the code structure during preprocessing, which makes the input information inconsistent across instances and degrades model learning. Second, owing to the limitations of the seq2seq model, it cannot generate out-of-vocabulary (OOV) words in the comment. For example, variable names, method names, and other identifiers that appear very infrequently in the source code are usually OOV words; without them, comments are difficult to understand. To solve these problems, this study proposes an automatic comment generation model named CodePtr. On the one hand, a complete source code encoder is added to solve the problem of the code structure being broken; on the other hand, a pointer-generator network module is introduced so that the model switches automatically between generating a word and copying a word at each decoding step. In particular, when it encounters an identifier that appears only a few times in the input, the model can copy it directly to the output, which solves the problem of being unable to generate OOV words. Finally, this study compares CodePtr with Hybrid-DeepCom experimentally on large data sets. The results show that, with a vocabulary size of 30 000, CodePtr improves the translation performance metrics by 6% on average and improves the handling of OOV words by nearly 50%, which demonstrates the effectiveness of the CodePtr model.
                 Key words:    software quality assurance; source code comments generation; neural network; out-of-vocabulary word; pointer-generator network
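
The switch between generating a word and copying a word that the abstract describes can be illustrated with a small sketch. It assumes the standard pointer-generator formulation, in which the final probability of a word is P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on source positions holding w); whether CodePtr uses exactly this form is not stated on this page. The names `build_vocab` and `final_distribution`, the token `<unk>`, and all numbers below are hypothetical and chosen only for illustration.

```python
from collections import Counter

UNK = "<unk>"  # placeholder that a plain seq2seq model emits for OOV words


def build_vocab(token_lists, max_size):
    """Keep the max_size most frequent tokens; every other token is OOV."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return [UNK] + [tok for tok, _ in counts.most_common(max_size)]


def final_distribution(p_gen, vocab, p_vocab, source_tokens, attention):
    """Mix the generation and copy distributions over an extended vocabulary.

    p_gen          probability of generating from the fixed vocabulary
    p_vocab        generation probabilities, aligned with vocab
    attention      attention weights, aligned with source_tokens
    """
    # Generation part: scale the fixed-vocabulary distribution by p_gen.
    dist = {word: p_gen * p for word, p in zip(vocab, p_vocab)}
    # Copy part: add the attention mass of every source position, so source
    # tokens missing from the vocabulary still receive probability.
    for tok, weight in zip(source_tokens, attention):
        dist[tok] = dist.get(tok, 0.0) + (1.0 - p_gen) * weight
    return dist


# Toy decoding step: the identifier "maxRetryCount" is not in the vocabulary.
vocab = [UNK, "return", "the", "value"]
p_vocab = [0.4, 0.1, 0.3, 0.2]
source_tokens = ["return", "maxRetryCount"]
attention = [0.1, 0.9]

dist = final_distribution(0.2, vocab, p_vocab, source_tokens, attention)
best_word = max(dist, key=dist.get)
```

In this toy step the low value of `p_gen` shifts most of the probability mass to the copy distribution, so the rare identifier can be selected even though the fixed vocabulary could only have produced `<unk>` in its place.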

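                    Readability is one of the important quality attributes of software source code, and it is also significantly related to other quality attributes such as reusability, maintainability, reliability, complexity, and portability. Early research showed that code comments improved the readability of the banker's algorithm [1]. Tashtoush et al. [2] quantified the impact of numerous code features on source code readability and found that comments rank third in their positive effect on readability, behind only meaningful names and consistency. Code comments thus play an important role in helping developers understand source code: they allow developers and maintainers to understand source code faster and therefore to reuse and maintain it more conveniently and efficiently. Consequently, corresponding comments are indispensable in high-quality code.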
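                    Although more and more companies and organizations urge or require developers to follow coding conventions and add comments to every piece of code, the vast majority of code still lacks corresponding comments for a variety of reasons, such as authors' lack of awareness, lack of constraints, or time pressure. Research shows that developers still spend about 59% of their time on program comprehension during software development and maintenance [3]. In addition, Fluri et al. [4] studied how code and comments co-evolve over time in three open-source systems, ArgoUML, Azureus, and JDT Core, and found that code and comments rarely co-evolve and that newly added code is hardly ever commented. This indicates that, even in source code that has comments, many comments are outdated. Much current source code therefore has no corresponding correct comments, and automatically generating comments for such code has become very meaningful work. In recent years, researchers have been trying to generate source code comments automatically with machine learning and related methods.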
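                    Automatic code comment generation, also called code summarization (code summary), refers to producing a textual description of source code. The key is to translate source code written in a programming language into natural language, which requires describing not only the functionality of the source code but also the developer's design intent when writing it [5]. Hu et al. [6] formulated automatic code comment generation as a machine translation task. Early approaches mostly adopted information retrieval (IR) algorithms [7-10]. In recent years, with the development of deep learning, NLP algorithms based on deep neural networks [6,11-17] have achieved notable results, and most of these models are based on an RNN framework called seq2seq [18]. The seq2seq model is widely used in NLP fields such as machine translation, dialogue systems, and automatic summarization. It usually consists of two RNNs, an encoder and a decoder: the encoder extracts a fixed-length feature vector from the input sequence, and this feature vector is then fed into the decoder, which decodes it and outputs the most probable sentence.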
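
The encoder-decoder flow described above can be sketched with a toy model. This is only an illustration under stated assumptions: the class `ToySeq2Seq`, its simple tanh recurrence, the hidden size, and the random untrained weights are all hypothetical, and greedy decoding is used as a simple stand-in for searching for the most probable sentence. The point is that inputs of any length are compressed into a vector of the same fixed size, from which the decoder emits one word per step.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToySeq2Seq:
    """Untrained toy model: its output is meaningless, only the data flow matters."""

    def __init__(self, src_vocab, tgt_vocab, hidden=8, seed=0):
        rng = random.Random(seed)
        self.src_vocab, self.tgt_vocab, self.hidden = src_vocab, tgt_vocab, hidden
        self.emb_src = rand_matrix(len(src_vocab), hidden, rng)
        self.emb_tgt = rand_matrix(len(tgt_vocab), hidden, rng)
        self.w_enc = rand_matrix(hidden, hidden, rng)
        self.w_dec = rand_matrix(hidden, hidden, rng)
        self.w_out = rand_matrix(len(tgt_vocab), hidden, rng)

    @staticmethod
    def _step(weights, state, embedding):
        # Simple recurrence: new_state = tanh(W * state + embedding).
        return [math.tanh(a + b) for a, b in zip(matvec(weights, state), embedding)]

    def encode(self, tokens):
        """Fold a token sequence of any length into one fixed-length vector."""
        state = [0.0] * self.hidden
        for tok in tokens:
            state = self._step(self.w_enc, state, self.emb_src[self.src_vocab.index(tok)])
        return state

    def decode(self, state, max_len=5):
        """Greedy decoding: pick the highest-scoring target word at each step."""
        output, prev = [], self.tgt_vocab.index("<s>")
        for _ in range(max_len):
            state = self._step(self.w_dec, state, self.emb_tgt[prev])
            scores = matvec(self.w_out, state)
            # The argmax of the raw scores equals the argmax of their softmax.
            prev = max(range(len(scores)), key=scores.__getitem__)
            if self.tgt_vocab[prev] == "</s>":
                break
            output.append(self.tgt_vocab[prev])
        return output


src_vocab = ["int", "add", "(", ")", "a", "b", "return", "+"]
tgt_vocab = ["<s>", "</s>", "adds", "two", "numbers"]
model = ToySeq2Seq(src_vocab, tgt_vocab)

short_vec = model.encode(["return", "a"])
long_vec = model.encode(["int", "add", "(", "a", "b", ")", "return", "a", "+", "b"])
comment = model.decode(long_vec)
```

Because `decode` can only choose among the entries of `tgt_vocab`, a word outside that list can never appear in the output, which is the limitation the following paragraph turns to.
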
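                    Although NLP algorithms based on deep neural networks have achieved notable results, they share a shortcoming that is hard to avoid: a vocabulary must be built from the training data set when the model is trained, and before the training data are fed into the neural network, every word is converted, within that vocabulary, into its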