《软件学报》 (Journal of Software), 2021, Issue 11

3352 Journal of Software 软件学报 Vol.32, No.11, November 2021

developers according to the existing code lines, then it will not only help developers complete their development tasks better, but also
improve the efficiency of software development. However, most existing approaches focus only on code repair or completion and
seldom consider how to recommend code lines based on contextual information. To solve this problem, a feasible
solution is to use deep learning to extract the relevant context factors of code lines by mining hidden context information
from the existing massive source code data. Therefore, this study proposes a novel deep-learning-based approach for onsite programming.
In this approach, the contextual relationships among code lines are learned from existing large-scale code data sets, and the Top-N
code lines are then recommended to programmers. The approach uses the RNN encoder-decoder framework, which encodes several lines
of code into a vector carrying context-aware information and then obtains the Top-N new code lines from this context vector. Finally, the
approach is empirically evaluated on a large-scale code line data set collected from an open source platform. The results show
that the proposed approach can recommend relevant code lines to developers according to the existing context, with accuracy
approaching 60%. In addition, the MRR value is about 0.3, indicating that the recommended items are ranked near the top of the N
recommended results.
Key words: onsite programming; source code context; code line; deep learning; RNN Encoder-Decoder
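
The two reported metrics can be made concrete with a small worked example. The sketch below assumes the standard definitions of Top-N accuracy (the share of queries whose expected line appears among the first N recommendations) and mean reciprocal rank (MRR); the exact definitions used in the evaluation may differ in detail. The function names and the sample data are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def top_n_accuracy(ranked: Sequence[Sequence[str]],
                   expected: Sequence[str], n: int) -> float:
    """Share of queries whose expected line is among the first n recommendations."""
    hits = sum(1 for recs, gold in zip(ranked, expected) if gold in recs[:n])
    return hits / len(expected)


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]],
                         expected: Sequence[str], n: int) -> float:
    """Average of 1/rank of the expected line; a miss within the top n counts as 0."""
    total = 0.0
    for recs, gold in zip(ranked, expected):
        top = list(recs[:n])
        if gold in top:
            total += 1.0 / (top.index(gold) + 1)  # ranks start at 1
    return total / len(expected)


# Hypothetical recommendation lists for four contexts (not data from the paper).
ranked = [
    ["int i = 0;", "i++;", "return i;"],          # expected line at rank 1
    ["x = 1;", "y = 2;", "return x + y;"],        # expected line at rank 3
    ["foo();", "bar();", "baz();"],               # expected line missing
    ["list.clear();", "list.add(x);", "x = 0;"],  # expected line at rank 2
]
expected = ["int i = 0;", "return x + y;", "qux();", "list.add(x);"]

accuracy = top_n_accuracy(ranked, expected, n=3)      # 3 of 4 hits -> 0.75
mrr = mean_reciprocal_rank(ranked, expected, n=3)     # (1 + 1/3 + 0 + 1/2) / 4
print(f"Top-3 accuracy = {accuracy:.2f}, MRR = {mrr:.3f}")
```

Under these definitions, accuracy only records whether the expected line was recommended at all, while MRR also rewards placing it earlier in the list.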

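As functional requirements for software grow richer, software systems are becoming larger and structurally more complex. During
development, programmers are likely to run into code that is hard to write, such as how to implement an uncommon feature. In addition,
this heavy demand places ever higher requirements on the accuracy and efficiency of programming. The programming site contains a large
amount of information related to the current development task, such as code context and the developer's intent. If developers can make
full use of this existing information to obtain likely candidates for the current code line, they can consult, adapt, or directly reuse
them, which greatly helps improve the accuracy and efficiency of programming. This is also an important characteristic of intelligent
software development.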
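In practice, developers usually turn to search engines to find the code they need [1]. However, searching requires a precise functional
description [2], whereas a single code line does not constitute a complete function. Moreover, because of the complexity and diversity
of programming languages [3], such as the variety of data types and structures and the differences among developer-defined variables,
simple text matching rarely works well on code lines [4,5], so the query results are often unsatisfactory. Existing approaches usually
perform code repair or code completion [6]. Such work operates at a finer granularity and places strong restrictions on automatic
completion: it mainly completes or recommends specific APIs or already-defined variables [7,8], and cannot recommend complete code lines.
To solve this problem, a feasible solution is to use deep learning on the existing massive source code data to extract the relevant
context factors of code lines and mine hidden context information, providing a basis for accurate recommendation [9]. Inspired by this,
this paper proposes a deep-learning-based code line recommendation method with deep awareness of the programming site context (deep
awareness for code line recommendation, DA4CLR for short), where the programming site refers to the set of elements in software
production that are related to the current coding activity. The method uses the source code data and task data already available at the
programming site to predict and recommend the current code line. With a deep learning model, it learns implicit contextual associations
from a preprocessed large-scale data set of code line contexts, which serve as the basis for recommendation.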
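DA4CLR uses the RNN Encoder-Decoder framework [10]. This is a Sequence-to-Sequence framework whose encoder-decoder structure is
particularly well suited to Sequence2Sequence problems [5]. The encoder encodes the input sequence into a fixed-length context vector
that carries information about the input sequence, and the decoder produces the output sequence from that context vector. To evaluate
the effectiveness of the recommendation method, this paper collects several million code lines with their contexts from highly followed
GitHub projects and from a number of well-regarded jar packages, and selects part of the data as the test set. The method is tested on
two metrics, accuracy and MRR. The test results demonstrate the feasibility and effectiveness of the code line recommendation method
with deep awareness of the programming site context. Considering the difficulties and shortcomings of existing research, the main
contributions of this paper are the following three.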
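
The encoder's role described above, turning an input sequence of any length into a fixed-length context vector, can be illustrated with a toy recurrent encoder. This is a minimal sketch and not the DA4CLR model: it assumes a plain tanh recurrent cell, whitespace tokenization, and random untrained weights, and it omits the decoder entirely. The class name `ToyRNNEncoder` and all sizes are hypothetical.

```python
import math
import random
from typing import List


def make_matrix(rows: int, cols: int, rng: random.Random) -> List[List[float]]:
    """Random weights in [-0.5, 0.5]; a real model would learn these from data."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToyRNNEncoder:
    """Plain tanh recurrent encoder (illustrative only, untrained)."""

    def __init__(self, vocab: List[str], embed_size: int = 8,
                 hidden_size: int = 16, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.index = {tok: i for i, tok in enumerate(vocab)}
        self.unk = self.index["<unk>"]
        self.hidden_size = hidden_size
        self.embed = make_matrix(len(vocab), embed_size, rng)    # one row per token
        self.w_in = make_matrix(hidden_size, embed_size, rng)    # input -> hidden
        self.w_rec = make_matrix(hidden_size, hidden_size, rng)  # hidden -> hidden

    def encode(self, tokens: List[str]) -> List[float]:
        """Fold the token sequence into one hidden state of fixed size."""
        h = [0.0] * self.hidden_size
        for tok in tokens:
            x = self.embed[self.index.get(tok, self.unk)]
            # h_t = tanh(W_in x_t + W_rec h_{t-1})
            h = [
                math.tanh(
                    sum(w * xi for w, xi in zip(self.w_in[j], x))
                    + sum(w * hi for w, hi in zip(self.w_rec[j], h))
                )
                for j in range(self.hidden_size)
            ]
        return h  # the final hidden state plays the role of the context vector


# Hypothetical context lines, tokenized on whitespace for simplicity.
short = "int i = 0 ;".split()
long = "for ( int i = 0 ; i < n ; i ++ ) { sum += i ; }".split()
vocab = sorted(set(short + long)) + ["<unk>"]

encoder = ToyRNNEncoder(vocab)
c_short = encoder.encode(short)
c_long = encoder.encode(long)
print(len(short), len(long), len(c_short), len(c_long))
```

Both inputs, although they differ in length, are mapped to vectors of the same size (`hidden_size`). In an encoder-decoder model, a decoder would then generate or score candidate output sequences conditioned on that vector.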
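1) A data quality assessment method for open source code big data is proposed. From different dimensions and levels of granularity,
it analyzes the quality issues at each step of splitting a Java project into multiple Java methods, and finally gives a quality
assessment result for each individual method block.
2) Using a deep learning model, general latent code line context patterns are learned from an existing large-scale open source data
set of code lines with context and are applied to the programming site, where the onsite context is used to recommend likely
current code lines to developers.
3) Developer intent is captured from the task data of the programming site, and semantic similarity matching is used to adjust the
priority of the recommended results, so as to better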
