Page 325 - 《软件学报》2021年第7期
P. 325
李玫 等:面向代码相似性检测的相似哈希改进方法 2243
3 (School of Software and Microelectronics, Peking University, Beijing 102600, China)
4 (Zhejiang Laboratory, Hangzhou 311121, China)
Abstract: Code similarity detection is one of the basic tasks in software engineering. It plays an effective and fundamental role in
plagiarism, software licensing violation, software reuse analysis, and vulnerability discovery. With the popularization of open source
software, open source code has been frequently applied to multiple areas, bringing new challenges to traditional code similarity detection
methods.Some existing detection methods based on lexical, grammar, and semantics have problems such as high computational
complexity, dependence on analytical tools, high resource consumption, poor portability, having a large number of comparison candidates,
and so on. Simhash-based code similarity detection algorithm reduces the dimension of the code to a fingerprint, which can realize fast
near-duplicate file retrieval on a large dataset. It controls the similarity of matched results through the Hamming distance threshold. This
study verifies existed simhash algorithm with line granularity through experiments, and discovers the line coverage problem in large-scale
datasets. Inspired by the idea of TF-IDF algorithm, a language-based line-filtering optimization method is proposed to deal with it. Line
sequences of code files is filtered through line filters in various languages to eliminate the impact of lines that appear frequently but
contain less semantic information on the results. After a series of comparative experiments, this study verifies that the enhanced method
always achieves high precision with Hamming distance threshold set from 0 to 8. Compared to the method before enhancement, the
proposed method improves the precision by 98.6% and 52.2% on two different datasets with threshold set to 8. Based on the large-scale
code database built from 386 486 112 files in 1.3 million open source projects, it is verified that the proposed method can, keeping the
high precision of 97%, efficiently detect similar files with an average speed of 0.43s per file.
Key words: code similarity detection; code homology analyze; big data; simhash; code fingerprint generation
现如今,随着软件代码开源化的日益普及,开源代码量正以光速增长.无论是在企业还是科研单位,越来越
多的开发者选择复制粘贴已有代码以提高软件开发效率.各大开源社区的蓬勃发展,吸引了无数用户共享共建
开源项目,也有无数衍生软件随之产生,作用于生产生活与科研工作中.开源代码的引入主要有以下两点优势:
一方面,对现有代码的直接使用大幅度提高了生产效率,降低了成本;另一方面,著名的开源项目通常由国内外
优秀开发者、公司进行开发和维护,软件质量高且结构规范.
然而,随着软件的不断更新,软件功能不断增加,这些重复代码和克隆代码对软件质量、可用性和可维护性
的负面影响愈发突显.从开源项目引入的代码降低了软件开发人员对软件系统整体的理解和把控,外来代码与
软件系统本身代码之间可能会出现冲突,开源代码中的漏洞也可能会随着代码的复制被引入到项目中,带来安
全隐患.针对这一问题,研究人员通常使用代码相似性检测技术来检测软件工程中的相似代码.
自 20 世纪 70 年代起,学术界涌现了大量的代码相似性检测工具和方法,广泛应用于代码克隆检测、软件
许可证违反检测、软件抄袭检测、漏洞缺陷发现等方面.随着开源代码量的不断增加,代码相似性检测的大规
模化已成为必然趋势.大规模相似检测除了可以帮助开发人员管理项目中的开源代码外,还可用于发现软件系
统中组件来源、进行代码搜索、研究组件引用频率等,有助于保证软件质量,降低开发与维护的成本.目前常用
的代码相似性检测方法包括:基于指标(metrics-based)、基于文本(text-based)、基于词法(token-based)、基于树
[1]
(tree-based)和基于图(graph-based)这 5 个层次 .出于对规模和效率上的考量,本文在基于文本的行粒度相似哈
希算法的基础上进行了优化,提出了分语言行筛选的相似哈希检测方法.该方法一方面延续了相似哈希高效相
似检索的特性,将源代码预处理后映射为一个二进制数(即代码的指纹),以实现对代码文件的降维和索引,加速
大规模库的构建;另一方面结合不同语言代码的特性,排除了常见行对指纹生成的影响,使得指纹结果更能够体
现代码特性,以大幅度地提高相似检测的精确度.该方法具有易于实现、部署简单、无其他词法或语法分析器
依赖、构建效率高以及语言迁移成本低等特点,与此同时,其高精确度的特点保证了结果中错误匹配对数量较
低,降低了后续人工核验的成本.
本文的主要贡献可以简要总结如下.
[2]
(1) 对 Zhu 等人 提出的基于行粒度的相似哈希算法进行实验分析,发现现有方法中大量存在的行覆盖现
[3]
象.受 TF-IDF 思想启发,本文针对代码数据集的独有特点创新性地提出了基于分语言行筛选的优化方法,在现
有方案上引入了常见行筛选器以处理行覆盖问题.通过两轮实验验证改进后算法的有效性,精确度分别为 99%
和 97%,对比现有方法分别提高了 98.6%和 52.2%.