Page 89 - Journal of Software (《软件学报》), 2020, Issue 11

Duan Xu (段旭) et al.: Vulnerability mining method based on code property graph and attention bidirectional LSTM    3405

Abstract: As information security threats grow more serious, software vulnerabilities have become one of the main threats to
computer security. Accurately mining vulnerabilities in programs is therefore a key issue in the field of information security. However,
existing static vulnerability mining methods have low accuracy on vulnerabilities whose features are not obvious. On the
one hand, rule-based methods work by matching expert-defined code vulnerability patterns against target programs. These predefined
patterns are rigid and narrow and cannot cover fine-grained features, which results in low accuracy and a high false-positive rate. On
the other hand, learning-based methods neither adequately model the features of the source code nor effectively capture its key
features, so they fail to accurately mine vulnerabilities whose features are not obvious. To solve this issue, a source-code-level
vulnerability mining method based on code property graph and attention BiLSTM is proposed. It first transforms the program
source code into a code property graph that contains semantic features, and performs program slicing to remove redundant information that
is unrelated to sensitive operations. Then, it encodes the code property graph into a feature tensor with an encoding algorithm. After that,
a neural network based on BiLSTM and an attention mechanism is trained on large-scale feature datasets. Finally, the trained neural
network model is used to mine vulnerabilities in the target program. Experimental results show that the F1 scores of the method reach
82.8%, 77.4%, 82.5%, and 78.0%, respectively, on the SARD buffer error dataset, the SARD resource management error dataset, and their two
subsets composed of C programs. These scores are significantly higher than those of the rule-based static mining tools Flawfinder and RATS and of the
learning-based program analysis model TBCNN.
Key words: vulnerability mining; deep learning; static analysis; attention mechanism; code property graph
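The attention step named in the abstract can be illustrated with a minimal sketch. It assumes a generic dot-product attention pooling over per-token BiLSTM outputs; the scoring function, the `query` vector, and the toy hidden states are hypothetical and are not taken from the paper, whose exact architecture is not given in this excerpt.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Score each time step by its dot product with `query`,
    normalise the scores with softmax, and return the weights
    together with the weighted sum of the hidden states."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(query)
    context = [sum(w * h[i] for w, h in zip(weights, hidden_states)) for i in range(dim)]
    return weights, context


# Hypothetical BiLSTM outputs for three code tokens
# (forward and backward states concatenated, dimension 4).
hidden_states = [
    [0.1, 0.0, 0.2, 0.1],
    [0.9, 0.8, 0.7, 0.9],
    [0.0, 0.1, 0.1, 0.0],
]
query = [1.0, 1.0, 1.0, 1.0]  # hypothetical learned attention vector

weights, context = attention_pool(hidden_states, query)
print(weights)
print(context)
```

In this toy input the second token receives the largest weight, which is the kind of per-token emphasis an attention layer is meant to provide for key features.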

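… one of the main threats to computer system security. Once a vulnerability is exploited by an attacker, the consequences can be severe. According to data released by the US company MITRE,
as of April 15, 2019, CVE (Common Vulnerabilities and Exposures, a public vulnerability and exposure database) already contained 121 078 …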
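… relatively rigid and narrow and cannot cover fine-grained features, which leads to low accuracy and a high false-positive rate. For example, Flawfinder and RATS are two …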
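… model the … information, and cannot effectively capture key feature information. For example, TBCNN models program source code as an abstract syntax tree and then, based on the abstract …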
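… property graph, CPG for short) and attention bidirectional LSTM, a source-code-level vulnerability mining method. The method first transforms the program source code into a code property …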
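… property graph into a feature tensor; finally, it builds a neural network model based on bidirectional LSTM and an attention mechanism, and uses the feature tensors to train the neural network …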
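… tensor, which effectively avoids the loss of semantic information. Moreover, for the subtle differences shown in the solid boxes in Fig. 1 and the redundant information shown in the dashed boxes, it separately …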
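To verify the effectiveness of the method, this paper selects Flawfinder and RATS as the rule-based baselines and runs experiments on the SARD buffer error
dataset and the SARD resource management error dataset. TBCNN is selected as the learning-based baseline. Because the static analysis tool
used by TBCNN can only parse C source programs, the comparison with TBCNN is run on the subsets of the SARD buffer error dataset and the SARD resource
management error dataset that consist of C source programs. The experimental results show that the F1 scores of the method on these four datasets reach
82.8%, 77.4%, 82.5%, and 78.0%, respectively, an improvement of more than 10% over each baseline. In addition, the visualization experiment on the attention mechanism shows that it can …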
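For reference, the F1 score used to report these results is the harmonic mean of precision and recall. A minimal sketch follows, assuming binary labels (vulnerable or not); the function name and the counts in the example are hypothetical and are not figures from the paper.

```python
def f1_score(tp, fp, fn):
    """Return F1 = 2 * P * R / (P + R) from true-positive,
    false-positive, and false-negative counts.
    Returns 0.0 when precision or recall is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only.
print(f1_score(tp=80, fp=20, fn=20))
```

Because F1 penalises both false positives and missed vulnerabilities, it suits a comparison in which the rule-based tools are criticised for a high false-positive rate.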