Page 102 - Journal of Software (《软件学报》), 2025, No. 7


Shen Qingchao et al.: An Empirical Study of Deep Learning Compiler Bugs: Current Status and Evolution Analysis 3023

and even lead to disastrous consequences sometimes. To deeply understand the characteristics of DL compiler bugs, the existing works
have analyzed 603 early bugs in DL compilers. In recent years, DL compilers have been updated rapidly, along with the introduction of a
large number of new features and the abandonment of some old ones. At the same time, several bug detection approaches for DL
compilers have been developed. Therefore, it is necessary to analyze whether the previous research conclusions on DL compiler bugs are
still applicable. In addition, there is a lack of in-depth exploration of the relationship among the symptoms, root causes, and locations of
bugs, and the characteristics of bug-revealing tests and bug-fixing patches have not been studied. To deeply analyze the evolution process
of the current DL compiler bug characteristics and distribution over time, 613 recently fixed bugs in three mainstream DL compilers (i.e.,
TVM of Apache, Glow of Facebook, and AKG of Huawei) are collected in this study, and the characteristics such as root causes,
symptoms, and locations of bugs are manually labeled. Based on the labeling results, this study deeply explores the distribution
characteristics of bugs from multiple dimensions and compares them with those in the existing works. Meanwhile, we further investigate the
characteristics of bug-revealing regression tests and bug-fixing patches. In total, this study summarizes 12 major findings to
comprehensively understand the current situation and evolution of DL compiler bugs and provide a series of feasible suggestions for the
detection, localization, and repair of DL compiler bugs. Finally, to verify the effectiveness of the research findings in this work, a testing tool,
CfgFuzz, based on optimization configurations is developed. CfgFuzz conducts combinatorial tests on compilation configuration options and
finally detects 8 TVM bugs, 7 of which have been confirmed or fixed by developers.
Key words: deep learning compiler; bug analysis; empirical study; bug detection; bug characteristic
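
The idea of combinatorial testing over compilation configuration options can be illustrated with a minimal sketch. It is not CfgFuzz itself: the option names, the toy `toy_compile_and_run` stand-in, the exhaustive enumeration, and the differential oracle (comparing every configuration against a baseline) are all assumptions made for illustration, and none of them is a real TVM flag or API.

```python
from itertools import product

# Hypothetical option space; the names are illustrative, not real compiler flags.
OPTION_SPACE = {
    "opt_level": [0, 1, 2, 3],
    "fuse_ops": [True, False],
    "fold_constants": [True, False],
}


def enumerate_configs(space):
    """Yield every combination of option values as a dict."""
    names = sorted(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def toy_compile_and_run(config, x):
    """Stand-in for 'compile the model under config, then execute it'.

    Computes 2*x + 1, with a deliberately injected miscompilation for one
    option combination so that the harness has something to find.
    """
    result = 2 * x + 1
    if config["opt_level"] == 3 and config["fuse_ops"] and not config["fold_constants"]:
        result += 1  # injected bug, for demonstration only
    return result


def find_inconsistent_configs(space, run, x):
    """Flag configurations that crash or disagree with the baseline output."""
    configs = list(enumerate_configs(space))
    baseline = run(configs[0], x)  # first configuration serves as the reference
    suspicious = []
    for config in configs[1:]:
        try:
            output = run(config, x)
        except Exception as exc:  # a crash is also a bug symptom
            suspicious.append((config, repr(exc)))
            continue
        if output != baseline:
            suspicious.append((config, output))
    return suspicious


if __name__ == "__main__":
    for config, observed in find_inconsistent_configs(OPTION_SPACE, toy_compile_and_run, 5):
        print("inconsistent:", config, "->", observed)
```

Exhaustive enumeration is only feasible for small option spaces; a practical tool would more likely sample combinations (for example, pairwise coverage) and would replace the toy function with real compilation and execution.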

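Deep learning (DL) has been widely applied in many fields, such as autonomous driving [1], face recognition [2], and software
engineering [3−6]. In recent years, deep learning compilers (DL compilers) have emerged to make DL models easy to deploy and
efficient at inference. A DL compiler takes a pre-trained DL model as input and, after performing a series of optimizations,
generates binary code that executes efficiently on a specific hardware platform.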
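As foundational software in the deep learning field, DL compilers play an increasingly important role, and a buggy DL compiler
can cause widespread harm. Specifically, bugs in DL compilers have two potential impacts. On the one hand, a bug in a DL compiler
may propagate to every DL model it compiles, planting hidden safety risks in a large number of DL models. On the other hand,
DL compiler bugs complicate the diagnosis of anomalies in DL software: developers can hardly tell whether anomalous behavior of
DL software stems from an error made during model construction or from a bug introduced by the DL compiler during compilation,
which makes bugs in DL software harder to track down. Therefore, it is crucial to deeply understand DL compiler bugs and to
improve the quality of DL compilers.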
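DL compilers borrow the compilation pipeline of traditional compilers to solve the problem of optimizing and deploying models,
but the two kinds of compilers differ significantly in what they process and how they optimize [7]. On the one hand, a
traditional compiler takes a source program written in a high-level programming language (e.g., Java or C++) as input and
outputs program code in a lower-level form (e.g., assembly language or machine code). Unlike the inputs of traditional compilers,
the programs processed by DL compilers (i.e., DL models) lack explicit logical structure. On the other hand, DL compilers have
their own multi-level intermediate representations (IRs) and a large number of optimizations designed for the characteristics of
deep learning (such as operator fusion). Therefore, existing bug detection and localization techniques for traditional compilers
are not applicable to DL compilers.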
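To understand the characteristics of DL compiler bugs, Shen et al. [8] conducted the first empirical study of DL compiler bugs.
That work studied 603 bugs in three DL compilers, TVM [9], Glow [10], and nGraph [11], and obtained a series of practical findings
and implications that advanced DL compiler testing techniques [12,13]. In recent years, as DL compilers have developed and
iterated rapidly, the existing findings have become difficult to apply to bug detection and debugging in current DL compilers,
for three main reasons. (1) The DL compilers studied in the existing work have changed considerably. For example, the TVM
compiler has seen more than 5 000 code changes (i.e., pull requests merged into the project repository) in the past three years,
and the nGraph project is no longer maintained; meanwhile, new DL compilers have emerged in recent years (e.g., Huawei's AKG [14]
and Intel's OpenVINO [15]). (2) The bugs studied in the existing work were all detected and fixed before November 2020, which is
now a long time ago. (3) The existing work focuses only on the characteristics of the bugs themselves and does not mine the
characteristics of the regression tests that trigger the bugs or of the patches that fix them. Studying regression tests and
bug-fixing patches likewise helps in designing automated bug detection and repair methods.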
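To investigate the characteristics of current DL compiler bugs more comprehensively, this paper selects three DL compilers as
study subjects: Apache's TVM [9], Facebook's Glow [10], and Huawei's AKG [14]. For each DL compiler, we collected bugs newly fixed
after November 2020, over a period of more than 15 months. After manual inspection and deduplication, we obtained 613 DL compiler
bugs and manually labeled the root cause, symptom, and location of each bug. Next, using comparative analysis, this paper
analyzes, in terms of the root causes, symptoms, and locations of bugs, DL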
