《软件学报》 (Journal of Software), 2021, Issue 8
Journal of Software (软件学报), ISSN 1000-9825, CODEN RUXUEW, E-mail: jos@iscas.ac.cn
Journal of Software, 2021, 32(8): 2289−2306 [doi: 10.13328/j.cnki.jos.006002] http://www.jos.org.cn
© Institute of Software, Chinese Academy of Sciences. All rights reserved. Tel: +86-10-62562563
CPU-Side High-Performance BLAS Library Optimization in the Heterogeneous HPL Algorithm ∗
蔡 雨, 孙成国, 杜朝晖, 刘子行, 康梦博, 李双双
(Information Technology Co., Ltd., Suzhou, Jiangsu 215000, China)
Corresponding author: SUN Cheng-Guo (孙成国), E-mail: sunchengguo1@163.com
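Abstract: Raising the efficiency of heterogeneous HPL (high-performance Linpack) requires fully exploiting the computing
power of both the accelerators and the general-purpose CPU. The accelerators integrate more computing cores and carry out
most of the computation, while the general-purpose CPU handles task scheduling and also takes part in the computation.
Provided that tasks are partitioned sensibly and the load is balanced, optimizing CPU-side computing performance is
particularly important for overall efficiency. Optimizing the BLAS (basic linear algebra subprograms) functions for the
architectural characteristics of a specific platform can often make fuller use of the general-purpose CPU and raise
overall system efficiency. The BLIS (BLAS-like library instantiation software) library is an open-source BLAS framework
that is easy to develop with, easy to port, and modular. Based on the architecture of the heterogeneous platform and the
characteristics of the HPL algorithm, this work uses the three-level cache, vectorized instructions, and multi-threaded
parallelism to optimize the BLAS functions of every level called on the CPU side, and applies auto-tuning to optimize the
matrix blocking parameters, producing HBLIS, a BLIS library optimized for heterogeneous environments. Compared with MKL,
overall HPL performance improves by 11.8%.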
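Key words: BLAS; genetic algorithm auto-tuning; vectorized instructions; data prefetching; multi-threaded parallelism
Chinese Library Classification number: TP303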
Citation format (Chinese): 蔡雨,孙成国,杜朝晖,刘子行,康梦博,李双双.异构 HPL 算法中 CPU 端高性能 BLAS 库优化.软件学报,2021,
32(8):2289–2306. http://www.jos.org.cn/1000-9825/6002.htm
Citation format (English): Cai Y, Sun CG, Du ZH, Liu ZX, Kang MB, Li SS. CPU-side high performance BLAS library optimization in
heterogeneous HPL algorithm. Ruan Jian Xue Bao/Journal of Software, 2021,32(8):2289–2306 (in Chinese).
http://www.jos.org.cn/1000-9825/6002.htm
CPU-side High Performance BLAS Library Optimization in Heterogeneous HPL Algorithm
CAI Yu, SUN Cheng-Guo, DU Zhao-Hui, LIU Zi-Xing, KANG Meng-Bo, LI Shuang-Shuang
(Information Technology Co., Ltd., Suzhou 215000, China)
Abstract: Improving the efficiency of heterogeneous HPL requires making full use of the computing power of both the acceleration
components and the CPU. The acceleration components integrate more computing cores and are responsible for the main calculation, while
the general-purpose CPU is responsible for task scheduling and also participates in the calculation. Given a reasonable division of tasks
and a balanced load, optimizing CPU-side computing performance is particularly important for improving overall efficiency. Optimizing
the basic linear algebra subprograms (BLAS) functions for the architectural characteristics of a specific platform can often make fuller
use of the general-purpose CPU's computing capability and improve overall system efficiency. The BLIS (BLAS-like library instantiation
software) library is an open-source BLAS framework that offers ease of development, portability, and modularity. Based on the
architecture of the heterogeneous system platform and the characteristics of the HPL algorithm, this study uses the three-level cache,
vectorized instructions, and multi-threaded parallelism to optimize the BLAS functions of every level called on the CPU side, and applies
auto-tuning to optimize the matrix blocking parameters, yielding HBLIS, a BLIS library optimized for heterogeneous environments.
Compared with MKL, the overall performance of HPL using the optimized HBLIS is improved by 11.8%.
Key words: BLAS; genetic algorithm auto-tuning; vectorization instruction; data prefetching; multi-threading parallelization
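To make the notion of "matrix blocking parameters" concrete, the following is a minimal sketch and not the HBLIS implementation. It assumes a pure-Python matrix product partitioned by three block sizes and tunes them with a plain exhaustive timing search, which stands in here for the genetic-algorithm auto-tuning named in the key words. The names `naive_gemm`, `blocked_gemm`, `tune_blocks`, `mc`, `kc`, and `nc` are hypothetical and chosen only for illustration.

```python
import time


def naive_gemm(a, b):
    """Reference product C = A * B for matrices stored as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def blocked_gemm(a, b, mc, kc, nc):
    """Compute C = A * B block by block.

    mc, kc and nc are the block sizes along the m, k and n dimensions
    (illustrative parameters, assumed for this sketch).
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for jc in range(0, n, nc):              # column panel of B and C
        for pc in range(0, k, kc):          # slice of the shared k dimension
            for ic in range(0, m, mc):      # row block of A and C
                for i in range(ic, min(ic + mc, m)):
                    row_a, row_c = a[i], c[i]
                    for p in range(pc, min(pc + kc, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(jc, min(jc + nc, n)):
                            row_c[j] += a_ip * row_b[j]
    return c


def tune_blocks(a, b, candidates):
    """Time every (mc, kc, nc) candidate once and return the fastest.

    An exhaustive search is used for brevity; it is a stand-in for,
    not a reproduction of, a genetic-algorithm tuner.
    """
    best, best_time = None, float("inf")
    for mc, kc, nc in candidates:
        start = time.perf_counter()
        blocked_gemm(a, b, mc, kc, nc)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = (mc, kc, nc), elapsed
    return best
```

Calling `tune_blocks(a, b, [(32, 64, 128), (64, 64, 64)])` returns whichever candidate ran fastest on the given inputs. In an optimized library the block sizes would be matched to the cache hierarchy of the target CPU, and the innermost loops would be replaced by vectorized kernels.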
BLAS is the abbreviation of basic linear algebra subprograms and is currently a widely used core linear algebra
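∗ This paper was recommended by Research Professor SUN Jia-Chang (孙家昶) and Research Professor LI Hui-Yuan (李会元), guest editors of the special issue "Development and Testing of Domestic Complex Heterogeneous High-Performance Numerical Software".
Received: 2019-07-25; Revised: 2019-12-05, 2020-01-22, 2020-03-19; Accepted: 2020-03-27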