Page 37 - Journal of Software (软件学报), 2021, Issue 8

Journal of Software (软件学报) ISSN 1000-9825, CODEN RUXUEW E-mail: jos@iscas.ac.cn
Journal of Software, 2021,32(8):2319−2328 [doi: 10.13328/j.cnki.jos.006004] http://www.jos.org.cn
© Institute of Software, Chinese Academy of Sciences. All rights reserved. Tel: +86-10-62562563

Optimization and Analysis of HPL on Domestic Heterogeneous System ∗

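SHUI Chao-Yang¹,², YU Xian-Zhi¹,², WANG Yin-Shan¹,², TAN Guang-Ming¹,²
1 (Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China)
2 (University of Chinese Academy of Sciences, Beijing 100190, China)
Corresponding author: TAN Guang-Ming, E-mail: tgm@ict.ac.cn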

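Abstract: As heterogeneous systems become an important choice for building supercomputers, coordinating CPUs and accelerators
so that the full computing performance of a heterogeneous system is exploited is of great significance. HPL is the most important
benchmark in high-performance computing. The traditional approach, which takes an HPL algorithm designed for CPU-only systems and
uses the accelerator only to speed up matrix multiplication, can no longer achieve good performance. To address this problem, this
work proposes an HPL performance model and a multithreaded fine-grained pipelined HPL algorithm for heterogeneous systems built
from domestic processors and domestic accelerators. A lightweight cross-platform heterogeneous acceleration framework, HPCX, is
implemented to support a cross-platform HPL algorithm. The performance model accurately predicts HPL performance on similar
heterogeneous systems. On the NVIDIA GPU platform, the HPL algorithm outperforms NVIDIA's official closed-source nvhpl program
by 9%. On the domestic-processor, domestic-accelerator platform at a scale of 512 nodes, the optimized HPL algorithm achieves a
measured peak performance of 2.3 PFLOPS and a floating-point efficiency of 71.1%.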
Key words: HPL; heterogeneous system; cross-platform; performance modeling; exascale computing
Chinese Library Classification: TP303

Citation format (Chinese): 水超洋,于献智,王银山,谭光明.国产异构系统上 HPL 的优化与分析.软件学报,2021,32(8):2319–2328. http://www.jos.org.cn/1000-9825/6004.htm
Citation format (English): Shui CY, Yu XZ, Wang YS, Tan GM. Optimization and analysis of HPL on domestic heterogeneous system.
Ruan Jian Xue Bao/Journal of Software, 2021,32(8):2319−2328 (in Chinese). http://www.jos.org.cn/1000-9825/6004.htm

Optimization and Analysis of HPL on Domestic Heterogeneous System
SHUI Chao-Yang¹,², YU Xian-Zhi¹,², WANG Yin-Shan¹,², TAN Guang-Ming¹,²
1 (Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China)
2 (University of Chinese Academy of Sciences, Beijing 100190, China)
Abstract: As heterogeneous systems become one of the most important choices for building supercomputers, orchestrating CPUs and
accelerators to exploit the computing power of heterogeneous systems is of great significance. HPL is the most important benchmark in
the HPC field, but the traditional HPL algorithm, which targets CPU-only systems, cannot achieve high performance by only offloading
matrix multiplication to accelerators. To solve this problem, this work proposes an HPL performance model and a multithreaded
fine-grained pipelining algorithm for domestic-processor-domestic-accelerator heterogeneous systems. In addition, a lightweight
cross-platform heterogeneous framework is implemented to carry out a cross-platform HPL algorithm. The proposed performance model
predicts HPL performance accurately on similar heterogeneous systems. On the NVIDIA platform, the proposed HPL algorithm outperforms
NVIDIA's proprietary counterpart by 9%. On the domestic-processor-domestic-accelerator platform, the finally optimized Linpack

∗ Foundation item: National Key Research and Development Program of China (2018YFB0204400, 2016YFB0201305, 2016YFB0200803,
2016YFB0200300); Strategic Priority Research Program of the Chinese Academy of Sciences (Category C) (XDC01030000);
National Natural Science Foundation of China (61972377, 61432018, 61702483); Key Research Program of Frontier Sciences of the
Chinese Academy of Sciences (QYZDJ-SSW-JSC035)
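This paper was recommended by Research Fellows Sun Jiachang and Li Huiyuan, guest editors of the special issue "Development and Testing of Domestic Complex Heterogeneous High-Performance Numerical Software".
Received: 2019-08-16; Revised: 2019-12-05; Accepted: 2020-01-22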
