Journal of Software (《软件学报》), 2021, No. 8


Shui Chaoyang, et al.: Optimization and analysis of HPL on domestic heterogeneous systems (p. 2327)

Fig.8 HPL performance on the Sugon exascale prototype supercomputer

5 Conclusion

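The heterogeneous HPL algorithm proposed in this paper removes the data-transfer bottleneck by storing the matrix in the memory of the domestic accelerator, hides communication overhead with a fine-grained, multi-threaded software pipeline, and achieves cross-platform portability through the abstraction of basic accelerator operations provided by HPCX, a lightweight heterogeneous acceleration framework. On comparable heterogeneous systems, our implementation far outperforms existing open-source work and also outperforms NVIDIA's closed-source HPL program. The algorithm also scales well, reaching 71.1% efficiency in an HPL run on 512 nodes of the Sugon exascale prototype supercomputer. In addition, our performance model proves to be accurate and can serve as a reference for predicting HPL performance on future exascale heterogeneous supercomputers.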
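
An efficiency figure such as the 71.1% above is conventionally the ratio of measured HPL performance to the theoretical peak of the machine. The following is a minimal sketch of that arithmetic, assuming the standard HPL operation count of 2/3·N³ + 3/2·N². The function names and the numbers in the example run are hypothetical placeholders, not measurements from this work.

```python
def hpl_flops(n: int) -> float:
    """Operation count HPL credits for solving a dense n x n linear system."""
    return (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2


def hpl_efficiency(n: int, seconds: float, peak_flops: float) -> float:
    """Ratio of achieved HPL performance to theoretical peak (both in flop/s)."""
    achieved = hpl_flops(n) / seconds
    return achieved / peak_flops


if __name__ == "__main__":
    # Hypothetical inputs for illustration only (not from the paper).
    n = 100_000          # matrix order N
    seconds = 100.0      # wall-clock time of the solve
    peak = 1.0e13        # theoretical peak of the whole system, in flop/s
    print(f"HPL efficiency: {hpl_efficiency(n, seconds, peak):.1%}")
```

Given a measured run time and a system's aggregate peak, the same two functions give the efficiency at any node count, which is how a scaling curve like the one in Fig.8 would typically be summarized.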

