Page 45 - Journal of Software (《软件学报》), 2021, No. 8

Shui Chaoyang et al.: Optimization and analysis of HPL on domestic heterogeneous systems                 2327

                                      Fig.8    HPL performance of the Sugon E-prototype supercomputer

                 5    Conclusion

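                    The heterogeneous HPL algorithm proposed in this paper removes the data-transfer bottleneck by keeping
                 the matrix resident in the memory of the domestic accelerator, hides communication overhead through a
                 fine-grained, multithreaded software pipeline, and runs across platforms by building on the abstraction of
                 basic accelerator operations provided by HPCX, a lightweight heterogeneous acceleration framework. On
                 comparable heterogeneous systems, our implementation far outperforms existing open-source work and also
                 outperforms NVIDIA's closed-source HPL program. The algorithm also scales well: in an HPL run on 512 nodes
                 of the Sugon E-prototype supercomputer it reaches 71.1% efficiency. In addition, our performance model
                 proves to be accurate and can serve as a reference for predicting HPL performance on future exascale
                 heterogeneous supercomputers.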
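The effect of hiding communication behind computation can be illustrated with a simple timing model. The sketch below is not the implementation described in this paper; it only assumes that the communication for step i+1 may run concurrently with the computation of step i, so each overlapped stage costs the larger of the two. The function names `serial_time` and `pipelined_time` and all cost values are hypothetical and chosen for illustration only.

```python
from typing import Sequence


def serial_time(comm: Sequence[float], comp: Sequence[float]) -> float:
    """Total time when each step first communicates, then computes."""
    return sum(c + u for c, u in zip(comm, comp))


def pipelined_time(comm: Sequence[float], comp: Sequence[float]) -> float:
    """Total time when the communication of step i+1 overlaps the compute of step i."""
    if len(comm) != len(comp):
        raise ValueError("comm and comp must have the same length")
    if not comm:
        return 0.0
    total = comm[0]  # the first transfer has nothing to hide behind
    for i in range(len(comp) - 1):
        # Compute of step i and communication of step i+1 run concurrently,
        # so the stage lasts as long as the slower of the two.
        total += max(comp[i], comm[i + 1])
    total += comp[-1]  # the last compute has no following transfer
    return total


# Hypothetical per-step costs in arbitrary time units (not measured data).
comm = [2.0, 2.0, 2.0]
comp = [5.0, 5.0, 5.0]
print(serial_time(comm, comp))     # 21.0
print(pipelined_time(comm, comp))  # 17.0
```

In this model the communication is fully hidden whenever each step's computation takes at least as long as the next step's transfer; only the first transfer remains exposed.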
