Journal of Software (软件学报), 2021, Issue 8

Cai Yu, et al.: Optimization of a high-performance CPU-side BLAS library for the heterogeneous HPL algorithm


                 6    Conclusion and Future Work

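                    Targeting the architectural characteristics of the CPU platform, this paper developed HBLIS, an optimized
                 math library built on the open-source BLIS library, and described in detail the optimization methods for the
                 BLAS functions at each level that HPL calls. Through memory-access optimization, instruction-set optimization,
                 and multithreaded parallel optimization, these functions show broad and significant performance gains over
                 their MKL counterparts, while a few functions with small computational workloads perform essentially on par
                 with MKL. The most important function, DGEMM, improved by about 25%, and DGEMV improved by about 62%. In the
                 heterogeneous single-node HPL test, the overall efficiency of HPL using HBLIS was 11.8% higher than that of
                 HPL calling MKL, making better use of the computing power of the heterogeneous high-performance platform.

                    This work also has limitations. Performance tests of HBLIS show that DGEMM, DGEMV, DTRSV, and IDAMAX were
                 optimized well, with large gains over the corresponding functions in both MKL and the open-source BLIS
                 library. However, the optimized DTRSM and DSCAL did not outperform Intel MKL, which may be closely related to
                 computational workload, multithreading library support, and compiler optimization support. Optimizing a
                 low-level subroutine library is a systems-engineering effort: it requires not only an optimized hardware
                 architecture design but also optimization of low-level software such as compilers and threading libraries.
                 Optimizing the remaining BLAS functions not covered in this paper is the next direction for work on HBLIS.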
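
As a reminder of what memory-access optimization means for DGEMM, the following is a minimal sketch of cache blocking in plain C. It is not the HBLIS kernel: the function names (`dgemm_naive`, `dgemm_blocked`, `max_abs_diff`), the row-major layout, and the block sizes are all assumptions made for illustration, and a production library would additionally pack panels, use hand-written SIMD micro-kernels, and parallelize the outer loops.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block sizes for illustration only; real libraries tune
   these per cache level and micro-architecture. */
enum { BLOCK_M = 32, BLOCK_N = 32, BLOCK_K = 32 };

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Reference triple loop: C += A * B, all matrices row-major.
   A is m x k, B is k x n, C is m x n. */
void dgemm_naive(size_t m, size_t n, size_t k,
                 const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            double acc = C[i * n + j];
            for (size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

/* Cache-blocked variant: same result, but the loops walk small tiles of
   A, B and C so that each tile is reused while it is still in cache. */
void dgemm_blocked(size_t m, size_t n, size_t k,
                   const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i0 = 0; i0 < m; i0 += BLOCK_M) {
        size_t i1 = min_size(i0 + BLOCK_M, m);
        for (size_t p0 = 0; p0 < k; p0 += BLOCK_K) {
            size_t p1 = min_size(p0 + BLOCK_K, k);
            for (size_t j0 = 0; j0 < n; j0 += BLOCK_N) {
                size_t j1 = min_size(j0 + BLOCK_N, n);
                /* Update one BLOCK_M x BLOCK_N tile of C. */
                for (size_t i = i0; i < i1; ++i)
                    for (size_t p = p0; p < p1; ++p) {
                        double a = A[i * k + p];
                        /* Unit-stride inner loop over rows of B and C. */
                        for (size_t j = j0; j < j1; ++j)
                            C[i * n + j] += a * B[p * n + j];
                    }
            }
        }
    }
}

/* Largest absolute element-wise difference between two arrays. */
double max_abs_diff(size_t len, const double *x, const double *y)
{
    double worst = 0.0;
    for (size_t i = 0; i < len; ++i) {
        double d = x[i] - y[i];
        if (d < 0.0) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

Calling both routines on the same inputs and comparing the outputs with `max_abs_diff` is a simple correctness check. Any speedup from blocking depends on matrix size, cache sizes, and compiler flags, so it should be measured on the target machine rather than assumed.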
