《软件学报》(Journal of Software), 2021, No. 8
蔡雨 (Cai Yu) et al.: Optimization of the High-Performance CPU-Side BLAS Library in the Heterogeneous HPL Algorithm
6 Summary and Outlook
Targeting the architectural characteristics of the CPU platform, this paper developed the optimized HBLIS math library on the basis of the open-source BLIS library, and described in detail how each level of BLAS function called by HPL is optimized. Through memory-access optimization, instruction-set (vectorization) optimization, and multi-threaded parallelization, the BLAS functions called by HPL achieve broad and significant performance gains over the corresponding MKL functions; a few functions with very little computation merely match their MKL counterparts. Most notably, the key DGEMM function improved by about 25%, and DGEMV by about 62%.
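The three classes of optimization above can be sketched on a toy row-major DGEMM (C += A·B). This is a simplified illustration under assumed block sizes, not the HBLIS kernel itself: a production kernel additionally packs panels into contiguous buffers and uses hand-written SIMD micro-kernels.

```c
#include <stddef.h>

enum { MC = 64, NC = 64, KC = 64 };   /* assumed cache-blocking factors */

/* Toy blocked DGEMM, row-major: C (m x n) += A (m x k) * B (k x n). */
void dgemm_blocked(size_t m, size_t n, size_t k,
                   const double *A, const double *B, double *C)
{
    /* Multi-thread optimization: split the M dimension across threads. */
    #pragma omp parallel for schedule(static)
    for (size_t i0 = 0; i0 < m; i0 += MC)
        for (size_t p0 = 0; p0 < k; p0 += KC)
            for (size_t j0 = 0; j0 < n; j0 += NC) {
                /* Memory-access optimization: operate on sub-blocks small
                 * enough to stay resident in cache. */
                size_t imax = i0 + MC < m ? i0 + MC : m;
                size_t pmax = p0 + KC < k ? p0 + KC : k;
                size_t jmax = j0 + NC < n ? j0 + NC : n;
                for (size_t i = i0; i < imax; ++i)
                    for (size_t p = p0; p < pmax; ++p) {
                        double a = A[i*k + p];
                        /* Instruction-set optimization: the unit-stride
                         * inner loop maps cleanly onto SIMD instructions. */
                        for (size_t j = j0; j < jmax; ++j)
                            C[i*n + j] += a * B[p*n + j];
                    }
            }
}
```

Blocking keeps sub-matrices cache-resident, the unit-stride inner loop vectorizes, and the OpenMP pragma distributes row blocks across cores; these correspond, in miniature, to the three optimization directions discussed above.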
In the heterogeneous single-node HPL test, HPL built on HBLIS achieved an overall efficiency 11.8% higher than HPL built on MKL, better exploiting the computing power of the heterogeneous high-performance platform. This work also has limitations: although the HBLIS performance tests show that DGEMM, DGEMV, DTRSV, and IDAMAX are optimized very effectively, with large gains over the corresponding functions in both MKL and the open-source BLIS library, the optimized DTRSM and DSCAL still do not outperform Intel MKL. This is likely tied to their arithmetic intensity and to the available multithreading-library and compiler-optimization support. Optimizing a low-level subroutine library is a systems effort: it requires not only a well-designed hardware architecture but also optimization of the underlying software, such as the compiler and threading library. Optimizing the remaining BLAS functions not covered in this paper is the next step for the HBLIS library.
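The DSCAL result can be read through arithmetic intensity: DSCAL performs one multiply per element while moving 16 bytes (one load plus one store), so it is bound by memory bandwidth rather than by compute, leaving little headroom over an already-tuned vendor routine. A minimal sketch (the OpenMP scheme here is an illustrative assumption, not the HBLIS code):

```c
#include <stddef.h>

/* Toy DSCAL: x := alpha * x. One flop per element against 16 bytes of
 * memory traffic; once bandwidth is saturated, additional SIMD width
 * or threads yield little further gain. */
void dscal_sketch(size_t n, double alpha, double *x)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i)
        x[i] *= alpha;
}
```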