Page 24 - Journal of Software (《软件学报》), 2021, Issue 8


2306 Journal of Software 软件学报 Vol.32, No.8, August 2021

[21] Fatica M. Accelerating Linpack with CUDA on heterogenous clusters. In: Proc. of the 2nd Workshop on General Purpose
Processing on Graphics Processing Units (GPGPU 2009). Washington, 2009. 46−51.
[22] Yang C, Wang F, Du Y, Chen J, Liu J, Yi H. Adaptive optimization for petascale heterogeneous CPU/GPU computing. In: Proc. of
the 2010 IEEE Int’l Conf. on Cluster Computing. Heraklion, 2010. 19−28.
[23] Yamazaki I, Tomov S, Dongarra J. One-sided dense matrix factorizations on a multicore with multiple GPU accelerators. Procedia
Computer Science, 2012,9(11):37−46.
[24] Yang C, Chen C, Tang T, Chen X, Fang J, Xue J. An energy-efficient implementation of LU factorization on heterogeneous systems.
In: Proc. of the IEEE 22nd Int’l Conf. on Parallel and Distributed Systems (ICPADS). Wuhan, 2016. 971−979.
[25] Jo G, Nah J, Lee J, Kim J, Lee J. Accelerating LINPACK with MPI-OpenCL on clusters of multi-GPU nodes. IEEE Trans. on
Parallel & Distributed Systems, 2015,26(7):1814−1825.
[26] Li J, Li X, Tan G, Chen M, Sun N. An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs. In: Proc. of the 26th
ACM Int’l Conf. on Supercomputing (ICS 2012). New York, 2012. 377−386.
[27] Li LS, Yang WH, Ma WJ, Zhang Y, Zhao H, Zhao HT, Li HY, Sun JC. Optimization of HPL on complex heterogeneous computing
system. Ruan Jian Xue Bao/Journal of Software, 2021,32(8):2307−2318 (in Chinese with English abstract).
http://www.jos.org.cn/1000-9825/6003.htm [doi: 10.13328/j.cnki.jos.006003]
[28] http://www.netlib.org/benchmark/hpl/HPL_pdpanllN.html
[29] Sun CG, Lan J, Jiang H. Genetic algorithm for deciding blocking size parameters of GEMM in BLAS. Computer Engineering &
Science, 2018,40(5):798−804 (in Chinese with English abstract).
[30] Low T, Igual F, Smith T, Quintana-Orti E. Analytical modeling is enough for high-performance BLIS. ACM Trans. on
Mathematical Software, 2016,43(2):1−18.
[31] Dagum L, Menon R. OpenMP: An industry standard API for shared-memory programming. IEEE Computational Science and
Engineering, 1998,5(1):46−55.
[32] https://computing.llnl.gov/tutorials/pthreads/

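Appendix: references originally published in Chinese:
[10] Gu NJ, Li K, Chen GL, Wu C. Optimization of the BLAS library based on the Loongson 2F architecture. Journal of University of Science and Technology of China, 2008,38(7):854−859.
[16] Sun JD, Sun Q, Deng P, Yang C. Research on optimization of Level 1 and Level 2 BLAS functions on the Sunway many-core processor. Computer Systems & Applications, 2017,26(11):101−108.
[17] Liu H, Liu FF, Zhang P, Yang C, Jiang LJ. Optimization of the Level 3 BLAS GEMM function on the Sunway 1600. Computer Systems & Applications, 2016,25(12):234−239.
[18] Guo ZH, Guo SZ, Xu JC, Zhang ZT. Register allocation method for basic math libraries on heterogeneous multi-core platforms. Journal of Computer Applications, 2014,34(S1):86−89.
[27] Li LS, Yang WH, Ma WJ, Zhang Y, Zhao H, Zhao HT, Li HY, Sun JC. Optimization of HPL on complex heterogeneous computing system. Journal of Software, 2021,32(8):2307−2318 (in Chinese with English abstract).
http://www.jos.org.cn/1000-9825/6003.htm [doi: 10.13328/j.cnki.jos.006003]
[29] Sun CG, Lan J, Jiang H. A BLAS library optimization method based on genetic algorithm. Computer Engineering & Science, 2018,40(5):798−804.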

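蔡雨 (1988-), male, senior supervisory engineer. Main research areas: CPU architecture, performance optimization.

刘子行 (1977-), male, senior supervisory engineer. Main research area: security software.

孙成国 (1985-), male, senior supervisory engineer. Main research areas: high-performance computing, performance optimization.

康梦博 (1989-), female, senior engineer. Main research area: performance optimization.

杜朝晖 (1975-), male, chief engineer. Main research area: security software.

李双双 (1984-), male, senior engineer. Main research area: math libraries.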
