Page 58 - Journal of Software (软件学报), 2021, No. 8


2340 Journal of Software 软件学报 Vol.32, No.8, August 2021

References:
[1] Dongarra JJ, Luszczek P, Petitet A. The LINPACK Benchmark: Past, present and future. Concurrency and Computation: Practice and
Experience, 2003,15(9):803−820.
[2] TOP-500 Official website. 2021. http://www.top500.org
[3] Gan XB, Hu YK, Liu J, Chi LH, Xu H, Gong CY, Li SG, Yan YH. Customizing the HPL for China accelerator. SCIENCE CHINA:
Information Sciences, 2018,61(4):Article No.042102.
[4] Van Zee FG, Van De Geijn RA. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Trans. on Mathematical
Software, 2013,41(3):1−33.
[5] Greer B, Henry G. High performance software on Intel Pentium Pro processors or micro-ops to TeraFLOPS. In: Proc. of the
Supercomputing 1997 Conf. San Jose, 1997. 1−13. [doi: 10.1145/509593.509639]
[6] Jia Y, Luszczek P, Dongarra J. Multi-GPU implementation of LU factorization. In: Proc. of the Int’l Conf. on Computational
Science, 2012. 106−115.
[7] Bach M, Kretz M, Lindenstruth V, Rohr D. Optimized HPL for AMD GPU and multi-core CPU usage. Computer Science -
Research and Development, 2011,26(3-4):153−164.
[8] Wang F, Yang CQ, Du YF, Chen J, Yi HZ, Xu WX. Optimizing Linpack benchmark on GPU-accelerated petascale supercomputer.
Journal of Computer Science and Technology, 2011,26(5):854−865. [doi: 10.1007/s11390-011-0184-1]
[9] Heinecke A, Vaidyanathan K, Smelyanskiy M, Kobotov A, Dubtsov R, Henry G, Shet A, Chrysos G, Dubey G. Design and
implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi coprocessor. In: Proc. of
the IEEE 27th Int’l Symp. on Parallel and Distributed Processing. 2013. [doi: 10.1109/ipdps.2013.113]
[10] Fatica M. Accelerating Linpack with CUDA on heterogenous clusters. In: Proc. of the 2nd Workshop on General Purpose
Processing on Graphics Processing Units. ACM, 2009. 46−51.
[11] Bach M, Rohr D. Scaling DGEMM to multiple Cayman GPUs and Interlagos many-core CPUs for HPL. 2011.
http://developer.amd.com/wordpress/media/2013/06/2909_1_final.pdf
[12] Womble D, Greenberg D, Wheat S, Riesen R. LU factorization and the LINPACK benchmark on the Intel Paragon. Sandia
Technical Report, Sandia National Laboratories, 1994.
[13] Official website. 2021. https://www.olcf.ornl.gov/summit/
[14] Chen RZ, Huang LB, Chen XH, Wang ZY. Optimizing HPL benchmark on multi-GPU clusters. Computer Science, 2013,40(3):
107−110 (in Chinese with English abstract).
References in Chinese:
[14] 陈任之,黄立波,陈顼颢,王志英.单节点多 GPU 集群下 HPL 动态负载均衡优化.计算机科学,2013,40(3):107−110.

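SUN Qiao (孙乔) (1989-), male, Ph.D., senior engineer. Main research areas: parallel programming models and parallel algorithms.

MA Wenjing (马文静) (1981-), female, Ph.D., associate research professor, CCF professional member. Main research area: high performance computing.

SUN Jiachang (孙家昶) (1942-), male, research professor, doctoral supervisor. Main research areas: methods, theory, and applications of scientific and engineering computing, and parallel computing.

ZHAO Yuwen (赵玉文) (1987-), female, Ph.D. candidate, assistant research professor, CCF professional member. Main research area: high performance computing.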
