
2340                                   Journal of Software  软件学报 Vol.32, No.8,  August 2021

                 References:
                 [1]    Dongarra JJ, Luszczek P, Petitet A. The LINPACK Benchmark: Past, present and future. Concurrency and Computation: Practice and
                     Experience, 2003,15(9):803−820.
                 [2]    TOP-500 Official website. 2021. http://www.top500.org
                 [3]    Gan XB, Hu YK, Liu J, Chi LH, Xu H, Gong CY, Li SG, Yan YH. Customizing the HPL for China accelerator. SCIENCE CHINA:
                     Information Sciences, 2018,61(4):Article No.042102.
                 [4]    Van Zee FG, Van De Geijn RA. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Trans. on Mathematical
                     Software, 2015,41(3):1−33.
                 [5]    Greer B, Henry G. High performance software on Intel Pentium Pro processors or micro-ops to TeraFLOPS. In: Proc. of the
                     Supercomputing 1997 Conf. San Jose, 1997. 1−13. [doi: 10.1145/509593.509639]
                 [6]    Jia Y, Luszczek P, Dongarra J. Multi-GPU implementation of LU factorization. In: Proc. of the Int’l Conf. on Computational
                     Science, 2012. 106−115.
                 [7]    Bach M, Kretz M, Lindenstruth V, Rohr D. Optimized HPL for AMD GPU and multi-core CPU usage. Computer Science -
                     Research and Development, 2011,26(3-4):153−164.
                 [8]    Wang F, Yang CQ, Du YF, Chen J, Yi HZ, Xu WX. Optimizing Linpack benchmark on GPU-accelerated petascale supercomputer.
                     Journal of Computer Science and Technology, 2011,26(5):854−865. [doi: 10.1007/s11390-011-0184-1]
                 [9]    Heinecke A, Vaidyanathan K, Smelyanskiy M, Kobotov A, Dubtsov R, Henry G, Shet A, Chrysos G, Dubey P. Design and
                     implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi coprocessor. In: Proc. of
                     the IEEE 27th Int’l Symp. on Parallel and Distributed Processing. 2013. [doi: 10.1109/ipdps.2013.113]
                [10]    Fatica M. Accelerating Linpack with CUDA on heterogenous clusters. In: Proc. of the 2nd Workshop on General Purpose
                     Processing on Graphics Processing Units. ACM, 2009. 46−51.
                [11]    Bach M, Rohr D. Scaling DGEMM to multiple Cayman GPUs and Interlagos many-core CPUs for HPL. 2011.
                     http://developer.amd.com/wordpress/media/2013/06/2909_1_final.pdf
                [12]    Womble D, Greenberg D, Wheat S, Riesen R. LU factorization and the LINPACK benchmark on the Intel Paragon. Sandia
                     Technical Report, Sandia National Laboratories, 1994.
                [13]    Official website. 2021. https://www.olcf.ornl.gov/summit/
                [14]    Chen RZ, Huang LB, Chen XH, Wang ZY. Optimizing HPL benchmark on multi-GPU clusters. Computer Science, 2013,40(3):
                     107−110 (in Chinese with English abstract).
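                 Chinese-language reference (original of [14]):
                [14]    Chen RZ, Huang LB, Chen XH, Wang ZY. Dynamic load balancing optimization of HPL on single-node multi-GPU clusters.
                     Computer Science (计算机科学), 2013,40(3):107−110.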


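                 Sun Qiao (孙乔) (1989-), male, Ph.D., senior engineer. His main research interests are parallel programming models and parallel algorithms.

                 Ma Wenjing (马文静) (1981-), female, Ph.D., associate research fellow, CCF professional member. Her main research interest is high-performance computing.

                 Sun Jiachang (孙家昶) (1942-), male, research fellow, Ph.D. supervisor. His main research interests are the methods, theory, and applications of scientific and engineering computing, and parallel computing.

                 Zhao Yuwen (赵玉文) (1987-), female, Ph.D. candidate, assistant research fellow, CCF professional member. Her main research interest is high-performance computing.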