
    Our tests were carried out on a domestic complex heterogeneous supercomputer system, measuring single-node computing efficiency, whole-machine computing efficiency, and whole-machine weak scalability. The single-node computing efficiency reached 1.82% of peak, the whole-machine computing efficiency reached 1.67% of peak, and the whole-machine weak-scaling parallel efficiency reached 92%.

9    Conclusions and Future Work

    Targeting a domestic complex heterogeneous supercomputer, this paper developed a heterogeneous many-core parallel HPCG implementation, investigating coloring algorithms, heterogeneous task partitioning, and sparse matrix storage formats. We proposed a coloring algorithm tailored to structured grids; applied to HPCG, it clearly improves coloring quality over existing algorithms. By analyzing the transfer overhead between heterogeneous components, we derived a heterogeneous task partitioning scheme, and we stored the sparse matrix in ELL format to reduce memory access overhead (a sketch of this format appears below). Further optimizations included branch elimination, loop unrolling, and data prefetching. In the multi-process implementation, to maximize whole-machine performance, we partitioned each subdomain into inner and outer regions and overlapped neighbor communication with inner-region computation, hiding communication overhead and raising parallel efficiency (see the second sketch below). The final whole-machine computing efficiency reached 1.67% of peak performance, and the weak-scaling parallel efficiency reached 92%. Future work will parallelize and optimize HPCG for other domestic supercomputers, investigate mixed-precision HPCG algorithms, and collaborate with related applications to improve overall application performance.
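    To make the ELL layout concrete, the following C sketch gives a sequential SpMV kernel over an ELL matrix. It is a minimal illustration under our own naming, not the paper's accelerator kernel: ell_matrix and ell_spmv are hypothetical names, as is the convention of padding with val = 0.0 and col = 0.

```c
#include <stddef.h>

/* ELL (ELLPACK) storage: every row is padded to a fixed width K
 * (27 for HPCG's 27-point stencil), so the values and column
 * indices form two dense n*K arrays with a fixed inner trip count.
 * Padded slots hold val = 0.0 and col = 0; they contribute nothing
 * to the sum, so the kernel needs no per-entry branch (one form of
 * the branch elimination mentioned in the text). */
typedef struct {
    int     n;    /* number of rows                    */
    int     K;    /* fixed (padded) nonzeros per row   */
    double *val;  /* n*K values, row-major             */
    int    *col;  /* n*K column indices, 0 for padding */
} ell_matrix;

/* y = A * x over the ELL matrix. */
static void ell_spmv(const ell_matrix *A, const double *x, double *y)
{
    for (int i = 0; i < A->n; ++i) {
        const double *v = A->val + (size_t)i * A->K;
        const int    *c = A->col + (size_t)i * A->K;
        double sum = 0.0;
        for (int k = 0; k < A->K; ++k)  /* fixed trip count: unrollable */
            sum += v[k] * x[c[k]];
        y[i] = sum;
    }
}
```

    The fixed row width is what makes the loop unrolling and data prefetching mentioned above straightforward; the cost is the storage and arithmetic spent on padded slots, which stays small for HPCG's near-uniform 27-point stencil.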
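    The inner/outer region overlap in the multi-process implementation can likewise be sketched with nonblocking MPI. This is only a schematic under assumed interfaces: pack_halo, spmv_inner, and spmv_boundary are hypothetical placeholders for the local kernels, and the neighbor/offset arrays stand in for whatever halo-exchange bookkeeping the real code uses.

```c
#include <mpi.h>

/* Hypothetical helpers standing in for the paper's kernels:
 * spmv_inner() touches only rows whose stencil stays inside the
 * local subdomain; spmv_boundary() also needs received halo data. */
void pack_halo(const double *x, double *sendbuf);
void spmv_inner(const double *x, double *y);
void spmv_boundary(const double *x, const double *halo, double *y);

void overlapped_spmv(const double *x, double *y,
                     double *sendbuf, double *halo,
                     int nneigh, const int *neigh,
                     const int *scount, const int *rcount,
                     const int *soff, const int *roff,
                     MPI_Comm comm)
{
    MPI_Request req[2 * 64];  /* assumes nneigh <= 64 */
    int r = 0;

    /* 1. Post nonblocking halo exchange with every neighbor rank. */
    pack_halo(x, sendbuf);
    for (int p = 0; p < nneigh; ++p) {
        MPI_Irecv(halo + roff[p], rcount[p], MPI_DOUBLE,
                  neigh[p], 0, comm, &req[r++]);
        MPI_Isend(sendbuf + soff[p], scount[p], MPI_DOUBLE,
                  neigh[p], 0, comm, &req[r++]);
    }

    /* 2. Inner-region rows need no remote data: compute them while
     *    the messages are in flight, hiding communication latency. */
    spmv_inner(x, y);

    /* 3. Wait for the halo, then finish the outer (boundary) rows. */
    MPI_Waitall(r, req, MPI_STATUSES_IGNORE);
    spmv_boundary(x, halo, y);
}
```

    Because inner-region rows reference only local data, step 2 runs concurrently with the exchange posted in step 1; only the boundary rows wait on MPI_Waitall, so communication cost is hidden whenever the inner-region work is long enough to cover the exchange.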

