
    Our tests were carried out on a domestic complex heterogeneous supercomputer system, measuring single-node computing efficiency, whole-machine computing efficiency, and whole-machine weak scalability. The single-node computing efficiency reached 1.82% of peak, the whole-machine computing efficiency reached 1.67% of peak, and the whole-machine weak-scaling parallel efficiency reached 92%.

9    Conclusions and Future Work

    Targeting a domestic complex heterogeneous supercomputer, this paper developed a heterogeneous many-core parallel HPCG implementation, investigating coloring algorithms, heterogeneous task partitioning, and sparse matrix storage formats. We proposed a coloring algorithm tailored to structured grids; applied to HPCG, it clearly improves coloring quality over existing algorithms. By analyzing the transfer overhead between heterogeneous components, we derived a heterogeneous task partitioning scheme, and we stored the sparse matrix in ELL format to reduce memory access overhead (a sketch of this format appears below). Further optimizations included branch elimination, loop unrolling, and data prefetching. In the multi-process implementation, to maximize whole-machine performance, we partitioned each subdomain into inner and outer regions and overlapped neighbor communication with inner-region computation, hiding communication overhead and raising parallel efficiency (see the second sketch below). The final whole-machine computing efficiency reached 1.67% of peak performance, and the weak-scaling parallel efficiency reached 92%. Future work will parallelize and optimize HPCG for other domestic supercomputers, investigate mixed-precision HPCG algorithms, and collaborate with related applications to improve overall application performance.
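    To make the ELL layout concrete, the following C sketch gives a sequential SpMV kernel over an ELL matrix. It is a minimal illustration under our own naming, not the paper's accelerator kernel: ell_matrix and ell_spmv are hypothetical names, as is the convention of padding with val = 0.0 and col = 0.

```c
#include <stddef.h>

/* ELL (ELLPACK) storage: every row is padded to a fixed width K
 * (27 for HPCG's 27-point stencil), so the values and column
 * indices form two dense n*K arrays with a fixed inner trip count.
 * Padded slots hold val = 0.0 and col = 0; they contribute nothing
 * to the sum, so the kernel needs no per-entry branch (one form of
 * the branch elimination mentioned in the text). */
typedef struct {
    int     n;    /* number of rows                    */
    int     K;    /* fixed (padded) nonzeros per row   */
    double *val;  /* n*K values, row-major             */
    int    *col;  /* n*K column indices, 0 for padding */
} ell_matrix;

/* y = A * x over the ELL matrix. */
static void ell_spmv(const ell_matrix *A, const double *x, double *y)
{
    for (int i = 0; i < A->n; ++i) {
        const double *v = A->val + (size_t)i * A->K;
        const int    *c = A->col + (size_t)i * A->K;
        double sum = 0.0;
        for (int k = 0; k < A->K; ++k)  /* fixed trip count: unrollable */
            sum += v[k] * x[c[k]];
        y[i] = sum;
    }
}
```

    The fixed row width is what makes the loop unrolling and data prefetching mentioned above straightforward; the cost is the storage and arithmetic spent on padded slots, which stays small for HPCG's near-uniform 27-point stencil.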
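    The inner/outer region overlap in the multi-process implementation can likewise be sketched with nonblocking MPI. This is only a schematic under assumed interfaces: pack_halo, spmv_inner, and spmv_boundary are hypothetical placeholders for the local kernels, and the neighbor/offset arrays stand in for whatever halo-exchange bookkeeping the real code uses.

```c
#include <mpi.h>

/* Hypothetical helpers standing in for the paper's kernels:
 * spmv_inner() touches only rows whose stencil stays inside the
 * local subdomain; spmv_boundary() also needs received halo data. */
void pack_halo(const double *x, double *sendbuf);
void spmv_inner(const double *x, double *y);
void spmv_boundary(const double *x, const double *halo, double *y);

void overlapped_spmv(const double *x, double *y,
                     double *sendbuf, double *halo,
                     int nneigh, const int *neigh,
                     const int *scount, const int *rcount,
                     const int *soff, const int *roff,
                     MPI_Comm comm)
{
    MPI_Request req[2 * 64];  /* assumes nneigh <= 64 */
    int r = 0;

    /* 1. Post nonblocking halo exchange with every neighbor rank. */
    pack_halo(x, sendbuf);
    for (int p = 0; p < nneigh; ++p) {
        MPI_Irecv(halo + roff[p], rcount[p], MPI_DOUBLE,
                  neigh[p], 0, comm, &req[r++]);
        MPI_Isend(sendbuf + soff[p], scount[p], MPI_DOUBLE,
                  neigh[p], 0, comm, &req[r++]);
    }

    /* 2. Inner-region rows need no remote data: compute them while
     *    the messages are in flight, hiding communication latency. */
    spmv_inner(x, y);

    /* 3. Wait for the halo, then finish the outer (boundary) rows. */
    MPI_Waitall(r, req, MPI_STATUSES_IGNORE);
    spmv_boundary(x, halo, y);
}
```

    Because inner-region rows reference only local data, step 2 runs concurrently with the exchange posted in step 1; only the boundary rows wait on MPI_Waitall, so communication cost is hidden whenever the inner-region work is long enough to cover the exchange.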

