Page 492 - Journal of Software (Ruan Jian Xue Bao), 2025, Issue 8

Wang Haotian et al.: MTTorch: Implementation and optimization of a PyTorch operator library for the MT-3000 chip and Transformer models            3915


                     Kelton F, Miller L, Simens M, Askell A, Welinder P, Christiano P, Leike J, Lowe R. Training language models to follow instructions with
                     human feedback. In: Proc. of the 36th Int’l Conf. on Neural Information Processing Systems. New Orleans: Curran Associates Inc., 2022.
                  [4]  Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training. 2018.
                     https://www.semanticscholar.org/paper/Improving-Language-Understanding-by-Generative-Radford-Narasimhan/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035
                  [5]  Tian Z, Chen YF. Performance optimization of molecular dynamics simulation on Sunway TaihuLight system. Ruan Jian Xue
                     Bao/Journal of Software, 2021, 32(9): 2945–2962 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5978.htm
                     [doi: 10.13328/j.cnki.jos.005978]
                  [6]  Hong WJ, Li KL, Quan Z, Yang WD, Li KQ, Hao ZY, Xie XH. PETSc’s heterogeneous parallel algorithm design and performance
                     optimization on the Sunway TaihuLight system. Chinese Journal of Computers, 2017, 40(9): 2057–2069 (in Chinese with English
                     abstract). [doi: 10.11897/SP.J.1016.2017.02057]
                  [7]  Sanders J, Kandrot E. CUDA by Example: An Introduction to General-purpose GPU Programming. Boston: Addison-Wesley
                     Professional, 2010.
                  [8]  Lu K, Wang YH, Guo Y, Huang C, Liu S, Wang RB, Fang JB, Tang T, Chen ZY, Liu BW, Liu Z, Lei YW, Sun HY. MT-3000: A
                     heterogeneous multi-zone processor for HPC. CCF Trans. on High Performance Computing, 2022, 4(2): 150–164.
                     [doi: 10.1007/s42514-022-00095-y]
                  [9]  Wang ZL. Design and implementation of high-performance DMA transmission modes for scientific computation on GPDSP
                     [MS. Thesis]. Changsha: National University of Defense Technology, 2015 (in Chinese with English abstract).
                 [10]  Gong CY, Liu J, Bao WM, Pan DM, Gan XB, Li SG, Chen XG, Xiao TJ, Yang B, Wang RB. Review on ecological construction of
                     domestic high-performance parallel application software in post Moore Era. Journal of System Simulation, 2022, 34(10): 2107–2118 (in
                     Chinese with English abstract). [doi: 10.16182/j.issn1004731x.joss.21-1365]
                 [11]  Fang J, Varbanescu AL, Sips H. A comprehensive performance comparison of CUDA and OpenCL. In: Proc. of the 2011 Int’l Conf. on
                     Parallel Processing. IEEE, 2011. 216–225. [doi: 10.1109/ICPP.2011.45]
                 [12]  Shen J, Fang JB, Sips H, Varbanescu AL. Performance gaps between OpenMP and OpenCL for multi-core CPUs. In: Proc. of the 41st Int’l
                     Conf. on Parallel Processing Workshops. Pittsburgh: IEEE, 2012. 116–125. [doi: 10.1109/ICPPW.2012.18]
                 [13]  Li YY, Xue W, Chen DX, Wang XL, Xu P, Zhang WS, Yang GW. Performance optimization for sparse matrix-vector multiplication on
                     Sunway architecture. Chinese Journal of Computers, 2020, 43(6): 1010–1024 (in Chinese with English abstract).
                     [doi: 10.11897/SP.J.1016.2020.01010]
                 [14]  Yin SF, Wang QL, Hao RC, Zhou TY, Mei SZ, Liu J. Optimizing irregular-shaped matrix-matrix multiplication on multi-core DSPs. In:
                     Proc. of the 2022 IEEE Int’l Conf. on Cluster Computing (CLUSTER). Heidelberg: IEEE, 2022. 451–461.
                     [doi: 10.1109/CLUSTER51413.2022.00055]
                 [15]  Pei XD, Wang QL, Liao LY, Li RC, Mei SZ, Liu J, Pang ZB. Optimizing parallel matrix transpose algorithm on multi-core digital signal
                     processors. Journal of National University of Defense Technology, 2023, 45(1): 57–66 (in Chinese with English abstract).
                     [doi: 10.11887/j.cn.202301006]
                 [16]  Wang QL, Li DS, Mei SZ, Lai ZQ, Dou Y. Optimizing Winograd-based fast convolution algorithm on Phytium multi-core CPUs. Journal
                     of Computer Research and Development, 2020, 57(6): 1140–1151 (in Chinese with English abstract).
                     [doi: 10.7544/issn1000-1239.2020.20200107]
                 [17]  Gu JL, Liu YB, Gao Y, Zhu MH. OpenCL caffe: Accelerating and enabling a cross platform machine learning framework. In: Proc. of the
                     4th Int’l Workshop on OpenCL. Vienna: ACM, 2016. 8. [doi: 10.1145/2909437.2909443]
                 [18]  Chen J, Zhong L. An efficient parallel CNN inference framework for multi-zone processor. In: Proc. of the 24th Int’l Conf. on High
                     Performance Computing & Communications; Proc. of the 8th Int’l Conf. on Data Science & Systems; Proc. of the 20th Int’l Conf. on
                     Smart City; Proc. of the 8th Int’l Conf. on Dependability in Sensor, Cloud & Big Data Systems & Application
                     (HPCC/DSS/SmartCity/DependSys). IEEE, 2022. 1350–1357. [doi: 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00210]
                 [19]  Chen R, Sun YF, Cheng DG, Guo Q, Chen YQ, Shi CQ, Sui YC, Zhang YZ, Zhang YZ. Implementation and optimization of OpenCL
                     kernels in TensorFlow. Chinese Journal of Computers, 2022, 45(11): 2456–2474 (in Chinese with English abstract).
                     [doi: 10.11897/SP.J.1016.2022.02456]
                 [20]  Fang JB, Zhang P, Huang C, Tang T, Lu K, Wang RB, Wang Z. Programming bare-metal accelerators with heterogeneous threading
                     models: A case study of Matrix-3000. Frontiers of Information Technology & Electronic Engineering, 2023, 24(4): 509–520.
                     [doi: 10.1631/FITEE.2200359]
                 [21]  Zhang P, Fang JB, Yang CQ, Tang T, Huang C, Wang Z. MOCL: An efficient OpenCL implementation for the Matrix-2000 architecture.