Page 492 - Journal of Software (Ruan Jian Xue Bao), 2025, Issue 8

Wang Haotian et al.: MTTorch: Implementation and optimization of a PyTorch operator library for the MT-3000 chip and Transformer models 3915

Kelton F, Miller L, Simens M, Askell A, Welinder P, Christiano P, Leike J, Lowe R. Training language models to follow instructions with
human feedback. In: Proc. of the 36th Int’l Conf. on Neural Information Processing Systems. New Orleans: Curran Associates Inc., 2022.
[4] Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training. 2018.
https://www.semanticscholar.org/paper/Improving-Language-Understanding-by-Generative-Radford-Narasimhan/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035
[5] Tian Z, Chen YF. Performance optimization of molecular dynamics simulation on Sunway TaihuLight system. Ruan Jian Xue
Bao/Journal of Software, 2021, 32(9): 2945–2962 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/5978.htm
[doi: 10.13328/j.cnki.jos.005978]
[6] Hong WJ, Li KL, Quan Z, Yang WD, Li KQ, Hao ZY, Xie XH. PETSc’s heterogeneous parallel algorithm design and performance
optimization on the Sunway TaihuLight system. Chinese Journal of Computers, 2017, 40(9): 2057–2069 (in Chinese with English
abstract). [doi: 10.11897/SP.J.1016.2017.02057]
[7] Sanders J, Kandrot E. CUDA by Example: An Introduction to General-purpose GPU Programming. Boston: Addison-Wesley
Professional, 2010.
[8] Lu K, Wang YH, Guo Y, Huang C, Liu S, Wang RB, Fang JB, Tang T, Chen ZY, Liu BW, Liu Z, Lei YW, Sun HY. MT-3000: A
heterogeneous multi-zone processor for HPC. CCF Trans. on High Performance Computing, 2022, 4(2): 150–164.
[doi: 10.1007/s42514-022-00095-y]
[9] Wang ZL. Design and implementation of high-performance DMA transmission modes for scientific computation on GPDSP [MS.
Thesis]. Changsha: National University of Defense Technology, 2015 (in Chinese with English abstract).
[10] Gong CY, Liu J, Bao WM, Pan DM, Gan XB, Li SG, Chen XG, Xiao TJ, Yang B, Wang RB. Review on ecological construction of
domestic high-performance parallel application software in post Moore Era. Journal of System Simulation, 2022, 34(10): 2107–2118 (in
Chinese with English abstract). [doi: 10.16182/j.issn1004731x.joss.21-1365]
[11] Fang J, Varbanescu AL, Sips H. A comprehensive performance comparison of CUDA and OpenCL. In: Proc. of the 2011 Int’l Conf. on
Parallel Processing. IEEE, 2011. 216–225. [doi: 10.1109/ICPP.2011.45]
[12] Shen J, Fang JB, Sips H, Varbanescu AL. Performance gaps between OpenMP and OpenCL for multi-core CPUs. In: Proc. of the 41st Int’l
Conf. on Parallel Processing Workshops. Pittsburgh: IEEE, 2012. 116–125. [doi: 10.1109/ICPPW.2012.18]
[13] Li YY, Xue W, Chen DX, Wang XL, Xu P, Zhang WS, Yang GW. Performance optimization for sparse matrix-vector multiplication on
Sunway architecture. Chinese Journal of Computers, 2020, 43(6): 1010–1024 (in Chinese with English abstract).
[doi: 10.11897/SP.J.1016.2020.01010]
[14] Yin SF, Wang QL, Hao RC, Zhou TY, Mei SZ, Liu J. Optimizing irregular-shaped matrix-matrix multiplication on multi-core DSPs. In:
Proc. of the 2022 IEEE Int’l Conf. on Cluster Computing (CLUSTER). Heidelberg: IEEE, 2022. 451–461.
[doi: 10.1109/CLUSTER51413.2022.00055]
[15] Pei XD, Wang QL, Liao LY, Li RC, Mei SZ, Liu J, Pang ZB. Optimizing parallel matrix transpose algorithm on multi-core digital signal
processors. Journal of National University of Defense Technology, 2023, 45(1): 57–66 (in Chinese with English abstract).
[doi: 10.11887/j.cn.202301006]
[16] Wang QL, Li DS, Mei SZ, Lai ZQ, Dou Y. Optimizing Winograd-based fast convolution algorithm on Phytium multi-core CPUs. Journal
of Computer Research and Development, 2020, 57(6): 1140–1151 (in Chinese with English abstract).
[doi: 10.7544/issn1000-1239.2020.20200107]
[17] Gu JL, Liu YB, Gao Y, Zhu MH. OpenCL caffe: Accelerating and enabling a cross platform machine learning framework. In: Proc. of the
4th Int’l Workshop on OpenCL. Vienna: ACM, 2016. 8. [doi: 10.1145/2909437.2909443]
[18] Chen J, Zhong L. An efficient parallel CNN inference framework for multi-zone processor. In: Proc. of the 24th Int’l Conf. on High
Performance Computing & Communications; Proc. of the 8th Int’l Conf. on Data Science & Systems; Proc. of the 20th Int’l Conf. on
Smart City; Proc. of the 8th Int’l Conf. on Dependability in Sensor, Cloud & Big Data Systems & Application
(HPCC/DSS/SmartCity/DependSys). IEEE, 2022. 1350–1357. [doi: 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00210]
[19] Chen R, Sun YF, Cheng DG, Guo Q, Chen YQ, Shi CQ, Sui YC, Zhang YZ, Zhang YZ. Implementation and optimization of OpenCL
kernels in TensorFlow. Chinese Journal of Computers, 2022, 45(11): 2456–2474 (in Chinese with English abstract).
[doi: 10.11897/SP.J.1016.2022.02456]
[20] Fang JB, Zhang P, Huang C, Tang T, Lu K, Wang RB, Wang Z. Programming bare-metal accelerators with heterogeneous threading
models: A case study of Matrix-3000. Frontiers of Information Technology & Electronic Engineering, 2023, 24(4): 509–520.
[doi: 10.1631/FITEE.2200359]
[21] Zhang P, Fang JB, Yang CQ, Tang T, Huang C, Wang Z. MOCL: An efficient OpenCL implementation for the Matrix-2000 architecture.
