Journal of Software (《软件学报》), 2025, Issue 8


Journal of Software, ISSN 1000-9825, CODEN RUXUEW, E-mail: jos@iscas.ac.cn
2025,36(8):3896−3916 [doi: 10.13328/j.cnki.jos.007244] [CSTR: 32375.14.jos.007244] http://www.jos.org.cn
© Institute of Software, Chinese Academy of Sciences. All rights reserved. Tel: +86-10-62562563

MTTorch: Implementation and Optimization of a PyTorch Operator Library for the MT-3000 Chip and Transformer Models *

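WANG Hao-Tian 1, SUN Yu-Fei 1, SUI Yi-Cheng 1, WANG Jia-Hao 1, SHI Chang-Qing 1, FANG Jian-Bin 2, ZHANG Yu-Zhi 1

1 (College of Software, Nankai University, Tianjin 300450, China)
2 (College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan 410073, China)
Corresponding author: SUN Yu-Fei, E-mail: yufei_sun@sina.com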

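Abstract: With the rapid development of Transformer-based large models, computing power has gradually become a bottleneck
for the field, and how to accelerate and optimize the training of large language models according to the structural
characteristics of accelerator hardware has become a research hotspot. Targeting MT-3000, the accelerator chip of the
new-generation Tianhe supercomputing system, this work proposes and implements MTTorch, a PyTorch extension library for the
CPU+DSP heterogeneous architecture. Its core is a multi-core parallel operator library that provides vectorized
implementations and optimizations of the core operators used in training Transformer-based models. In addition, based on the
architectural characteristics of MT-3000, a high-performance reduction algorithm and a ping-pong algorithm for multi-core
DSPs are proposed, which significantly improve operator performance. MTTorch is also highly general: different versions of
PyTorch can load it as a dynamic link library, without any change to PyTorch's native implementation. Extensive experiments
show that the implemented core operators perform well on the MT-3000 chip, reaching a speedup of up to 8x on a single DSP
cluster. When MTTorch is used to run training tasks on multiple nodes, the speedup is close to linear, which greatly improves
the training efficiency of Transformer-based models on the MT-3000 chip.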
Key words: PyTorch; high-performance computing; Transformer model; Tianhe supercomputer; CPU+DSP heterogeneous computing; software ecosystem
CLC number: TP303

Citation format (Chinese): Wang HT, Sun YF, Sui YC, Wang JH, Shi CQ, Fang JB, Zhang YZ. MTTorch: Implementation and
optimization of a PyTorch operator library for the MT-3000 chip and Transformer models. Journal of Software, 2025, 36(8):
3896–3916. http://www.jos.org.cn/1000-9825/7244.htm
Citation format (English): Wang HT, Sun YF, Sui YC, Wang JH, Shi CQ, Fang JB, Zhang YZ. MTTorch: PyTorch Arithmetic Library
Implementation and Optimization for MT-3000 Chip and Transformer Model. Ruan Jian Xue Bao/Journal of Software,
2025, 36(8): 3896–3916 (in Chinese). http://www.jos.org.cn/1000-9825/7244.htm

MTTorch: PyTorch Arithmetic Library Implementation and Optimization for MT-3000 Chip
and Transformer Model
WANG Hao-Tian 1, SUN Yu-Fei 1, SUI Yi-Cheng 1, WANG Jia-Hao 1, SHI Chang-Qing 1, FANG Jian-Bin 2, ZHANG Yu-Zhi 1
1 (College of Software, Nankai University, Tianjin 300450, China)
2 (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
Abstract: With the rapid development of Transformer-based large models, computing power has gradually become a bottleneck in the
development of this field. How to accelerate and optimize the training performance of large language models
according to the structural characteristics of accelerator hardware has become a research hotspot. This study proposes and implements MTTorch, a PyTorch extension library
for the CPU+DSP heterogeneous architecture, which is applicable to the MT-3000 accelerator chip of the new generation of the Tianhe
supercomputer. The core of MTTorch is a multi-core parallel operator library that vectorizes and optimizes the core operators during the
training of Transformer-based models. Additionally, this study innovatively proposes a high-performance reduction algorithm and a ping-

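* Funding: National Key Research and Development Program of China (2021YFB0300104); Science and Technology Project of the
Haihe Laboratory of Advanced Computing and Key Software (22HHXCJC00001); Qiyuan Laboratory Innovation Fund (2022-JCJO-LA-001-068)
Received: 2023-11-13; Revised: 2024-03-13, 2024-05-13, 2024-06-14; Accepted: 2024-06-25; Published online by JOS: 2024-12-31
CNKI online-first publication: 2025-01-02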
