train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv:2201.11990, 2022.
[40] Rajbhandari S, Rasley J, Ruwase O, He YX. ZeRO: Memory optimizations toward training trillion parameter models. In: Proc. of the
Int’l Conf. for High Performance Computing, Networking, Storage and Analysis. Atlanta: ACM, 2020. 20.
[41] Ren J, Rajbhandari S, Aminabadi RY, Ruwase O, Yang SY, Zhang MJ, Li D, He YX. ZeRO-Offload: Democratizing billion-scale model
training. In: Proc. of the 2021 USENIX Annual Technical Conf. USENIX, 2021. 551–564.
[42] Aicheng Technology, Alibaba Group. Megatron-LLaMA [Internet]. 2023. https://github.com/alibaba/Megatron-LLaMA
[43] Li SG, Liu HX, Bian ZD, Fang JR, Huang HC, Liu YL, Wang BX, You Y. Colossal-AI: A unified deep learning system for large-scale
parallel training. In: Proc. of the 52nd Int’l Conf. on Parallel Processing. Salt Lake City: ACM, 2023. 766–775. [doi: 10.1145/3605573.
3605613]
[44] Jiang ZH, Lin HB, Zhong YM, et al. MegaScale: Scaling large language model training to more than 10 000 GPUs. In: Proc. of the 21st
USENIX Symp. on Networked Systems Design and Implementation. Santa Clara: USENIX, 2024. 745–760.
[45] PaddleNLP Contributors. PaddleNLP: An easy-to-use and high performance NLP library. 2024. https://github.com/PaddlePaddle/
PaddleNLP
[46] Lin ZQ, Miao YS, Zhang QL, Yang F, Zhu Y, Li C, Maleki S, Cao X, Shang N, Yang YL, Xu WJ, Yang M, Zhang LT, Zhou LD.
nnScaler: Constraint-guided parallelization plan generation for deep learning training. In: Proc. of the 18th USENIX Symp. on Operating
Systems Design and Implementation. Santa Clara: USENIX, 2024. 347–363.
[47] Patarasuk P, Yuan X. Bandwidth optimal All-Reduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing,
2009, 69(2): 117–124. [doi: 10.1016/j.jpdc.2008.09.002]
[48] Ott M, Edunov S, Baevski A, Fan A, Gross S, Ng N, Grangier D, Auli M. fairseq: A fast, extensible toolkit for sequence modeling. In:
Proc. of the 2019 Conf. of the North American Chapter of the Association for Computational Linguistics. Minneapolis: ACL, 2019.
48–53. [doi: 10.18653/v1/N19-4009]
[49] Ma SM, Wang HY, Huang SH, Wang WH, Chi ZW, Dong L, Benhaim A, Patra B, Chaudhary V, Song X, Wei FR. TorchScale:
Transformers at scale. arXiv:2211.13184, 2022.
[50] Narayanan D, Harlap A, Phanishayee A, Seshadri V, Devanur NR, Ganger GR, Gibbons PB, Zaharia M. PipeDream: Generalized
pipeline parallelism for DNN training. In: Proc. of the 27th ACM Symp. on Operating Systems Principles. Huntsville: ACM, 2019.
1–15. [doi: 10.1145/3341301.3359646]
[51] Narayanan D, Shoeybi M, Casper J, LeGresley P, Patwary M, Korthikanti V, Vainbrand D, Kashinkunti P, Bernauer J, Catanzaro B,
Phanishayee A, Zaharia M. Efficient large-scale language model training on GPU clusters using Megatron-LM. In: Proc. of the 2021 Int’l
Conf. for High Performance Computing, Networking, Storage and Analysis. St. Louis: ACM, 2021. 1–15. [doi: 10.1145/3458817.
3476209]
[52] Li SG, Hoefler T. Chimera: Efficiently training large-scale neural networks with bidirectional pipelines. In: Proc. of the 2021 Int’l Conf.
for High Performance Computing, Networking, Storage and Analysis. St. Louis: ACM, 2021. 27. [doi: 10.1145/3458817.3476145]
[53] Choi S, Koo I, Ahn J, Jeon M, Kwon Y. EnvPipe: Performance-preserving DNN training framework for saving energy. In: Proc. of the
2023 USENIX Annual Technical Conf. Boston: USENIX, 2023. 851–864.
[54] Osawa K, Li SG, Hoefler T. PipeFisher: Efficient training of large language models using pipelining and Fisher information matrices.
In: Proc. of the 6th Conf. on Machine Learning and Systems. Miami: MLSys, 2023. 708–727.
[55] Lu WY, Yan GH, Li JJ, Gong SJ, Han YH, Li XW. FlexFlow: A flexible dataflow accelerator architecture for convolutional neural
networks. In: Proc. of the 2017 IEEE Int’l Symp. on High Performance Computer Architecture (HPCA). Austin: IEEE, 2017. 553–564.
[doi: 10.1109/HPCA.2017.29]
[56] Beaumont O, Eyraud-Dubois L, Herrmann J, Joly A, Shilova A. Optimal re-materialization strategies for heterogeneous chains: How to
train deep neural networks with limited memory. ACM Trans. on Mathematical Software, 2024, 50(2): 10. [doi: 10.1145/3648633]
[57] Korthikanti VA, Casper J, Lym S, McAfee L, Andersch M, Shoeybi M, Catanzaro B. Reducing activation recomputation in large
Transformer models. In: Proc. of the 6th Conf. on Machine Learning and Systems. Miami: MLSys, 2023.
[58] Rajbhandari S, Ruwase O, Rasley J, Smith S, He YX. ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning.
In: Proc. of the 2021 Int’l Conf. for High Performance Computing, Networking, Storage and Analysis. St. Louis: ACM, 2021. 59. [doi:
10.1145/3458817.3476205]
[59] Yuan TL, Liu YL, Ye XC, Zhang SL, Tan JC, Chen B, Song CR, Zhang D. Accelerating the training of large language models using
efficient activation rematerialization and optimal hybrid parallelism. In: Proc. of the 2024 USENIX Annual Technical Conf. Santa Clara:
USENIX, 2024. 545–561.
[60] Yuksel SE, Wilson JN, Gader PD. Twenty years of mixture of experts. IEEE Trans. on Neural Networks and Learning Systems, 2012,

