[19] The Mosaic Research Team. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs. 2023. https://www.databricks.com/blog/mpt-7b
[20] Ma ZX, Zhai JD, Han WT, Chen WG, Zheng WM. Challenges and measures for efficient training of trillion-parameter pre-trained
models. ZTE Technology Journal, 2022, 28(2): 51–58 (in Chinese with English abstract). [doi: 10.12142/ZTETJ.202202008]
[21] Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Davison J, Shleifer S, von Platen
P, Ma C, Jernite Y, Plu J, Xu CW, Le Scao T, Gugger S, Drame M, Lhoest Q, Rush AM. Transformers: State-of-the-art natural language
processing. In: Proc. of the 2020 Conf. on Empirical Methods in Natural Language Processing: System Demonstrations. ACL, 2020.
38–45. [doi: 10.18653/v1/2020.emnlp-demos.6]
[22] Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin ZM, Gimelshein N, Antiga L, Desmaison A, Köpf A, Yang
E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai JJ, Chintala S. PyTorch: An imperative style, high-
performance deep learning library. In: Proc. of the 33rd Int’l Conf. on Neural Information Processing Systems. Vancouver: ACM, 2019.
721.
[23] Sergeev A, Del Balso M. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv:1802.05799, 2018.
[24] Huang YP, Cheng YL, Bapna A, Firat O, Chen MX, Chen DH, Lee H, Ngiam J, Le QV, Wu YH, Chen ZF. GPipe: Efficient training of
giant neural networks using pipeline parallelism. In: Proc. of the 33rd Int’l Conf. on Neural Information Processing Systems. Vancouver:
ACM, 2019. 10.
[25] Li S, Zhao YL, Varma R, Salpekar O, Noordhuis P, Li T, Paszke A, Smith J, Vaughan B, Damania P, Chintala S. PyTorch distributed:
Experiences on accelerating data parallel training. Proc. of the VLDB Endowment, 2020, 13(12): 3005–3018. [doi: 10.14778/3415478.3415530]
[26] Shoeybi M, Patwary M, Puri R, LeGresley P, Casper J, Catanzaro B. Megatron-LM: Training multi-billion parameter language models
using model parallelism. arXiv:1909.08053, 2019.
[27] Rasley J, Rajbhandari S, Ruwase O, He YX. DeepSpeed: System optimizations enable training deep learning models with over 100
billion parameters. In: Proc. of the 26th ACM SIGKDD Int’l Conf. on Knowledge Discovery & Data Mining. ACM, 2020. 3505–3506.
[doi: 10.1145/3394486.3406703]
[28] Zheng LM, Li ZH, Zhang H, Zhuang YH, Chen ZF, Huang YP, Wang YD, Xu YZ, Zhuo DY, Xing EP, Gonzalez JE, Stoica I. Alpa:
Automating inter- and intra-operator parallelism for distributed deep learning. In: Proc. of the 16th USENIX Symp. on Operating
Systems Design and Implementation. Carlsbad: USENIX, 2022. 559–578.
[29] Abadi M, Barham P, Chen JM, Chen ZF, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, Kudlur M, Levenberg J, Monga
R, Moore S, Murray DG, Steiner B, Tucker PA, Vasudevan V, Warden P, Wicke M, Yu Y, Zheng XQ. TensorFlow: A system for large-
scale machine learning. In: Proc. of the 12th USENIX Symp. on Operating Systems Design and Implementation. Savannah: USENIX,
2016. 265–283.
[30] Google. MaxText. GitHub [Internet]. 2023. https://github.com/google/maxtext/
[31] Bradbury J, Frostig R, Hawkins P, Johnson MJ, Leary C, Maclaurin D, Necula G, Paszke A, VanderPlas J, Wanderman-Milne S, Zhang
Q. JAX: Composable transformations of Python+NumPy programs. 2018. http://github.com/google/jax
[32] Li DC, Shao RL, Xie AZ, Xing EP, Gonzalez JE, Stoica I, Ma XZ, Zhang H. LightSeq: Sequence level parallelism for distributed
training of long context Transformers. arXiv:2310.03294, 2023. [doi: 10.48550/arXiv.2310.03294]
[33] MindNLP Contributors. MindNLP: Easy-to-use and high-performance NLP and LLM framework based on MindSpore [Internet]. 2022.
https://github.com/mindlab-ai/mindnlp
[34] Zhao YL, Gu A, Varma R, Luo L, Huang CC, Xu M, Wright L, Shojanazeri H, Ott M, Shleifer S, Desmaison A, Balioglu C, Damania P,
Nguyen B, Chauhan G, Hao YC, Mathews A, Li S. PyTorch FSDP: Experiences on scaling fully sharded data parallel. Proc. of the
VLDB Endowment, 2023, 16(12): 3848–3860. [doi: 10.14778/3611540.3611569]
[35] Nie XN, Liu Y, Fu FC, Xue JB, Jiao D, Miao XP, Tao YY, Cui B. Angel-PTM: A scalable and economical large-scale pre-training
system in Tencent. Proc. of the VLDB Endowment, 2023, 16(12): 3781–3794. [doi: 10.14778/3611540.3611564]
[36] Baines M, Bhosale S, Caggiano V, Goyal N, Goyal S, Ott M, Lefaudeux B, Liptchinsky V, Rabbat M, Shleifer S, Sridhar A. FairScale:
A general purpose modular PyTorch library for high performance and large scale training. 2020. https://github.com/facebookresearch/fairscale
[37] Mosaic ML Team. Composer. 2021. https://github.com/mosaicml/composer/
[38] Databricks. Databricks Mosaic: Pioneering AI & open-source research. 2024. https://www.databricks.com/mosaic
[39] Smith S, Patwary M, Norick B, LeGresley P, Rajbhandari S, Casper J, Liu Z, Prabhumoye S, Zerveas G, Korthikanti V, Zhang E, Child
R, Aminabadi RY, Bernauer J, Song X, Shoeybi M, He YX, Houston M, Tiwary S, Catanzaro B. Using DeepSpeed and Megatron to

