                 [19]   The Mosaic Research Team. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs. 2023. https://www.
                      databricks.com/blog/mpt-7b
                 [20]   Ma ZX, Zhai JD, Han WT, Chen WG, Zheng WM. Challenges and measures for efficient training of trillion-parameter pre-trained
                      models. ZTE Technology Journal, 2022, 28(2): 51–58 (in Chinese with English abstract). [doi: 10.12142/ZTETJ.202202008]
                 [21]   Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Davison J, Shleifer S, von Platen
                      P, Ma C, Jernite Y, Plu J, Xu CW, Le Scao T, Gugger S, Drame M, Lhoest Q, Rush AM. Transformers: State-of-the-art natural language
                      processing. In: Proc. of the 2020 Conf. on Empirical Methods in Natural Language Processing: System Demonstrations. ACL, 2020.
                      38–45. [doi: 10.18653/v1/2020.emnlp-demos.6]
                 [22]   Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin ZM, Gimelshein N, Antiga L, Desmaison A, Köpf A, Yang
                       E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai JJ, Chintala S. PyTorch: An imperative style, high-performance
                       deep learning library. In: Proc. of the 33rd Int’l Conf. on Neural Information Processing Systems. Vancouver: ACM, 2019. 721.
                 [23]   Sergeev A, Del Balso M. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv:1802.05799, 2018.
                 [24]   Huang YP, Cheng YL, Bapna A, Firat O, Chen MX, Chen DH, Lee H, Ngiam J, Le QV, Wu YH, Chen ZF. GPipe: Efficient training of
                      giant neural networks using pipeline parallelism. In: Proc. of the 33rd Int’l Conf. on Neural Information Processing Systems. Vancouver:
                      ACM, 2019. 10.
                 [25]   Li S, Zhao YL, Varma R, Salpekar O, Noordhuis P, Li T, Paszke A, Smith J, Vaughan B, Damania P, Chintala S. PyTorch distributed:
                      Experiences on accelerating data parallel training. Proc. of the VLDB Endowment, 2020, 13(12): 3005–3018. [doi: 10.14778/3415478.
                      3415530]
                 [26]   Shoeybi M, Patwary M, Puri R, LeGresley P, Casper J, Catanzaro B. Megatron-LM: Training multi-billion parameter language models
                      using model parallelism. arXiv:1909.08053, 2019.
                 [27]   Rasley J, Rajbhandari S, Ruwase O, He YX. DeepSpeed: System optimizations enable training deep learning models with over 100
                      billion parameters. In: Proc. of the 26th ACM SIGKDD Int’l Conf. on Knowledge Discovery & Data Mining. ACM, 2020. 3505–3506.
                      [doi: 10.1145/3394486.3406703]
                 [28]   Zheng LM, Li ZH, Zhang H, Zhuang YH, Chen ZF, Huang YP, Wang YD, Xu YZ, Zhuo DY, Xing EP, Gonzalez JE, Stoica I. Alpa:
                       Automating inter- and intra-operator parallelism for distributed deep learning. In: Proc. of the 16th USENIX Symp. on Operating
                       Systems Design and Implementation. Carlsbad: USENIX, 2022. 559–578.
                 [29]   Abadi M, Barham P, Chen JM, Chen ZF, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, Kudlur M, Levenberg J, Monga
                      R, Moore S, Murray DG, Steiner B, Tucker PA, Vasudevan V, Warden P, Wicke M, Yu Y, Zheng XQ. TensorFlow: A system for large-
                      scale machine learning. In: Proc. of the 12th USENIX Symp. on Operating Systems Design and Implementation. Savannah: USENIX,
                      2016. 265–283.
                 [30]   Google. MaxText. GitHub [Internet]. 2023. https://github.com/google/maxtext/
                 [31]   Bradbury J, Frostig R, Hawkins P, Johnson MJ, Leary C, Maclaurin D, Necula G, Paszke A, VanderPlas J, Wanderman-Milne S, Zhang
                      Q. JAX: Composable transformations of Python+NumPy programs. 2018. http://github.com/google/jax
                 [32]   Li DC, Shao RL, Xie AZ, Xing EP, Gonzalez JE, Stoica I, Ma XZ, Zhang H. LightSeq: Sequence level parallelism for distributed
                      training of long context Transformers. arXiv:2310.03294, 2023. [doi: 10.48550/arXiv.2310.03294]
                 [33]   MindNLP Contributors. MindNLP: Easy-to-use and high-performance NLP and LLM framework based on MindSpore [Internet]. 2022.
                      https://github.com/mindlab-ai/mindnlp
                 [34]   Zhao YL, Gu A, Varma R, Luo L, Huang CC, Xu M, Wright L, Shojanazeri H, Ott M, Shleifer S, Desmaison A, Balioglu C, Damania P,
                      Nguyen B, Chauhan G, Hao YC, Mathews A, Li S. PyTorch FSDP: Experiences on scaling fully sharded data parallel. Proc. of the
                      VLDB Endowment, 2023, 16(12): 3848–3860. [doi: 10.14778/3611540.3611569]
                 [35]   Nie XN, Liu Y, Fu FC, Xue JB, Jiao D, Miao XP, Tao YY, Cui B. Angel-PTM: A scalable and economical large-scale pre-training
                       system in Tencent. Proc. of the VLDB Endowment, 2023, 16(12): 3781–3794. [doi: 10.14778/3611540.3611564]
                 [36]   Baines M, Bhosale S, Caggiano V, Goyal N, Goyal S, Ott M, Lefaudeux B, Liptchinsky V, Rabbat M, Shleifer S, Sridhar A. FairScale:
                      A general purpose modular PyTorch library for high performance and large scale training. 2020. https://github.com/facebookresearch/
                      fairscale
                 [37]   Mosaic ML Team. Composer. 2021. https://github.com/mosaicml/composer/
                 [38]   Databricks. Databricks Mosaic: Pioneering AI & open-source research. 2024. https://www.databricks.com/mosaic
                 [39]   Smith S, Patwary M, Norick B, LeGresley P, Rajbhandari S, Casper J, Liu Z, Prabhumoye S, Zerveas G, Korthikanti V, Zhang E, Child
                       R, Aminabadi RY, Bernauer J, Song X, Shoeybi M, He YX, Houston M, Tiwary S, Catanzaro B. Using DeepSpeed and Megatron to
                       train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv:2201.11990, 2022.