Journal of Software (软件学报), 2026, Vol. 37, No. 1, p. 226


                      train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv:2201.11990, 2022.
                 [40]   Rajbhandari S, Rasley J, Ruwase O, He YX. ZeRO: Memory optimizations toward training trillion parameter models. In: Proc. of the
                      Int’l Conf. for High Performance Computing, Networking, Storage and Analysis. Atlanta: ACM, 2020. 20.
                 [41]   Ren J, Rajbhandari S, Aminabadi RY, Ruwase O, Yang SY, Zhang MJ, Li D, He YX. ZeRO-offload: Democratizing billion-scale model
                      training. In: Proc. of the 2021 USENIX Annual Technical Conf. USENIX, 2021. 551–564.
                 [42]   Aicheng Technology, Alibaba Group. Megatron-LLaMA [Internet]. 2023. https://github.com/alibaba/Megatron-LLaMA
                 [43]   Li SG, Liu HX, Bian ZD, Fang JR, Huang HC, Liu YL, Wang BX, You Y. Colossal-AI: A unified deep learning system for large-scale
                      parallel training. In: Proc. of the 52nd Int’l Conf. on Parallel Processing. Salt Lake City: ACM, 2023. 766–775. [doi: 10.1145/3605573.
                      3605613]
                 [44]   Jiang ZH, Lin HB, Zhong YM, et al. MegaScale: Scaling large language model training to more than 10,000 GPUs. In: Proc. of the 21st
                      USENIX Symp. on Networked Systems Design and Implementation. Santa Clara: USENIX, 2024. 745–760.
                 [45]   PaddleNLP Contributors. PaddleNLP: An easy-to-use and high-performance NLP library. 2024. https://github.com/PaddlePaddle/
                      PaddleNLP
                 [46]   Lin ZQ, Miao YS, Zhang QL, Yang F, Zhu Y, Li C, Maleki S, Cao X, Shang N, Yang YL, Xu WJ, Yang M, Zhang LT, Zhou LD.
                      nnScaler: Constraint-guided parallelization plan generation for deep learning training. In: Proc. of the 18th USENIX Symp. on Operating
                      Systems Design and Implementation. Santa Clara: USENIX, 2024. 347–363.
                 [47]   Patarasuk P, Yuan X. Bandwidth optimal all-reduce algorithms for clusters of workstations. Journal of Parallel and Distributed
                      Computing, 2009, 69(2): 117–124. [doi: 10.1016/j.jpdc.2008.09.002]
                 [48]   Ott M, Edunov S, Baevski A, Fan A, Gross S, Ng N, Grangier D, Auli M. fairseq: A fast, extensible toolkit for sequence modeling. In:
                      Proc. of the 2019 Conf. of the North American Chapter of the Association for Computational Linguistics. Minneapolis: ACL, 2019.
                      48–53. [doi: 10.18653/v1/N19-4009]
                 [49]   Ma SM, Wang HY, Huang SH, Wang WH, Chi ZW, Dong L, Benhaim A, Patra B, Chaudhary V, Song X, Wei FR. TorchScale:
                      Transformers at scale. arXiv:2211.13184, 2022.
                 [50]   Narayanan D, Harlap A, Phanishayee A, Seshadri V, Devanur NR, Ganger GR, Gibbons PB, Zaharia M. PipeDream: Generalized
                      pipeline parallelism for DNN training. In: Proc. of the 27th ACM Symp. on Operating Systems Principles. Huntsville: ACM, 2019.
                      1–15. [doi: 10.1145/3341301.3359646]
                 [51]   Narayanan D, Shoeybi M, Casper J, LeGresley P, Patwary M, Korthikanti V, Vainbrand D, Kashinkunti P, Bernauer J, Catanzaro B,
                      Phanishayee A, Zaharia M. Efficient large-scale language model training on GPU clusters using Megatron-LM. In: Proc. of the 2021 Int’l
                      Conf. for High Performance Computing, Networking, Storage and Analysis. St. Louis: ACM, 2021. 1–15. [doi: 10.1145/3458817.
                      3476209]
                 [52]   Li SG, Hoefler T. Chimera: Efficiently training large-scale neural networks with bidirectional pipelines. In: Proc. of the 2021 Int’l Conf.
                      for High Performance Computing, Networking, Storage and Analysis. St. Louis: ACM, 2021. 27. [doi: 10.1145/3458817.3476145]
                 [53]   Choi S, Koo I, Ahn J, Jeon M, Kwon Y. EnvPipe: Performance-preserving DNN training framework for saving energy. In: Proc. of the
                      2023 USENIX Annual Technical Conf. Boston: USENIX, 2023. 851–864.
                 [54]   Osawa K, Li SG, Hoefler T. PipeFisher: Efficient training of large language models using pipelining and Fisher information matrices.
                      In: Proc. of the 6th Conf. on Machine Learning and Systems. Miami: MLSys, 2023. 708–727.
                 [55]   Lu WY, Yan GH, Li JJ, Gong SJ, Han YH, Li XW. FlexFlow: A flexible dataflow accelerator architecture for convolutional neural
                      networks. In: Proc. of the 2017 IEEE Int’l Symp. on High Performance Computer Architecture (HPCA). Austin: IEEE, 2017. 553–564.
                      [doi: 10.1109/HPCA.2017.29]
                 [56]   Beaumont O, Eyraud-Dubois L, Herrmann J, Joly A, Shilova A. Optimal re-materialization strategies for heterogeneous chains: How to
                      train deep neural networks with limited memory. ACM Trans. on Mathematical Software, 2024, 50(2): 10. [doi: 10.1145/3648633]
                 [57]   Korthikanti VA, Casper J, Lym S, McAfee L, Andersch M, Shoeybi M, Catanzaro B. Reducing activation recomputation in large
                      Transformer models. In: Proc. of the 6th Conf. on Machine Learning and Systems. Miami: MLSys, 2023.
                 [58]   Rajbhandari S, Ruwase O, Rasley J, Smith S, He YX. ZeRO-infinity: Breaking the GPU memory wall for extreme scale deep learning.
                      In: Proc. of the 2021 Int’l Conf. for High Performance Computing, Networking, Storage and Analysis. St. Louis: ACM, 2021. 59. [doi:
                      10.1145/3458817.3476205]
                 [59]   Yuan TL, Liu YL, Ye XC, Zhang SL, Tan JC, Chen B, Song CR, Zhang D. Accelerating the training of large language models using
                      efficient activation rematerialization and optimal hybrid parallelism. In: Proc. of the 2024 USENIX Annual Technical Conf. Santa Clara:
                      USENIX, 2024. 545–561.
                 [60]   Yuksel SE, Wilson JN, Gader PD. Twenty years of mixture of experts. IEEE Trans. on Neural Networks and Learning Systems, 2012,