     23(8): 1177–1193. [doi: 10.1109/TNNLS.2012.2200299]
[61] Hwang C, Cui W, Xiong YF, Yang ZY, Liu Z, Hu H, Wang ZL, Salas R, Jose J, Ram P, Chau H, Cheng P, Yang F, Yang M, Xiong YQ. Tutel: Adaptive mixture-of-experts at scale. In: Proc. of the 6th Conf. on Machine Learning and Systems. Miami: MLSys, 2023.
[62] Li JM, Jiang YM, Zhu YB, Wang C, Xu H. Accelerating distributed MoE training and inference with Lina. In: Proc. of the 2023 USENIX Annual Technical Conf. Boston: USENIX, 2023. 945–959.
[63] Zhai MS, He JA, Ma ZX, Zong Z, Zhang RQ, Zhai JD. SmartMoE: Efficiently training sparsely-activated models through combining offline and online parallelization. In: Proc. of the 2023 USENIX Annual Technical Conf. Boston: USENIX, 2023. 961–975.
[64] Ivanov A, Dryden N, Ben-Nun T, Li SG, Hoefler T. Data movement is all you need: A case study on optimizing Transformers. In: Proc. of the 4th Conf. on Machine Learning and Systems. MLSys, 2021. 711–732.
[65] Williams S, Waterman A, Patterson D. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 2009, 52(4): 65–76. [doi: 10.1145/1498765.1498785]
[66] Dao T, Fu DY, Ermon S, Rudra A, Ré C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In: Proc. of the 36th Int’l Conf. on Neural Information Processing Systems. New Orleans: ACM, 2022. 1189.
[67] Dao T. FlashAttention-2: Faster attention with better parallelism and work partitioning. In: Proc. of the 12th Int’l Conf. on Learning Representations. Vienna: ICLR, 2024.
[68] Beltagy I, Peters ME, Cohan A. Longformer: The long-document Transformer. arXiv:2004.05150, 2020.
[69] Kitaev N, Kaiser Ł, Levskaya A. Reformer: The efficient Transformer. In: Proc. of the 8th Int’l Conf. on Learning Representations. Addis Ababa: ICLR, 2020.
[70] Wang SN, Li BZ, Khabsa M, Fang H, Ma H. Linformer: Self-attention with linear complexity. arXiv:2006.04768, 2020.
[71] Zhu C, Ping W, Xiao CW, Shoeybi M, Goldstein T, Anandkumar A, Catanzaro B. Long-short Transformer: Efficient Transformers for language and vision. In: Proc. of the 35th Int’l Conf. on Neural Information Processing Systems. ACM, 2021. 17723–17736.
[72] Micikevicius P, Narang S, Alben J, Diamos G, Elsen E, Garcia D, Ginsburg B, Houston M, Kuchaiev O, Venkatesh G, Wu H. Mixed precision training. arXiv:1710.03740, 2017.
[73] Liu ZC, Oguz B, Zhao CS, Chang E, Stock P, Mehdad Y, Shi YY, Krishnamoorthi R, Chandra V. LLM-QAT: Data-free quantization aware training for large language models. In: Proc. of the Findings of the Association for Computational Linguistics. Bangkok: ACL, 2024. 467–484. [doi: 10.18653/v1/2024.findings-acl.26]
[74] Arshia FZ, Keyvanrad MA, Sadidpour SS, Mohammadi SMR. PeQA: A massive Persian question-answering and chatbot dataset. In: Proc. of the 12th Int’l Conf. on Computer and Knowledge Engineering (ICCKE). Mashhad: IEEE, 2022. 392–397. [doi: 10.1109/ICCKE57176.2022.9960071]
[75] Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: Efficient finetuning of quantized LLMs. In: Proc. of the 37th Int’l Conf. on Neural Information Processing Systems. New Orleans: ACM, 2023. 441.
[76] Wang WY, Khazraee M, Zhong ZZ, Ghobadi M, Jia ZH, Mudigere D, Zhang Y, Kewitsch A. TopoOpt: Co-optimizing network topology and parallelization strategy for distributed training jobs. In: Proc. of the 20th USENIX Symp. on Networked Systems Design and Implementation. Boston: USENIX, 2023. 739–767.
[77] Cowan M, Maleki S, Musuvathi M, Saarikivi O, Xiong YF. MSCCLang: Microsoft collective communication language. In: Proc. of the 28th ACM Int’l Conf. on Architectural Support for Programming Languages and Operating Systems. Vancouver: ACM, 2023. 502–514. [doi: 10.1145/3575693.3575724]
[78] Cho M, Finkler U, Kung DS, Hunter HC. BlueConnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy. In: Proc. of the 2nd Conf. on Machine Learning and Systems. Stanford: SysML, 2019. 241–251.
[79] Luo L, West P, Krishnamurthy A, Ceze L, Nelson J. PLink: Discovering and exploiting locality for accelerated distributed training on the public cloud. In: Proc. of the 3rd Conf. on Machine Learning and Systems. Austin: SysML, 2020. 82–97.
[80] Wang GH, Venkataraman S, Phanishayee A, Thelin J, Devanur NR, Stoica I. Blink: Fast and generic collectives for distributed ML. In: Proc. of the 3rd Conf. on Machine Learning and Systems. Austin: SysML, 2020. 172–186.
[81] Cai ZX, Liu ZY, Maleki S, Musuvathi M, Mytkowicz T, Nelson J, Saarikivi O. Synthesizing optimal collective algorithms. In: Proc. of the 26th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming. ACM, 2021. 62–75. [doi: 10.1145/3437801.3441620]
[82] Zhuang HP, Wang Y, Liu QL, Lin ZP. Fully decoupled neural network learning using delayed gradients. IEEE Trans. on Neural Networks and Learning Systems, 2022, 33(10): 6013–6020. [doi: 10.1109/TNNLS.2021.3069883]
[83] Hashemi SH, Abdu Jyothi S, Campbell RH. TicTac: Accelerating distributed deep learning with communication scheduling. In: Proc. of the 2nd Conf. on Machine Learning and Systems. Stanford: SysML, 2019. 418–430.