23(8): 1177–1193. [doi: 10.1109/TNNLS.2012.2200299]
[61] Hwang C, Cui W, Xiong YF, Yang ZY, Liu Z, Hu H, Wang ZL, Salas R, Jose J, Ram P, Chau H, Cheng P, Yang F, Yang M, Xiong YQ.
Tutel: Adaptive mixture-of-experts at scale. In: Proc. of the 6th Conf. on Machine Learning and Systems. Miami: MLSys, 2023.
[62] Li JM, Jiang YM, Zhu YB, Wang C, Xu H. Accelerating distributed MoE training and inference with Lina. In: Proc. of the 2023
USENIX Annual Technical Conf. Boston: USENIX, 2023. 945–959.
[63] Zhai MS, He JA, Ma ZX, Zong Z, Zhang RQ, Zhai JD. SmartMoE: Efficiently training sparsely-activated models through combining
offline and online parallelization. In: Proc. of the 2023 USENIX Annual Technical Conf. Boston: USENIX, 2023. 961–975.
[64] Ivanov A, Dryden N, Ben-Nun T, Li SG, Hoefler T. Data movement is all you need: A case study on optimizing Transformers. In: Proc.
of the 4th Conf. on Machine Learning and Systems. MLSys, 2021. 711–732.
[65] Williams S, Waterman A, Patterson D. Roofline: An insightful visual performance model for multicore architectures. Communications
of the ACM, 2009, 52(4): 65–76. [doi: 10.1145/1498765.1498785]
[66] Dao T, Fu DY, Ermon S, Rudra A, Ré C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In: Proc. of the
36th Int’l Conf. on Neural Information Processing Systems. New Orleans: ACM, 2022. 1189.
[67] Dao T. FlashAttention-2: Faster attention with better parallelism and work partitioning. In: Proc. of the 12th Int’l Conf. on Learning
Representations. Vienna: ICLR, 2024.
[68] Beltagy I, Peters ME, Cohan A. Longformer: The long-document Transformer. arXiv:2004.05150, 2020.
[69] Kitaev N, Kaiser Ł, Levskaya A. Reformer: The efficient Transformer. In: Proc. of the 8th Int’l Conf. on Learning Representations.
Addis Ababa: ICLR, 2020.
[70] Wang SN, Li BZ, Khabsa M, Fang H, Ma H. Linformer: Self-attention with linear complexity. arXiv:2006.04768, 2020.
[71] Zhu C, Ping W, Xiao CW, Shoeybi M, Goldstein T, Anandkumar A, Catanzaro B. Long-short Transformer: Efficient Transformers for
language and vision. In: Proc. of the 35th Int’l Conf. on Neural Information Processing Systems. ACM, 2021. 17723–17736.
[72] Micikevicius P, Narang S, Alben J, Diamos G, Elsen E, Garcia D, Ginsburg B, Houston M, Kuchaiev O, Venkatesh G, Wu H. Mixed
precision training. arXiv:1710.03740, 2017.
[73] Liu ZC, Oguz B, Zhao CS, Chang E, Stock P, Mehdad Y, Shi YY, Krishnamoorthi R, Chandra V. LLM-QAT: Data-free quantization
aware training for large language models. In: Proc. of the Findings of the Association for Computational Linguistics. Bangkok: ACL,
2024. 467–484. [doi: 10.18653/v1/2024.findings-acl.26]
[74] Arshia FZ, Keyvanrad MA, Sadidpour SS, Mohammadi SMR. PeQA: A massive Persian question-answering and chatbot dataset. In:
Proc. of the 12th Int’l Conf. on Computer and Knowledge Engineering (ICCKE). Mashhad: IEEE, 2022. 392–397. [doi: 10.1109/ICCKE57176.2022.9960071]
[75] Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: Efficient finetuning of quantized LLMs. In: Proc. of the 37th Int’l Conf.
on Neural Information Processing Systems. New Orleans: ACM, 2023. 441.
[76] Wang WY, Khazraee M, Zhong ZZ, Ghobadi M, Jia ZH, Mudigere D, Zhang Y, Kewitsch A. TopoOpt: Co-optimizing network
topology and parallelization strategy for distributed training jobs. In: Proc. of the 20th USENIX Symp. on Networked Systems Design
and Implementation. Boston: USENIX, 2023. 739–767.
[77] Cowan M, Maleki S, Musuvathi M, Saarikivi O, Xiong YF. MSCCLang: Microsoft collective communication language. In: Proc. of the
28th ACM Int’l Conf. on Architectural Support for Programming Languages and Operating Systems. Vancouver: ACM, 2023. 502–514.
[doi: 10.1145/3575693.3575724]
[78] Cho M, Finkler U, Kung DS, Hunter HC. BlueConnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy.
In: Proc. of the 2nd Conf. on Machine Learning and Systems. Stanford: SysML, 2019. 241–251.
[79] Luo L, West P, Krishnamurthy A, Ceze L, Nelson J. PLink: Discovering and exploiting locality for accelerated distributed training on
the public cloud. In: Proc. of the 3rd Conf. on Machine Learning and Systems. Austin: MLSys, 2020. 82–97.
[80] Wang GH, Venkataraman S, Phanishayee A, Thelin J, Devanur NR, Stoica I. Blink: Fast and generic collectives for distributed ML. In:
Proc. of the 3rd Conf. on Machine Learning and Systems. Austin: MLSys, 2020. 172–186.
[81] Cai ZX, Liu ZY, Maleki S, Musuvathi M, Mytkowicz T, Nelson J, Saarikivi O. Synthesizing optimal collective algorithms. In: Proc. of
the 26th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming. ACM, 2021. 62–75. [doi: 10.1145/3437801.3441620]
[82] Zhuang HP, Wang Y, Liu QL, Lin ZP. Fully decoupled neural network learning using delayed gradients. IEEE Trans. on Neural
Networks and Learning Systems, 2022, 33(10): 6013–6020. [doi: 10.1109/TNNLS.2021.3069883]
[83] Hashemi SH, Abdu Jyothi S, Campbell RH. TicTac: Accelerating distributed deep learning with communication scheduling. In: Proc. of
the 2nd Conf. on Machine Learning and Systems. Stanford: SysML, 2019. 418–430.

