

How prompts can be designed so that large language models assist in devising new parallel strategies and in writing more efficient kernels; how logs, monitoring data, and program information can be used to let large language models help diagnose performance defects and suggest the next optimization step; and how, when a training job crashes, large language models can analyze the error logs to assist in diagnosing, debugging, and repairing program defects: all of these directions require further exploration. A minimal sketch of the last direction is given below.
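The Python snippet below is purely illustrative and rests on assumed details: the prompt wording, the log file name train_rank0.err, and the placeholder function query_llm stand in for whatever prompt design, log layout, and model-serving interface a given cluster actually provides.

from pathlib import Path

# Hypothetical prompt template for LLM-assisted crash diagnosis; the wording
# and the requested output format are assumptions made for this sketch.
DIAGNOSIS_PROMPT = """You are assisting with a distributed LLM pre-training job.
The job has crashed; the last lines of its error log are shown below.

--- error log (tail) ---
{log_tail}
------------------------

Please (1) identify the most likely root cause (for example an OOM, a NCCL
timeout, or a corrupted checkpoint), (2) quote the log lines that support the
diagnosis, and (3) suggest the next debugging or repair step."""


def build_diagnosis_prompt(log_path: Path, max_lines: int = 200) -> str:
    """Assemble a diagnosis prompt from the tail of a training error log."""
    lines = log_path.read_text(errors="replace").splitlines()
    return DIAGNOSIS_PROMPT.format(log_tail="\n".join(lines[-max_lines:]))


def query_llm(prompt: str) -> str:
    """Placeholder for the model-serving client available on the cluster."""
    raise NotImplementedError("replace with the actual LLM client call")


if __name__ == "__main__":
    # train_rank0.err is a hypothetical log file name.
    prompt = build_diagnosis_prompt(Path("train_rank0.err"))
    print(prompt)               # inspect the assembled prompt
    # print(query_llm(prompt))  # would return the model's diagnosis

In practice the placeholder call would be replaced by the serving endpoint of the diagnosing model, and the returned analysis would be surfaced alongside the job's monitoring data.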
6  Conclusion

With the rapid development of large language model applications and technologies, new applications keep emerging, model sizes and deployments are growing quickly, and large language models are penetrating more and more industries and business domains, becoming an important factor of production. This survey systematically deconstructs the training process of large language models and reviews the state of the art of the system technologies that support them, covering pre-training systems, system scalability, performance, and reliability. Although a consensus on these individual technologies has gradually formed, no unified co-design and optimization scheme spanning training stages and technology stacks has yet been established. This survey summarizes the key role each technology plays in large language model training and compares their strengths and weaknesses. Finally, it outlines the challenges currently facing large language model pre-training systems and proposes potential countermeasures.
