

How prompts can be designed so that large language models assist in devising new parallel strategies and in writing more efficient kernels; how logs, monitoring data, and program information can be used to let large language models help diagnose performance defects and suggest the next optimization step; and how, when a training job crashes, large language models can analyze the error logs to assist in diagnosing, debugging, and repairing program defects: all of these directions require further exploration. A minimal sketch of the last direction is given below.
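The Python snippet below is purely illustrative and rests on assumed details: the prompt wording, the log file name train_rank0.err, and the placeholder function query_llm stand in for whatever prompt design, log layout, and model-serving interface a given cluster actually provides.

from pathlib import Path

# Hypothetical prompt template for LLM-assisted crash diagnosis; the wording
# and the requested output format are assumptions made for this sketch.
DIAGNOSIS_PROMPT = """You are assisting with a distributed LLM pre-training job.
The job has crashed; the last lines of its error log are shown below.

--- error log (tail) ---
{log_tail}
------------------------

Please (1) identify the most likely root cause (for example an OOM, a NCCL
timeout, or a corrupted checkpoint), (2) quote the log lines that support the
diagnosis, and (3) suggest the next debugging or repair step."""


def build_diagnosis_prompt(log_path: Path, max_lines: int = 200) -> str:
    """Assemble a diagnosis prompt from the tail of a training error log."""
    lines = log_path.read_text(errors="replace").splitlines()
    return DIAGNOSIS_PROMPT.format(log_tail="\n".join(lines[-max_lines:]))


def query_llm(prompt: str) -> str:
    """Placeholder for the model-serving client available on the cluster."""
    raise NotImplementedError("replace with the actual LLM client call")


if __name__ == "__main__":
    # train_rank0.err is a hypothetical log file name.
    prompt = build_diagnosis_prompt(Path("train_rank0.err"))
    print(prompt)               # inspect the assembled prompt
    # print(query_llm(prompt))  # would return the model's diagnosis

In practice the placeholder call would be replaced by the serving endpoint of the diagnosing model, and the returned analysis would be surfaced alongside the job's monitoring data.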
6  Conclusion

With the rapid development of large language model applications and technologies, new applications keep emerging, model sizes and deployments are growing quickly, and large language models are penetrating more and more industries and business domains, becoming an important factor of production. This survey systematically deconstructs the training process of large language models and reviews the state of the art of the system technologies that support them, covering pre-training systems, system scalability, performance, and reliability. Although a consensus on these individual technologies has gradually formed, no unified co-design and optimization scheme spanning training stages and technology stacks has yet been established. This survey summarizes the key role each technology plays in large language model training and compares their strengths and weaknesses. Finally, it outlines the challenges currently facing large language model pre-training systems and proposes potential countermeasures.
