Journal of Software (《软件学报》), 2026, Issue 1

Gao Yanjie et al.: A survey of key technologies in large language model pre-training systems


checkpointing system for training deep learning recommendation models. In: Proc. of the 19th USENIX Symp. on Networked Systems Design and Implementation. Renton: USENIX, 2022. 929–943.
[105] Jang I, Yang ZN, Zhang Z, Jin X, Chowdhury M. Oobleck: Resilient distributed training of large models using pipeline templates. In: Proc. of the 29th Symp. on Operating Systems Principles. Koblenz: ACM, 2023. 382–395. [doi: 10.1145/3600006.3613152]
[106] PyTorch. Elastic. 2024. https://github.com/pytorch/elastic
[107] Wang QL, Sang B, Zhang HT, Tang MJ, Zhang K. DLRover: An elastic deep training extension with auto job resource recommendation. arXiv:2304.01468, 2023.
[108] Xiao WC, Ren SR, Li Y, Zhang Y, Hou PY, Li Z, Feng YH, Lin W, Jia YQ. AntMan: Dynamic scaling on GPU clusters for deep learning. In: Proc. of the 14th USENIX Symp. on Operating Systems Design and Implementation. USENIX, 2020. 533–548.
[109] Shukla D, Sivathanu M, Viswanatha S, et al. Singularity: Planet-scale, preemptive and elastic scheduling of AI workloads. arXiv:2202.07848, 2022.

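References in Chinese
[20] Ma ZX, Zhai JD, Han WT, Chen WG, Zheng WM. System challenges and countermeasures for efficiently training pre-trained models with hundreds of trillions of parameters. ZTE Technology Journal, 2022, 28(2): 51–58 (in Chinese). [doi: 10.12142/ZTETJ.202202008]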

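About the authors
Gao Yanjie (高彦杰), Ph.D. candidate, CCF professional member. Main research interests: large language model systems and tools, big data systems.
Chen Yueguo (陈跃国), Ph.D., professor, doctoral supervisor, CCF senior member. Main research interests: financial technology, computational social science, knowledge graphs, semantic search.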