Journal of Software (《软件学报》), 2026, No. 1, p. 232
Gao Yanjie, et al.: Survey of key technologies for large language model pre-training systems 229
checkpointing system for training deep learning recommendation models. In: Proc. of the 19th USENIX Symp. on Networked Systems
Design and Implementation. Renton: USENIX, 2022. 929–943.
[105] Jang I, Yang ZN, Zhang Z, Jin X, Chowdhury M. Oobleck: Resilient distributed training of large models using pipeline templates. In:
Proc. of the 29th Symp. on Operating Systems Principles. Koblenz: ACM, 2023. 382–395. [doi: 10.1145/3600006.3613152]
[106] PyTorch. Elastic. 2024. https://github.com/pytorch/elastic
[107] Wang QL, Sang B, Zhang HT, Tang MJ, Zhang K. DLRover: An elastic deep training extension with auto job resource
recommendation. arXiv:2304.01468, 2023.
[108] Xiao WC, Ren SR, Li Y, Zhang Y, Hou PY, Li Z, Feng YH, Lin W, Jia YQ. AntMan: Dynamic scaling on GPU clusters for deep
learning. In: Proc. of the 14th USENIX Symp. on Operating Systems Design and Implementation. USENIX, 2020. 533–548.
[109] Shukla D, Sivathanu M, Viswanatha S, et al. Singularity: Planet-scale, preemptive and elastic scheduling of AI workloads.
arXiv:2202.07848, 2022.
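Chinese-language references
[20] Ma ZX, Zhai JD, Han WT, Chen WG, Zheng WM. System challenges and countermeasures for efficiently training pre-trained models
with 100 trillion parameters. ZTE Technology Journal, 2022, 28(2): 51–58 (in Chinese). [doi: 10.12142/ZTETJ.202202008]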
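About the authors
Gao Yanjie (高彦杰), Ph.D. candidate, CCF professional member. Research interests: large language model systems and tools, big data systems.
Chen Yueguo (陈跃国), Ph.D., professor, doctoral supervisor, CCF senior member. Research interests: financial technology, computational social science, knowledge graphs, semantic search.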

