Journal of Software (软件学报) ISSN 1000-9825, CODEN RUXUEW E-mail: jos@iscas.ac.cn
2025,36(4):1570−1589 [doi: 10.13328/j.cnki.jos.007202] [CSTR: 32375.14.jos.007202] http://www.jos.org.cn
© Institute of Software, Chinese Academy of Sciences. All rights reserved. Tel: +86-10-62562563
Survey on Task Scheduling of Deep Learning Training Based on Performance Modeling *
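YANG Zi-Chao (杨紫超) 1,2 , WU Heng (吴恒) 2,3,4 , WU Yue-Wen (吴悦文) 2 , ZHANG Wen-Bo (张文博) 2,3,4
1 (University of Chinese Academy of Sciences, Beijing 100049)
2 (Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190)
3 (University of Chinese Academy of Sciences, Nanjing, Nanjing 211135, Jiangsu)
4 (Nanjing Institute of Software Technology, Nanjing 211135, Jiangsu)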
Corresponding author: ZHANG Wen-Bo, E-mail: zhangwenbo@otcaix.iscas.ac.cn
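Abstract: In recent years, the results of deep learning research have been applied widely around the world. To train large-scale deep learning models more efficiently, industry typically builds GPU clusters and equips them with efficient task schedulers. Deep learning training tasks, however, have complex performance characteristics such as performance heterogeneity and sensitivity to placement topology. Scheduling that is unaware of performance easily leads to low resource utilization and poor training efficiency. To address this challenge, many deep learning training task schedulers based on performance modeling have appeared recently. These schedulers build accurate performance models to understand the complex performance characteristics of tasks, design better scheduling algorithms on that basis, and thereby produce more efficient scheduling plans. This survey first classifies and reviews the performance modeling methods used by current schedulers according to their modeling design approach. It then systematically analyzes existing task scheduling work according to the way each scheduler uses performance modeling to optimize scheduling. Finally, it discusses future research directions for performance modeling and scheduling.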
Keywords: deep learning training; performance modeling; task scheduling
Chinese Library Classification (CLC) number: TP18
Citation format (Chinese): 杨紫超, 吴恒, 吴悦文, 张文博. 基于性能建模的深度学习训练任务调度综述. 软件学报, 2025, 36(4): 1570–1589.
http://www.jos.org.cn/1000-9825/7202.htm
Citation format (English): Yang ZC, Wu H, Wu YW, Zhang WB. Survey on Task Scheduling of Deep Learning Training Based on Performance
Modeling. Ruan Jian Xue Bao/Journal of Software, 2025, 36(4): 1570–1589 (in Chinese). http://www.jos.org.cn/1000-9825/7202.htm
Survey on Task Scheduling of Deep Learning Training Based on Performance Modeling
YANG Zi-Chao 1,2 , WU Heng 2,3,4 , WU Yue-Wen 2 , ZHANG Wen-Bo 2,3,4
1 (University of Chinese Academy of Sciences, Beijing 100049, China)
2 (Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China)
3 (University of Chinese Academy of Sciences, Nanjing, Nanjing 211135, China)
4 (Nanjing Institute of Software Technology, Nanjing 211135, China)
Abstract: In recent years, research achievements in deep learning have been widely applied around the world. To improve the training
efficiency of large-scale deep learning models, industry typically builds GPU clusters and equips them with efficient task
schedulers. However, deep learning training tasks exhibit complex performance characteristics, such as performance heterogeneity and
sensitivity to placement topology. Performance-unaware scheduling can therefore lead to low resource utilization and poor
training efficiency. In response to this challenge, a large number of deep learning training task schedulers based on performance
modeling have emerged recently. By constructing accurate performance models, these schedulers capture the complex performance
characteristics of tasks and use that understanding to design better scheduling algorithms, which in turn yield more efficient
scheduling plans. This study first classifies and reviews the performance modeling methods employed by current schedulers according to
their modeling design approach. It then systematically analyzes existing task scheduling work according to how schedulers use
performance modeling to optimize scheduling. Finally, this study outlines prospective research directions
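* Funding: Shandong Province Major Innovation Project (2021CXGC010101); National Natural Science Foundation of China (62302489)
Received: 2023-09-25; Revised: 2023-11-06, 2024-02-07; Accepted: 2024-04-09; Published online by JOS: 2024-06-20
First published online on CNKI: 2024-06-21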