Journal of Software (软件学报), 2025, Issue 4
Yang Zichao et al.: A survey of deep learning training job scheduling based on performance modeling
[71] NVIDIA. NVIDIA multi-instance GPU user guide. 2023. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
[72] Gu DD, Xie XT, Huang G, Jin X, Liu XZ. Energy-efficient GPU clusters scheduling for deep learning. arXiv:2304.06381, 2023.
[73] Zhao HY, Han ZH, Yang Z, Zhang QL, Yang F, Zhou LD, Yang M, Lau FCM, Wang YQ, Xiong YF, Wang B. HiveD: Sharing a GPU
cluster for deep learning with guarantees. In: Proc. of the 14th USENIX Symp. on Operating Systems Design and Implementation.
USENIX, 2020. 515–532.
[74] Shukla D, Sivathanu M, Viswanatha S, et al. Singularity: Planet-scale, preemptive and elastic scheduling of AI workloads.
arXiv:2202.07848, 2022.
[75] Wang SQ, Gonzalez OJ, Zhou XB, Williams T, Friedman BD, Havemann M, Woo T. An efficient and non-intrusive GPU scheduling
framework for deep learning training systems. In: Proc. of the 2020 Int’l Conf. for High Performance Computing, Networking, Storage
and Analysis. Atlanta: IEEE, 2020. 1–3. [doi: 10.1109/SC41405.2020.00094]
[76] Yeh TA, Chen HH, Chou J. KubeShare: A framework to manage GPUs as first-class and shared resources in container cloud. In: Proc. of
the 29th Int’l Symp. on High-performance Parallel and Distributed Computing. Stockholm: ACM, 2020. 173–184.
[doi: 10.1145/3369583.3392679]
[77] Gu J, Song SB, Li Y, Luo HM. GaiaGPU: Sharing GPUs in container clouds. In: Proc. of the 2018 IEEE Int’l Conf. on Parallel &
Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing
& Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom). Melbourne: IEEE, 2018.
469–476. [doi: 10.1109/BDCloud.2018.00077]
[78] Wu BY, Zhang ZL, Bai ZH, Liu XZ, Jin X. Transparent GPU sharing in container clouds for deep learning workloads. In: Proc. of the
20th USENIX Symp. on Networked Systems Design and Implementation. Boston: USENIX, 2023. 69–85.
[79] Alibaba. Alibaba cloud elastic GPU service best practice. 2023.
https://static-aliyun-doc.oss-cn-hangzhou.aliyuncs.com/download%2Fpdf%2F163835%2FBest_Practices_reseller_en-US.pdf
[80] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: Proc. of the
31st Int’l Conf. on Neural Information Processing Systems. Long Beach: Curran Associates Inc., 2017. 6000–6010.
[81] OpenAI. GPT-4 technical report. arXiv:2303.08774, 2023.
[82] Baidu. ERNIE bot. 2023. https://yiyan.baidu.com/
References in Chinese:
[4] Liu YC, Zong CQ. End-to-end speech translation with cross-modal information fusion. Journal of Software (软件学报), 2023, 34(4):
1837–1849 (in Chinese). http://www.jos.org.cn/1000-9825/6413.htm [doi: 10.13328/j.cnki.jos.006413]
[11] Song J, Sun ZZ, Mao KM, Bao YB, Yu G. Research progress on MapReduce big data processing platforms and algorithms. Journal of
Software (软件学报), 2017, 28(3): 514–543 (in Chinese). http://www.jos.org.cn/1000-9825/5169.htm [doi: 10.13328/j.cnki.jos.005169]
[14] Ren J, Gao L, Yu JL, Yuan L. An energy-efficient deep learning task scheduling strategy for edge devices. Chinese Journal of
Computers (计算机学报), 2020, 43(3): 440–452 (in Chinese). [doi: 10.11897/SP.J.1016.2020.00440]
[17] Gao HR, Wu H, Xu YJ, Li XH, Wang T, Zhang WB. A survey of memory swapping mechanisms for deep learning training. Journal of
Software (软件学报), 2023, 34(12): 5862–5886 (in Chinese). http://www.jos.org.cn/1000-9825/6800.htm [doi: 10.13328/j.cnki.jos.006800]
[82] Baidu. ERNIE Bot (文心一言). 2023 (in Chinese). https://yiyan.baidu.com/
Yang Zichao (杨紫超, born 1999), male, Ph.D. candidate. His main research interests are resource scheduling and distributed systems.
Wu Yuewen (吴悦文, born 1990), male, Ph.D., CCF professional member. His main research interests are cloud computing and capacity
planning.
Wu Heng (吴恒, born 1983), male, Ph.D., associate research professor. His main research interests are container virtualization and edge
computing.
Zhang Wenbo (张文博, born 1976), male, Ph.D., research professor, doctoral supervisor. His main research interests are cloud computing
and service computing.