Future work may proceed in several directions. (1) Extend the problem studied in this paper to a multi-objective optimization problem, so that the proposed method applies to more scenarios; for example, in scenarios where jobs carry priorities, improve the deadline satisfaction rate while keeping the resource scheduling order consistent with job priorities (a minimal illustration is sketched below). (2) Account for node failures and fault tolerance in the cluster: by estimating failure probabilities and reserving part of the cloud cluster's resources or the edge nodes' computing capacity, the system's fault tolerance can be improved, so that as many jobs as possible still complete before their deadlines even when nodes fail. (3) Handle more complex settings, for example, designing algorithms that let a job on an edge server use multiple GPUs, based on the number of idle GPUs available on each edge server.
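As a concrete illustration of direction (1), the following is a minimal Python sketch, not drawn from this paper: jobs are ordered by priority first and by earliest deadline within each priority level, and the deadline satisfaction rate is then measured in a single-resource simulation. The Job fields, job names, and all numbers are illustrative assumptions.

# A minimal sketch (not the paper's algorithm) of future direction (1):
# schedule by job priority first, then earliest-deadline-first (EDF)
# within each priority level, and measure the resulting deadline
# satisfaction rate on a single shared resource. The Job fields and
# all numbers below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    priority: int       # smaller value = higher priority (assumed convention)
    deadline: float     # absolute deadline, in arbitrary time units
    est_runtime: float  # estimated training time on the assigned resource

def schedule(jobs: list[Job]) -> list[Job]:
    # Priority is respected first; EDF breaks ties within a priority
    # level, a common heuristic for meeting more deadlines.
    return sorted(jobs, key=lambda j: (j.priority, j.deadline))

def deadline_satisfaction_rate(order: list[Job]) -> float:
    # Run jobs back-to-back on one resource and count on-time finishes.
    now, met = 0.0, 0
    for job in order:
        now += job.est_runtime
        met += now <= job.deadline
    return met / len(order) if order else 1.0

jobs = [
    Job("resnet", priority=1, deadline=11.0, est_runtime=3.0),
    Job("bert",   priority=0, deadline=10.0, est_runtime=5.0),
    Job("gpt",    priority=1, deadline=8.0,  est_runtime=2.0),
]
order = schedule(jobs)
print([j.name for j in order])            # ['bert', 'gpt', 'resnet']
print(deadline_satisfaction_rate(order))  # 1.0

In a full multi-objective formulation, such an ordering would be only one component: the deadline satisfaction rate would be traded off against other objectives, for example edge-server utilization.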
7 Conclusion
Edge servers are attracting growing attention and adoption across everyday application scenarios. However, because user workloads on edge servers exhibit tidal patterns, edge-server computing resources sit idle during certain periods and are underutilized. This paper presents EdgeFlow, a cloud-edge collaborative scheduling system that devotes these idle edge-server computing resources to distributed deep learning training jobs from the cloud cluster, improving edge-server utilization while enabling more deadline-sensitive deep learning training jobs to finish before their deadlines. Experimental results show that EdgeFlow outperforms the baseline methods in improving the deadline satisfaction rate of jobs.