
Future work on this paper can include: (1) extending the problem addressed here into a multi-objective optimization problem, so that the proposed method applies to more scenarios; for example, in scenarios where jobs carry priorities, improving the deadline satisfaction ratio while keeping the resource scheduling order consistent with job priorities; (2) handling node failures and fault tolerance in the cluster by estimating failure probabilities and reserving part of the cloud cluster resources or the compute capacity of edge nodes (a minimal sketch follows below), so that even when nodes fail, as many jobs as possible still finish before their deadlines; (3) covering more complex situations, for example, designing an algorithm that allows a job on an edge server to use multiple GPUs, based on the differing numbers of idle GPUs on different edge servers.
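As a minimal illustration of direction (2), the following Python sketch (all names hypothetical; this is not part of EdgeFlow) estimates how many spare nodes to reserve so that, assuming independent per-node failures with a known probability, enough workers survive for a job to complete before its deadline with high confidence:

import math

def nodes_to_reserve(required: int, p_fail: float, target: float = 0.99) -> int:
    # Smallest r such that at least `required` of (required + r) nodes
    # survive with probability >= target, assuming independent failures
    # with per-node probability p_fail (a binomial model; illustrative only).
    def survival_prob(total: int) -> float:
        # P(at most total - required failures) under Binomial(total, p_fail)
        return sum(
            math.comb(total, k) * p_fail ** k * (1 - p_fail) ** (total - k)
            for k in range(total - required + 1)
        )
    r = 0
    while survival_prob(required + r) < target:
        r += 1
    return r

# Example: a job needs 8 workers; each node fails independently with
# probability 0.05 -> reserve this many spare nodes before admission.
print(nodes_to_reserve(8, 0.05))  # prints 3

A scheduler could subtract the reserved capacity from the available pool when making admission decisions, trading some utilization for robustness of deadline guarantees.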

7   Conclusion

Edge servers are receiving increasing attention and adoption across a wide range of everyday scenarios, but because user workloads on edge servers exhibit tidal patterns, the computing resources of edge servers sit idle during certain periods and are underutilized. This paper designs EdgeFlow, a cloud-edge collaborative scheduling system that puts these idle edge server computing resources to work on distributed deep learning training jobs from the cloud computing cluster, improving edge server utilization while allowing more deadline-sensitive deep learning training jobs to finish before their deadlines. Experimental results show that EdgeFlow outperforms the baseline methods in improving the deadline satisfaction ratio of jobs.
