[27]  Gu JC, Chowdhury M, Shin KG, Zhu YW, Jeon M, Qian JJ, Liu HQ, Guo CX. Tiresias: A GPU cluster manager for distributed deep learning. In: Proc. of the 16th USENIX Symp. on Networked Systems Design and Implementation. Boston: USENIX, 2019. 485–500.
[28]  Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional Transformers for language understanding. In: Proc. of the 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis: ACL, 2019. 4171–4186. [doi: 10.18653/v1/N19-1423]
                 [29]  Yang ZC, Wu H, Xu YJ, Wu YW, Zhong H, Zhang WB. Hydra: Deadline-aware and efficiency-oriented scheduling for deep learning
                     jobs on heterogeneous GPUs. IEEE Trans. on Computers, 2023, 72(8): 2224–2236. [doi: 10.1109/TC.2023.3242200]
                 [30]  Le TN, Sun X, Chowdhury M, Liu ZH. AlloX: Compute allocation in hybrid clusters. In: Proc. of the 15th European Conf. on Computer
                     Systems. Heraklion: ACM, 2020. 31. [doi: 10.1145/3342195.3387547]
                 [31]  Zheng HY, Xu F, Chen L, Zhou Z, Liu FM. Cynthia: Cost-efficient cloud resource provisioning for predictable distributed deep neural
                     network training. In: Proc. of the 48th Int’l Conf. on Parallel Processing. Kyoto: ACM, 2019. 86. [doi: 10.1145/3337821.3337873]
                 [32]  Mohan J, Phanishayee A, Kulkarni J, Chidambaram V. Looking beyond GPUs for DNN scheduling on multi-tenant clusters. In: Proc. of
                     the 16th USENIX Symp. on Operating Systems Design and Implementation. Carlsbad: USENIX, 2022. 579–596.
                 [33]  Peng YH, Bao YX, Chen YR, Wu C, Guo CX. Optimus: An efficient dynamic resource scheduler for deep learning clusters. In: Proc. of
                     the 13th EuroSys Conf. Porto: ACM, 2018. 3. [doi: 10.1145/3190508.3190517]
[34]  Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proc. of the 2016 IEEE Conf. on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016. 2818–2826. [doi: 10.1109/CVPR.2016.308]
[35]  Sutskever I, Vinyals O, Le QV. Sequence to sequence learning with neural networks. In: Proc. of the 27th Int’l Conf. on Neural Information Processing Systems. Montreal: MIT Press, 2014. 3104–3112.
[36]  Zheng PF, Pan R, Khan T, Venkataraman S, Akella A. Shockwave: Fair and efficient cluster scheduling for dynamic adaptation in machine learning. In: Proc. of the 20th USENIX Symp. on Networked Systems Design and Implementation. Boston: USENIX, 2023. 703–723.
[37]  Agarwal S, Wang HY, Lee K, Venkataraman S, Papailiopoulos D. Adaptive gradient communication via critical learning regime identification. In: Proc. of the 4th Conf. on Machine Learning and Systems. MLSys, 2021. 55–80.
[38]  Qin HY, Rajbhandari S, Ruwase O, Yan F, Yang L, He YX. SimiGrad: Fine-grained adaptive batching for large scale training using gradient similarity measurement. In: Proc. of the 35th Int’l Conf. on Neural Information Processing Systems. NeurIPS, 2021. 20531–20544.
                 [39]  Zhu HY, Phanishayee A, Pekhimenko G. Daydream: Accurately estimating the efficacy of optimizations for DNN training. In: Proc. of
                     the 2020 USENIX Annual Technical Conf. USENIX, 2020. 337–352.
[40]  Lam MO, Hollingsworth JK, De Supinski BR, Legendre MP. Automatically adapting programs for mixed-precision floating-point computation. In: Proc. of the 27th ACM Int’l Conf. on Supercomputing. Eugene: ACM, 2013. 369–378. [doi: 10.1145/2464996.2465018]
                 [41]  Niu W, Guan JX, Wang YZ, Agrawal G, Ren B. DNNFusion: Accelerating deep neural networks execution with advanced operator
                     fusion. In: Proc. of the 42nd ACM SIGPLAN Int’l Conf. on Programming Language Design and Implementation. ACM, 2021. 883–898.
                     [doi: 10.1145/3453483.3454083]
[42]  Duan JF, Li XH, Xu P, Zhang XC, Yan SG, Liang Y, Lin DH. Proteus: Simulating the performance of distributed DNN training. arXiv:2306.02267, 2023.
[43]  Hu QH, Sun P, Yan SG, Wen YG, Zhang TW. Characterization and prediction of deep learning workloads in large-scale GPU datacenters. In: Proc. of the 2021 Int’l Conf. for High Performance Computing, Networking, Storage and Analysis. St. Louis: ACM, 2021. 104. [doi: 10.1145/3458817.3476223]
                 [44]  Bao YX, Peng YH, Wu C. Deep learning-based job placement in distributed machine learning clusters. In: Proc. of the 2019 IEEE Conf.
                     on Computer Communications. Paris: IEEE, 2019. 505–513. [doi: 10.1109/INFOCOM.2019.8737460]
[45]  Graves A, Fernández S, Gomez F, Schmidhuber J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In: Proc. of the 23rd Int’l Conf. on Machine Learning. Pittsburgh: ACM, 2006. 369–376. [doi: 10.1145/1143844.1143891]
                 [46]  Chen ZY, Quan W, Wen M, Fang JB, Yu J, Zhang CY, Luo L. Deep learning research and development platform: Characterizing and
                     scheduling with QoS guarantees on GPU clusters. IEEE Trans. on Parallel and Distributed Systems, 2020, 31(1): 34–50. [doi: 10.1109/
                     TPDS.2019.2931558]
[47]  Steinberg D, Colla P. CART: Classification and regression trees. In: The Top Ten Algorithms in Data Mining. Boca Raton: Chapman & Hall/CRC, 2009. 179.
                 [48]  Yeung G, Borowiec D, Yang RY, Friday A, Harper R, Garraghan P. Horus: Interference-aware and prediction-based scheduling in deep
                     learning systems. IEEE Trans. on Parallel and Distributed Systems, 2022, 33(1): 88–100. [doi: 10.1109/TPDS.2021.3079202]