Page 182 - 《软件学报》2025年第4期
P. 182

1588                                                       软件学报  2025  年第  36  卷第  4  期


                 [49]  Yeung G, Borowiec D, Friday A, Harper R, Garraghan P. Towards GPU utilization prediction for cloud deep learning. In: Proc. of the
                     12th USENIX Workshop on Hot Topics in Cloud Computing. USENIX, 2020.
                 [50]  Bai L, Ji WX, Li QY, Yao XL, Xin W, Zhu WY. DNNAbacus: Toward accurate computational cost prediction for deep neural networks.
                     arXiv:2205.12095, 2022.
                 [51]  He X, Zhao KY, Chu XW. AutoML: A survey of the state-of-the-art. Knowledge-based Systems, 2021, 212: 106622. [doi: 10.1016/j.
                     knosys.2020.106622]
                 [52]  Gao YJ, Gu XY, Zhang HY, Lin HX, Yang M. Runtime performance prediction for deep learning models with graph neural network. In:
                     Proc. of the 45th IEEE/ACM Int’l Conf. on Software Engineering: Software Engineering in Practice. Melbourne: IEEE, 2023. 368–380.
                     [doi: 10.1109/ICSE-SEIP58684.2023.00039]
                 [53]  Yang G, Shin C, Lee J, Yoo Y, Yoo C. Prediction of the resource consumption of distributed deep learning systems. Proc. of the ACM on
                     Measurement and Analysis of Computing Systems, 2022, 6(2): 29. [doi: 10.1145/3530895]
                 [54]  Yu GX, Gao YB, Golikov P, Pekhimenko G. Habitat: A runtime-based computational performance predictor for deep neural network
                     training. In: Proc. of the 2021 USENIX Annual Technical Conf. USENIX, 2021. 503–521.
                 [55]  Radford A, Metz L, Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks. In: Proc.
                     of the 4th Int’l Conf. on Learning Representations. San Juan: ICLR, 2016. [doi: 10.48550/arXiv.1511.06434]
                 [56]  Liu GD, Wang S, Bao YG. SEER: A time prediction model for CNNs from GPU kernel’s view. In: Proc. of the 30th Int’l Conf. on
                     Parallel Architectures and Compilation Techniques (PACT). Atlanta: IEEE, 2021. 173–185. [doi: 10.1109/PACT52795.2021.00020]
                 [57]  Wang CC, Liao YC, Kao MC, Liang WY, Hung SH. PerfNet: Platform-aware performance modeling for deep neural networks. In: Proc.
                     of  the  2020  Int’l  Conf.  on  Research  in  Adaptive  and  Convergent  Systems.  Gwangju:  ACM,  2020.  90–95.  [doi:  10.1145/3400286.
                     3418245]
                 [58]  Wang CC, Liao YC, Kao MC, Liang WY, Hung SH. Toward accurate platform-aware performance modeling for deep neural networks.
                     ACM SIGAPP Applied Computing Review, 2021, 21(1): 50–61. [doi: 10.1145/3477133.3477137]
                 [59]  Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, Saha B, Curino C, O’Malley
                     O, Radia S, Reed B, Baldeschwieler E. Apache Hadoop YARN: Yet another resource negotiator. In: Proc. of the 4th Annual Symp. on
                     Cloud Computing. Santa Clara: ACM, 2013. 5. [doi: 10.1145/2523616.2523633]
                 [60]  Mahajan K, Balasubramanian A, Singhvi A, Venkataraman S, Akella A, Phanishayee A, Chawla S. THEMIS: Fair and efficient GPU
                     cluster scheduling. In: Proc. of the 17th USENIX Conf. on Networked Systems Design and Implementation. Santa Clara: USENIX, 2020.
                     289–304.
                 [61]  Kargahi M, Movaghar A. A method for performance analysis of earliest-deadline-first scheduling policy. The Journal of Supercomputing,
                     2006, 37(2): 197–222. [doi: 10.1007/s11227-006-5944-2]
                 [62]  Ghodsi A, Zaharia M, Hindman B, Konwinski A, Shenker S, Stoica I. Dominant resource fairness: Fair allocation of multiple resource
                     types. In: Proc. of the 8th USENIX Symp. on Networked Systems Design and Implementation. Boston: USENIX, 2011.
                 [63]  Chen FH, Li P, Wu C, Guo S. Hare: Exploiting inter-job and intra-job parallelism of distributed machine learning on heterogeneous
                     GPUs. In: Proc. of the 31st Int’l Symp. on High-performance Parallel and Distributed Computing. Minneapolis: ACM, 2022. 253–264.
                     ACM: Minneapolis MN USA. [doi: 10.1145/3502181.3531462]
                 [64]  Gu DD, Zhao YH, Zhong YM, Xiong YF, Han ZH, Cheng P, Yang F, Huang G, Jin X, Liu XZ. ElasticFlow: An elastic serverless
                     training  platform  for  distributed  deep  learning.  In:  Proc.  of  the  28th  ACM  Int’l  Conf.  on  Architectural  Support  for  Programming
                     Languages and Operating Systems. Vancouver: ACM, 2023. 266–280. [doi: 10.1145/3575693.3575721]
                 [65]  Zhao YH, Liu X, Liu SF, Li X, Zhu YB, Huang G, Liu XZ, Jin X. MuxFlow: Efficient and safe GPU sharing in large-scale production
                     deep learning clusters. arXiv:2303.13803, 2023.
                 [66]  Wang HY, Liu ZT, Shen HY. Job scheduling for large-scale machine learning clusters. In: Proc. of the 16th Int’l Conf. on Emerging
                     Networking Experiments and Technologies. Barcelona: ACM, 2020. 108–120. [doi: 10.1145/3386367.3432588]
                 [67]  Liaw R, Bhardwaj R, Dunlap L, Zou YT, Gonzalez JE, Stoica I, Tumanov A. HyperSched: Dynamic resource reallocation for model
                     development on a deadline. In: Proc. of the 2019 ACM Symp. on Cloud Computing. Santa Cruz: ACM, 2019. 61–73. [doi: 10.1145/
                     3357223.3362719]
                 [68]  Yu ML, Wu C, Ji B, Liu J. A sum-of-ratios multi-dimensional-knapsack decomposition for DNN resource scheduling. In: Proc. of the
                     2021 IEEE Conf. on Computer Communications. Vancouver: IEEE, 2021. 1–10. [doi: 10.1109/INFOCOM42981.2021.9488916]
                 [69]  Sun QX, Liu Y, Yang HL, Zhang RZ, Dun M, Li MZ, Liu XY, Xiao WC, Li Y, Luan ZZ, Qian DP. CoGNN: Efficient scheduling for
                     concurrent  GNN  training  on  GPUs.  In:  Proc.  of  the  2022  Int’l  Conf.  for  High  Performance  Computing,  Networking,  Storage  and
                     Analysis. Dallas: IEEE, 2022. 1–15. [doi: 10.1109/SC41404.2022.00044]
                 [70]  NVIDIA. Multi-process service. 2023. https://docs.nvidia.com/deploy/mps/index.html
   177   178   179   180   181   182   183   184   185   186   187