Page 182 - 《软件学报》2025年第4期

P. 182

1588 软件学报 2025 年第 36 卷第 4 期

[49] Yeung G, Borowiec D, Friday A, Harper R, Garraghan P. Towards GPU utilization prediction for cloud deep learning. In: Proc. of the
12th USENIX Workshop on Hot Topics in Cloud Computing. USENIX, 2020.
[50] Bai L, Ji WX, Li QY, Yao XL, Xin W, Zhu WY. DNNAbacus: Toward accurate computational cost prediction for deep neural networks.
arXiv:2205.12095, 2022.
[51] He X, Zhao KY, Chu XW. AutoML: A survey of the state-of-the-art. Knowledge-based Systems, 2021, 212: 106622. [doi: 10.1016/j.
knosys.2020.106622]
[52] Gao YJ, Gu XY, Zhang HY, Lin HX, Yang M. Runtime performance prediction for deep learning models with graph neural network. In:
Proc. of the 45th IEEE/ACM Int’l Conf. on Software Engineering: Software Engineering in Practice. Melbourne: IEEE, 2023. 368–380.
[doi: 10.1109/ICSE-SEIP58684.2023.00039]
[53] Yang G, Shin C, Lee J, Yoo Y, Yoo C. Prediction of the resource consumption of distributed deep learning systems. Proc. of the ACM on
Measurement and Analysis of Computing Systems, 2022, 6(2): 29. [doi: 10.1145/3530895]
[54] Yu GX, Gao YB, Golikov P, Pekhimenko G. Habitat: A runtime-based computational performance predictor for deep neural network
training. In: Proc. of the 2021 USENIX Annual Technical Conf. USENIX, 2021. 503–521.
[55] Radford A, Metz L, Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks. In: Proc.
of the 4th Int’l Conf. on Learning Representations. San Juan: ICLR, 2016. [doi: 10.48550/arXiv.1511.06434]
[56] Liu GD, Wang S, Bao YG. SEER: A time prediction model for CNNs from GPU kernel’s view. In: Proc. of the 30th Int’l Conf. on
Parallel Architectures and Compilation Techniques (PACT). Atlanta: IEEE, 2021. 173–185. [doi: 10.1109/PACT52795.2021.00020]
[57] Wang CC, Liao YC, Kao MC, Liang WY, Hung SH. PerfNet: Platform-aware performance modeling for deep neural networks. In: Proc.
of the 2020 Int’l Conf. on Research in Adaptive and Convergent Systems. Gwangju: ACM, 2020. 90–95. [doi: 10.1145/3400286.
3418245]
[58] Wang CC, Liao YC, Kao MC, Liang WY, Hung SH. Toward accurate platform-aware performance modeling for deep neural networks.
ACM SIGAPP Applied Computing Review, 2021, 21(1): 50–61. [doi: 10.1145/3477133.3477137]
[59] Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, Saha B, Curino C, O’Malley
O, Radia S, Reed B, Baldeschwieler E. Apache Hadoop YARN: Yet another resource negotiator. In: Proc. of the 4th Annual Symp. on
Cloud Computing. Santa Clara: ACM, 2013. 5. [doi: 10.1145/2523616.2523633]
[60] Mahajan K, Balasubramanian A, Singhvi A, Venkataraman S, Akella A, Phanishayee A, Chawla S. THEMIS: Fair and efﬁcient GPU
cluster scheduling. In: Proc. of the 17th USENIX Conf. on Networked Systems Design and Implementation. Santa Clara: USENIX, 2020.
289–304.
[61] Kargahi M, Movaghar A. A method for performance analysis of earliest-deadline-first scheduling policy. The Journal of Supercomputing,
2006, 37(2): 197–222. [doi: 10.1007/s11227-006-5944-2]
[62] Ghodsi A, Zaharia M, Hindman B, Konwinski A, Shenker S, Stoica I. Dominant resource fairness: Fair allocation of multiple resource
types. In: Proc. of the 8th USENIX Symp. on Networked Systems Design and Implementation. Boston: USENIX, 2011.
[63] Chen FH, Li P, Wu C, Guo S. Hare: Exploiting inter-job and intra-job parallelism of distributed machine learning on heterogeneous
GPUs. In: Proc. of the 31st Int’l Symp. on High-performance Parallel and Distributed Computing. Minneapolis: ACM, 2022. 253–264.
ACM: Minneapolis MN USA. [doi: 10.1145/3502181.3531462]
[64] Gu DD, Zhao YH, Zhong YM, Xiong YF, Han ZH, Cheng P, Yang F, Huang G, Jin X, Liu XZ. ElasticFlow: An elastic serverless
training platform for distributed deep learning. In: Proc. of the 28th ACM Int’l Conf. on Architectural Support for Programming
Languages and Operating Systems. Vancouver: ACM, 2023. 266–280. [doi: 10.1145/3575693.3575721]
[65] Zhao YH, Liu X, Liu SF, Li X, Zhu YB, Huang G, Liu XZ, Jin X. MuxFlow: Efficient and safe GPU sharing in large-scale production
deep learning clusters. arXiv:2303.13803, 2023.
[66] Wang HY, Liu ZT, Shen HY. Job scheduling for large-scale machine learning clusters. In: Proc. of the 16th Int’l Conf. on Emerging
Networking Experiments and Technologies. Barcelona: ACM, 2020. 108–120. [doi: 10.1145/3386367.3432588]
[67] Liaw R, Bhardwaj R, Dunlap L, Zou YT, Gonzalez JE, Stoica I, Tumanov A. HyperSched: Dynamic resource reallocation for model
development on a deadline. In: Proc. of the 2019 ACM Symp. on Cloud Computing. Santa Cruz: ACM, 2019. 61–73. [doi: 10.1145/
3357223.3362719]
[68] Yu ML, Wu C, Ji B, Liu J. A sum-of-ratios multi-dimensional-knapsack decomposition for DNN resource scheduling. In: Proc. of the
2021 IEEE Conf. on Computer Communications. Vancouver: IEEE, 2021. 1–10. [doi: 10.1109/INFOCOM42981.2021.9488916]
[69] Sun QX, Liu Y, Yang HL, Zhang RZ, Dun M, Li MZ, Liu XY, Xiao WC, Li Y, Luan ZZ, Qian DP. CoGNN: Efficient scheduling for
concurrent GNN training on GPUs. In: Proc. of the 2022 Int’l Conf. for High Performance Computing, Networking, Storage and
Analysis. Dallas: IEEE, 2022. 1–15. [doi: 10.1109/SC41404.2022.00044]
[70] NVIDIA. Multi-process service. 2023. https://docs.nvidia.com/deploy/mps/index.html

177 178 179 180 181 182 183 184 185 186 187