                 [84]   Romero J, Yin JQ, Laanait N, Xie B, Young MT, Treichler S, Starchenko V, Borisevich AY, Sergeev A, Matheson MA. Accelerating
                      collective communication in data parallel training across deep learning frameworks. In: Proc. of the 19th USENIX Symp. on Networked
                      Systems Design and Implementation. Renton: USENIX, 2022. 1027–1040.
                 [85]   Faghri F, Tabrizian I, Markov I, Alistarh D, Roy DM, Ramezani-Kebrya A. Adaptive gradient quantization for data-parallel SGD. In:
                      Proc. of the 34th Int’l Conf. on Neural Information Processing Systems. Vancouver: ACM, 2020. 267.
                 [86]   Bian S, Li DC, Wang HY, Xing EP, Venkataraman S. Does compressing activations help model parallel training? In: Proc. of the 7th
                      Annual Conf. on Machine Learning and Systems. Santa Clara: MLSys, 2024. 239–252.
                 [87]   Bayatpour M, Sarkauskas N, Subramoni H, Hashmi JM, Panda DK. BluesMPI: Efficient MPI non-blocking alltoall offloading designs
                      on modern BlueField smart NICs. In: Proc. of the 36th Int’l Conf. on High Performance Computing. Virtual Event: Springer, 2021.
                      18–37. [doi: 10.1007/978-3-030-78713-4_2]
                 [88]   Dong JB, Wang SC, Feng F, Cao Z, Pan H, Tang LB, Li PC, Li H, Ran QY, Guo YQ, Gao SY, Long X, Zhang J, Li Y, Xia ZS, Song
                      LYH, Zhang YY, Pan P, Wang GH, Jiang XW. ACCL: Architecting highly scalable distributed training systems with highly efficient
                      collective communication library. IEEE Micro, 2021, 41(5): 85–92. [doi: 10.1109/MM.2021.3091475]
                 [89]   Jiang YM, Zhu YB, Lan C, Yi BR, Cui Y, Guo CX. A unified architecture for accelerating distributed DNN training in heterogeneous
                      GPU/CPU clusters. In: Proc. of the 14th USENIX Symp. on Operating Systems Design and Implementation. USENIX, 2020. 463–479.
                 [90]   Sapio A, Canini M, Ho CY, Nelson J, Kalnis P, Kim C, Krishnamurthy A, Moshref M, Ports DRK, Richtárik P. Scaling distributed
                       machine learning with in-network aggregation. In: Proc. of the 18th USENIX Symp. on Networked Systems Design and
                       Implementation. USENIX, 2021. 785–808.
                 [91]   Lao C, Le YF, Mahajan K, Chen YX, Wu WF, Akella A, Swift MM. ATP: In-network aggregation for multi-tenant learning. In: Proc. of
                      the 18th USENIX Symp. on Networked Systems Design and Implementation. USENIX, 2021. 741–761.
                 [92]   Zhang R, Xiao WC, Zhang HY, Liu Y, Lin HX, Yang M. An empirical study on program failures of deep learning jobs. In: Proc. of the
                      42nd Int’l Conf. on Software Engineering. Seoul: ACM, 2020. 1159–1170. [doi: 10.1145/3377811.3380362]
                 [93]   Gao YJ, Shi XX, Lin HX, Zhang HY, Wu H, Li R, Yang M. An empirical study on quality issues of deep learning platform. In: Proc. of
                      the 45th Int’l Conf. on Software Engineering: Software Engineering in Practice. Melbourne: IEEE, 2023. 455–466. [doi: 10.1109/ICSE-
                      SEIP58684.2023.00052]
                 [94]   Gao YJ, Li ZX, Lin HX, Zhang HY, Wu M, Yang M. Refty: Refinement types for valid deep learning models. In: Proc. of the 44th Int’l
                      Conf. on Software Engineering. Pittsburgh: ACM, 2022. 1843–1855. [doi: 10.1145/3510003.3510077]
                 [95]   Gao YJ, Gu XY, Zhang HY, Lin HX, Yang M. Runtime performance prediction for deep learning models with graph neural network. In:
                       Proc. of the 45th Int’l Conf. on Software Engineering: Software Engineering in Practice. Melbourne: IEEE, 2023. 368–380.
                       [doi: 10.1109/ICSE-SEIP58684.2023.00039]
                 [96]   Mei HQ, Qu HZ, Sun JW, Gao YJ, Lin HX, Sun GZ. GPU occupancy prediction of deep learning models using graph neural network.
                      In: Proc. of the 25th IEEE Int’l Conf. on Cluster Computing. Santa Fe: IEEE, 2023. 318–329. [doi: 10.1109/CLUSTER52292.2023.
                      00034]
                 [97]   Zhu HY, Phanishayee A, Pekhimenko G. Daydream: Accurately estimating the efficacy of optimizations for DNN training. In: Proc. of
                      the 2020 USENIX Annual Technical Conf. USENIX, 2020. 337–352.
                 [98]   OpenAI. Scaling Kubernetes to 7,500 nodes. 2021. https://openai.com/index/scaling-kubernetes-to-7500-nodes/
                 [99]   Xiong YF, Jiang YT, Yang ZY, Qu L, Zhao GS, Liu SG, Zhong D, Pinzur B, Zhang J, Wang Y, Jose J, Pourreza H, Baxter J, Datta K,
                      Ram P, Melton L, Chau J, Cheng P, Xiong YQ, Zhou LD. SuperBench: Improving cloud AI infrastructure reliability with proactive
                      validation. In: Proc. of the 2024 USENIX Annual Technical Conf. Santa Clara: USENIX, 2024. 835–850.
                 [100]   Rojas E, Kahira AN, Meneses E, Gomez LB, Badia RM. A study of checkpointing in large scale training of deep neural networks.
                      arXiv:2012.00825, 2020.
                 [101]   Wang Z, Jia Z, Zheng S, Zhang Z, Fu XW, Ng TSE, Wang YD. Gemini: Fast failure recovery in distributed training with in-memory
                      checkpoints. In: Proc. of the 29th Symp. on Operating Systems Principles. Koblenz: ACM, 2023. 364–381. [doi: 10.1145/3600006.
                      3613145]
                 [102]   Nicolae B, Li JL, Wozniak JM, Bosilca G, Dorier M, Cappello F. DeepFreeze: Towards scalable asynchronous checkpointing of deep
                       learning models. In: Proc. of the 20th IEEE/ACM Int’l Symp. on Cluster, Cloud and Internet Computing. Melbourne: IEEE, 2020.
                       172–181. [doi: 10.1109/CCGrid49817.2020.00-76]
                 [103]   Mohan J, Phanishayee A, Chidambaram V. CheckFreq: Frequent, fine-grained DNN checkpointing. In: Proc. of the 19th USENIX Conf.
                      on File and Storage Technologies. USENIX, 2021. 203–216.
                 [104]   Eisenman A, Matam KK, Ingram S, Mudigere D, Krishnamoorthi R, Nair K, Smelyanskiy M, Annavaram M. Check-N-Run: A