                 [84]   Romero J, Yin JQ, Laanait N, Xie B, Young MT, Treichler S, Starchenko V, Borisevich AY, Sergeev A, Matheson MA. Accelerating
                      collective communication in data parallel training across deep learning frameworks. In: Proc. of the 19th USENIX Symp. on Networked
                      Systems Design and Implementation. Renton: USENIX, 2022. 1027–1040.
                 [85]   Faghri F, Tabrizian I, Markov I, Alistarh D, Roy DM, Ramezani-Kebrya A. Adaptive gradient quantization for data-parallel SGD. In:
                      Proc. of the 34th Int’l Conf. on Neural Information Processing Systems. Vancouver: ACM, 2020. 267.
                 [86]   Bian S, Li DC, Wang HY, Xing EP, Venkataraman S. Does compressing activations help model parallel training? In: Proc. of the 7th
                      Annual Conf. on Machine Learning and Systems. Santa Clara: MLSys, 2024. 239–252.
                 [87]   Bayatpour M, Sarkauskas N, Subramoni H, Hashmi JM, Panda DK. BluesMPI: Efficient MPI non-blocking alltoall offloading designs
                      on modern BlueField smart NICs. In: Proc. of the 36th Int’l Conf. on High Performance Computing. Virtual Event: Springer, 2021.
                      18–37. [doi: 10.1007/978-3-030-78713-4_2]
                 [88]   Dong JB, Wang SC, Feng F, Cao Z, Pan H, Tang LB, Li PC, Li H, Ran QY, Guo YQ, Gao SY, Long X, Zhang J, Li Y, Xia ZS, Song
                      LYH, Zhang YY, Pan P, Wang GH, Jiang XW. ACCL: Architecting highly scalable distributed training systems with highly efficient
                      collective communication library. IEEE Micro, 2021, 41(5): 85–92. [doi: 10.1109/MM.2021.3091475]
                 [89]   Jiang YM, Zhu YB, Lan C, Yi BR, Cui Y, Guo CX. A unified architecture for accelerating distributed DNN training in heterogeneous
                      GPU/CPU clusters. In: Proc. of the 14th USENIX Symp. on Operating Systems Design and Implementation. USENIX, 2020. 463–479.
                 [90]   Sapio A, Canini M, Ho CY, Nelson J, Kalnis P, Kim C, Krishnamurthy A, Moshref M, Ports DRK, Richtárik P. Scaling distributed
                       machine learning with in-network aggregation. In: Proc. of the 18th USENIX Symp. on Networked Systems Design and
                       Implementation. USENIX, 2021. 785–808.
                 [91]   Lao C, Le YF, Mahajan K, Chen YX, Wu WF, Akella A, Swift MM. ATP: In-network aggregation for multi-tenant learning. In: Proc. of
                      the 18th USENIX Symp. on Networked Systems Design and Implementation. USENIX, 2021. 741–761.
                 [92]   Zhang R, Xiao WC, Zhang HY, Liu Y, Lin HX, Yang M. An empirical study on program failures of deep learning jobs. In: Proc. of the
                      42nd Int’l Conf. on Software Engineering. Seoul: ACM, 2020. 1159–1170. [doi: 10.1145/3377811.3380362]
                 [93]   Gao YJ, Shi XX, Lin HX, Zhang HY, Wu H, Li R, Yang M. An empirical study on quality issues of deep learning platform. In: Proc. of
                      the 45th Int’l Conf. on Software Engineering: Software Engineering in Practice. Melbourne: IEEE, 2023. 455–466. [doi: 10.1109/ICSE-
                      SEIP58684.2023.00052]
                 [94]   Gao YJ, Li ZX, Lin HX, Zhang HY, Wu M, Yang M. Refty: Refinement types for valid deep learning models. In: Proc. of the 44th Int’l
                      Conf. on Software Engineering. Pittsburgh: ACM, 2022. 1843–1855. [doi: 10.1145/3510003.3510077]
                 [95]   Gao YJ, Gu XY, Zhang HY, Lin HX, Yang M. Runtime performance prediction for deep learning models with graph neural network. In:
                       Proc. of the 45th Int’l Conf. on Software Engineering: Software Engineering in Practice. Melbourne: IEEE, 2023. 368–380.
                       [doi: 10.1109/ICSE-SEIP58684.2023.00039]
                 [96]   Mei HQ, Qu HZ, Sun JW, Gao YJ, Lin HX, Sun GZ. GPU occupancy prediction of deep learning models using graph neural network.
                      In: Proc. of the 25th IEEE Int’l Conf. on Cluster Computing. Santa Fe: IEEE, 2023. 318–329. [doi: 10.1109/CLUSTER52292.2023.
                      00034]
                 [97]   Zhu HY, Phanishayee A, Pekhimenko G. Daydream: Accurately estimating the efficacy of optimizations for DNN training. In: Proc. of
                      the 2020 USENIX Annual Technical Conf. USENIX, 2020. 337–352.
                 [98]   OpenAI. Scaling Kubernetes to 7,500 nodes. 2021. https://openai.com/index/scaling-kubernetes-to-7500-nodes/
                 [99]   Xiong YF, Jiang YT, Yang ZY, Qu L, Zhao GS, Liu SG, Zhong D, Pinzur B, Zhang J, Wang Y, Jose J, Pourreza H, Baxter J, Datta K,
                      Ram P, Melton L, Chau J, Cheng P, Xiong YQ, Zhou LD. SuperBench: Improving cloud AI infrastructure reliability with proactive
                      validation. In: Proc. of the 2024 USENIX Annual Technical Conf. Santa Clara: USENIX, 2024. 835–850.
                 [100]   Rojas E, Kahira AN, Meneses E, Gomez LB, Badia RM. A study of checkpointing in large scale training of deep neural networks.
                      arXiv:2012.00825, 2020.
                 [101]   Wang Z, Jia Z, Zheng S, Zhang Z, Fu XW, Ng TSE, Wang YD. Gemini: Fast failure recovery in distributed training with in-memory
                      checkpoints. In: Proc. of the 29th Symp. on Operating Systems Principles. Koblenz: ACM, 2023. 364–381. [doi: 10.1145/3600006.
                      3613145]
                 [102]   Nicolae B, Li JL, Wozniak JM, Bosilca G, Dorier M, Cappello F. DeepFreeze: Towards scalable asynchronous checkpointing of deep
                       learning models. In: Proc. of the 20th IEEE/ACM Int’l Symp. on Cluster, Cloud and Internet Computing. Melbourne: IEEE, 2020.
                       172–181. [doi: 10.1109/CCGrid49817.2020.00-76]
                 [103]   Mohan J, Phanishayee A, Chidambaram V. CheckFreq: Frequent, fine-grained DNN checkpointing. In: Proc. of the 19th USENIX Conf.
                      on File and Storage Technologies. USENIX, 2021. 203–216.
                 [104]   Eisenman A, Matam KK, Ingram S, Mudigere D, Krishnamoorthi R, Nair K, Smelyanskiy M, Annavaram M. Check-N-Run: A