Ruan Jian Xue Bao/Journal of Software, 2021, No. 7

Zhang Chengbo et al.: A survey of fault tolerance techniques for distributed graph processing jobs                    2099


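                 continue supporting the implementation of these complex graph processing jobs, industry and academia have also
                 introduced a new class of distributed graph processing systems: graph neural network systems, such as Microsoft's
                 NeuGraph [72] and Alibaba's AliGraph [73], which are dedicated to the training and inference of large-scale graph
                 neural networks. This further confirms the trend toward increasingly complex graph processing jobs and places new
                 demands on job fault tolerance. Research on fault tolerance techniques for complex graph processing jobs is
                 therefore becoming increasingly urgent and important.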


                 References:
                 [1]    Sahu S, Mhedhbi A, Salihoglu S, Lin J, Özsu MT. The ubiquity of large graphs and surprising challenges of graph processing. Proc.
                     of the VLDB Endowment, 2017,11(4):420–431. [doi: 10.1145/3164135.3164139]
                 [2]    Sahu S, Mhedhbi A, Salihoglu S, Lin J, Özsu MT. The ubiquity of large graphs and surprising challenges of graph processing:
                     extended survey. The VLDB Journal, 2019,29(2-3):595–618. [doi: 10.1007/s00778-019-00548-x]
                 [3]    Salihoglu S, Özsu MT. Response to “scale up or scale out for graph processing”. IEEE Internet Computing, 2018,22(5):18–24.
                 [4]    Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G. Pregel: A system for large-scale graph
                     processing. In: Proc. of the 2010 ACM SIGMOD Int’l Conf. on Management of Data. New York: Association for Computing
                     Machinery, 2010. 135–146. [doi: 10.1145/1807167.1807184]
                 [5]    Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein JM. GraphLab: A new framework for parallel machine learning.
                     In: Proc. of the 26th Conf. on Uncertainty in Artificial Intelligence. Catalina Island, 2010. 340–349.
                 [6]    Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, Hellerstein JM. Distributed GraphLab: A framework for machine learning
                     and data mining in the cloud. Proc. of the VLDB Endowment, 2012,5(8):716–727. [doi: 10.14778/2212351.2212354]
                 [7]    Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C. PowerGraph: Distributed graph-parallel computation on natural graphs. In:
                     Proc. of the 10th USENIX Conf. on Operating Systems Design and Implementation. USENIX Association, 2012. 17–30.
                 [8]    Giraph. http://giraph.apache.org
                 [9]    Gonzalez JE, Xin RS, Dave A, Crankshaw D, Franklin MJ, Stoica I. GraphX: Graph processing in a distributed dataflow framework.
                     In: Proc. of the 11th USENIX Conf. on Operating Systems Design and Implementation. USENIX Association, 2014. 599–613.
                [10]    Gelly. http://flink.iteblog.com/dev/libs/gelly
                [11]    Egwutuoha IP, Levy D, Selic B, Chen S. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high
                     performance computing systems. The Journal of Supercomputing, 2013,65(3):1302–1326. [doi: 10.1007/s11227-013-0884-0]
                [12]    Novaković D, Vasić N, Novaković S, Kostić D, Bianchini R. DeepDive: Transparently identifying and managing performance
                     interference in virtualized environments. In: Proc. of the 2013 USENIX Conf. on Annual Technical Conf. USENIX Association,
                     2013. 219–230.
                [13]    Wang KJ, Jia T, Li Y. State-of-the-art survey of scheduling and resource management technology for colocation jobs. Ruan Jian
                     Xue Bao/Journal of Software, 2020,31(10):3100–3119 (in Chinese with English abstract).
                     http://www.jos.org.cn/1000-9825/6066.htm [doi: 10.13328/j.cnki.jos.006066]
                [14]    Fried J, Ruan Z, Ousterhout A, Belay A. Caladan: Mitigating interference at microsecond timescale. In: Proc. of the 14th USENIX
                     Symp. on Operating Systems Design and Implementation. USENIX Association, 2020. 281–297.
                [15]    Bartlett J, Gray J, Horst B. Fault tolerance in Tandem computer systems. In: The Evolution of Fault-tolerant Computing. 1987,
                     55–76.
                [16]    Bartlett W, Spainhower L. Commercial fault tolerance: A tale of two systems. IEEE Trans. on Dependable and Secure Computing,
                     2004,1(1):87–96. [doi: 10.1109/TDSC.2004.4]
                [17]    Borg A, Blau W, Graetsch W, Herrmann F, Oberle W. Fault tolerance under UNIX. ACM Trans. on Computer Systems, 1989,
                     7(1):1–24. [doi: 10.1145/58564.58565]
                [18]    Zhong H, Nieh J. CRAK: Linux checkpoint/restart as a kernel module. 2001.
                     http://systems.cs.columbia.edu/files/wpid-cucs-014-01.pdf
                [19]    Agarwal S, Garg R, Gupta MS, Moreira JE. Adaptive incremental checkpointing for massively parallel systems. In: Proc. of the
                     18th Annual Int’l Conf. on Supercomputing. New York: Association for Computing Machinery, 2004. 277–286.
                     [doi: 10.1145/1006209.1006248]