...to continue supporting the implementation of these complex graph computing jobs, industry and academia have also introduced a new class of distributed graph computing systems, namely graph neural network systems, such as Microsoft's NeuGraph [72] and Alibaba's AliGraph [73], which are dedicated to the training and inference of large-scale graph neural networks. This further confirms the trend that graph computing jobs are becoming increasingly complex, and it also raises new requirements for job fault tolerance; research on fault-tolerance techniques for complex graph computing jobs is therefore becoming increasingly urgent and important.
References:
[1] Sahu S, Mhedhbi A, Salihoglu S, Lin J, Özsu MT. The ubiquity of large graphs and surprising challenges of graph processing. Proc.
of the VLDB Endowment, 2017,11(4):420–431. [doi: 10.1145/3164135.3164139]
[2] Sahu S, Mhedhbi A, Salihoglu S, Lin J, Özsu MT. The ubiquity of large graphs and surprising challenges of graph processing:
extended survey. The VLDB Journal, 2019,29(2-3):595–618. [doi: 10.1007/s00778-019-00548-x]
[3] Salihoglu S, Özsu MT. Response to “scale up or scale out for graph processing”. IEEE Internet Computing, 2018,22(5):18–24.
[4] Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G. Pregel: A system for large-scale graph
processing. In: Proc. of the 2010 ACM SIGMOD Int’l Conf. on Management of Data. New York: Association for Computing
Machinery, 2010. 135–146. [doi: 10.1145/1807167.1807184]
[5] Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein JM. GraphLab: A new framework for parallel machine learning.
In: Proc. of the 26th Conf. on Uncertainty in Artificial Intelligence. Catalina Island, 2010. 340–349.
[6] Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, Hellerstein JM. Distributed GraphLab: A framework for machine learning
and data mining in the cloud. Proc. of the VLDB Endowment, 2012,5(8):716–727. [doi: 10.14778/2212351.2212354]
[7] Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C. PowerGraph: Distributed graph-parallel computation on natural graphs. In:
Proc. of the 10th USENIX Conf. on Operating Systems Design and Implementation. USENIX Association, 2012. 17–30.
[8] Giraph. http://giraph.apache.org
[9] Gonzalez JE, Xin RS, Dave A, Crankshaw D, Franklin MJ, Stoica I. GraphX: Graph processing in a distributed dataflow framework.
In: Proc. of the 11th USENIX Conf. on Operating Systems Design and Implementation. USENIX Association, 2014. 599–613.
[10] Gelly. http://flink.iteblog.com/dev/libs/gelly
[11] Egwutuoha IP, Levy D, Selic B, Chen S. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high
performance computing systems. The Journal of Supercomputing, 2013,65(3):1302–1326. [doi: 10.1007/s11227-013-0884-0]
[12] Novaković D, Vasić N, Novaković S, Kostić D, Bianchini R. DeepDive: Transparently identifying and managing performance
interference in virtualized environments. In: Proc. of the 2013 USENIX Conf. on Annual Technical Conf. USENIX Association,
2013. 219–230.
[13] Wang KJ, Jia T, Li Y. State-of-the-art survey of scheduling and resource management technology for colocation jobs. Ruan Jian
Xue Bao/Journal of Software, 2020,31(10):3100–3119 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/6066.htm [doi: 10.13328/j.cnki.jos.006066]
[14] Fried J, Ruan Z, Ousterhout A, Belay A. Caladan: Mitigating interference at microsecond timescale. In: Proc. of the 14th USENIX
Symp. on Operating Systems Design and Implementation. USENIX Association, 2020. 281–297.
[15] Bartlett J, Gray J, Horst B. Fault tolerance in Tandem computer systems. In: The Evolution of Fault-tolerant Computing. 1987. 55–76.
[16] Bartlett W, Spainhower L. Commercial fault tolerance: A tale of two systems. IEEE Trans. on Dependable and Secure Computing,
2004,1(1):87–96. [doi: 10.1109/TDSC.2004.4]
[17] Borg A, Blau W, Graetsch W, Herrmann F, Oberle W. Fault tolerance under UNIX. ACM Trans. on Computer Systems, 1989,
7(1):1–24. [doi: 10.1145/58564.58565]
[18] Zhong H, Nieh J. CRAK: Linux checkpoint/restart as a kernel module. 2001. http://systems.cs.columbia.edu/files/wpid-cucs-014-01.pdf
[19] Agarwal S, Garg R, Gupta MS, Moreira JE. Adaptive incremental checkpointing for massively parallel systems. In: Proc. of the
18th Annual Int’l Conf. on Supercomputing. New York: Association for Computing Machinery, 2004. 277–286. [doi: 10.1145/1006209.1006248]