Page 183 - Journal of Software, 2021, Issue 7
Zhang CB, et al.: A survey of fault-tolerance techniques for distributed graph computing jobs 2101
[40] Salfner F, Lenk M, Malek M. A survey of online failure prediction methods. ACM Computing Surveys, 2010,42(3):1–42.
[41] Wang Z, Gu Y, Bao Y, Yu G, Gao L. An I/O-efficient and adaptive fault-tolerant framework for distributed graph computations.
Distributed and Parallel Databases, 2017,35(2):177–196.
[42] Jhawar R, Piuri V, Santambrogio M. A comprehensive conceptual system-level approach to fault tolerance in cloud computing. In:
Proc. of the 2012 IEEE Int’l Systems Conf. (SysCon 2012). Vancouver, 2012. 1–5.
[43] Isard M, Budiu M, Yu Y, Birrell A, Fetterly D. Dryad: Distributed data-parallel programs from sequential building blocks. In: Proc.
of the 2nd ACM SIGOPS/EuroSys European Conf. on Computer Systems. New York: Association for Computing Machinery, 2007.
59–72.
[44] Power R, Li J. Piccolo: Building fast, distributed programs with partitioned tables. In: Proc. of the 9th USENIX Conf. on Operating
Systems Design and Implementation. 2010. 293–306.
[45] Carbone P, Fóra G, Ewen S, Haridi S, Tzoumas K. Lightweight asynchronous snapshots for distributed dataflows. arXiv Preprint
arXiv: 1506.08603, 2015.
[46] Garg R, Kumar P. A review of checkpointing fault tolerance techniques in distributed mobile systems. Int’l Journal on Computer
Science and Engineering, 2010,2(4):1052–1063.
[47] Bi YH, Jiang SY, Wang ZG, Leng FL, Bao YB, Yu G, Qian L. A multi-level fault tolerance mechanism for disk-resident Pregel-like
systems. Journal of Computer Research and Development, 2016,53(11):2530–2541.
[48] Yan D, Cheng J, Chen H, Long C, Bangalore P. Lightweight fault tolerance in Pregel-like systems. In: Proc. of the 48th Int’l Conf.
on Parallel Processing. New York: Association for Computing Machinery, 2019. 1–10.
[49] Xue J, Yang Z, Qu Z, Hou S, Dai Y. Seraph: An efficient, low-cost system for concurrent graph processing. In: Proc. of the 23rd
Int’l Symp. on High-performance Parallel and Distributed Computing. 2014. 227–238.
[50] Xu C, Holzemer M, Kaul M, Soto J, Markl V. On fault tolerance for distributed iterative dataflow processing. IEEE Trans. on
Knowledge and Data Engineering, 2017,29(8):1709–1722.
[51] Xu C, Holzemer M, Kaul M, Markl V. Efficient fault-tolerance for iterative graph processing on distributed dataflow systems. In:
Proc. of the 32nd IEEE Int’l Conf. on Data Engineering (ICDE). 2016. 613–624.
[52] Vora K, Tian C, Gupta R, Hu Z. CoRAL: Confined recovery in distributed asynchronous graph processing. In: Proc. of the 22nd
Int’l Conf. on Architectural Support for Programming Languages and Operating Systems. 2017. 223–236.
[53] Elnozahy EN, Alvisi L, Wang YM, Johnson D. A survey of rollback-recovery protocols in message-passing systems. ACM
Computing Surveys, 2002,34(3):375–408.
[54] Lu W, Shen Y, Wang T, Zhang M, Jagadish HV, Du X. Fast failure recovery in vertex-centric distributed graph processing systems.
IEEE Trans. on Knowledge and Data Engineering, 2019,31(4):733–746.
[55] Shen Y, Chen G, Jagadish HV, Lu W, Ooi BC, Tudor BM. Fast failure recovery in distributed graph processing systems. Proc. of
the VLDB Endowment, 2014,8(4):437–448.
[56] Kaur J, Kinger S. Analysis of different techniques used for fault tolerance. Int’l Journal of Computer Science and Information
Technologies, 2014,5(3):4086–4090.
[57] Pundir M, Leslie LM, Gupta I, Campbell RH. Zorro: Zero-cost reactive failure recovery in distributed graph processing. In: Proc. of
the 6th ACM Symp. on Cloud Computing. New York: Association for Computing Machinery, 2015. 195–208.
[58] Wang P, Zhang K, Chen R, Chen H, Guan H. Replication-based fault-tolerance for large-scale graph processing. In: Proc. of the
44th Annual IEEE/IFIP Int’l Conf. on Dependable Systems and Networks. 2014. 562–573. [doi: 10.1109/DSN.2014.58]
[59] Chen R, Yao Y, Wang P, Zhang K, Guan H, Zang B, Chen H. Replication-based fault-tolerance for large-scale graph processing.
IEEE Trans. on Parallel and Distributed Systems, 2018,29(7):1621–1635.
[60] Presser D, Lung LC, Correia M. Greft: Arbitrary fault-tolerant distributed graph processing. In: Proc. of the 2015 IEEE Int’l
Congress on Big Data. New York, 2015. 452–459.
[61] Schelter S, Ewen S, Tzoumas K, Markl V. “All roads lead to Rome”: Optimistic recovery for distributed iterative data processing.
In: Proc. of the 22nd ACM Int’l Conf. on Information & Knowledge Management. 2013. 1919–1928.
[62] Marcotte P, Gregoire F, Petrillo F. Multiple fault-tolerance mechanisms in cloud systems: A systematic review. In: Proc. of the
2019 IEEE Int’l Symp. on Software Reliability Engineering Workshops (ISSREW). Berlin, 2019. 414–421.