Page 183 - Journal of Software (《软件学报》), 2021, No. 7

Zhang Chengbo, et al.: A survey of fault-tolerance techniques for distributed graph computing jobs                    2101


                [40]    Salfner F, Lenk M, Malek M. A survey of online failure prediction methods. ACM Computing Surveys, 2010,42(3):1–42.
                [41]    Wang Z, Gu Y, Bao Y, Yu G, Gao L. An I/O-efficient and adaptive fault-tolerant framework for distributed graph computations.
                     Distributed and Parallel Databases, 2017,35(2):177–196.
                [42]    Jhawar R, Piuri V, Santambrogio M. A comprehensive conceptual system-level approach to fault tolerance in cloud computing. In:
                     Proc. of the 2012 IEEE Int’l Systems Conf. (SysCon 2012). Vancouver, 2012. 1–5.
                [43]    Isard M, Budiu M, Yu Y, Birrell A, Fetterly D. Dryad: Distributed data-parallel programs from sequential building blocks. In: Proc.
                     of the 2nd ACM SIGOPS/EuroSys European Conf. on Computer Systems. New York: Association for Computing Machinery, 2007.
                     59–72.
                [44]    Power R, Li J. Piccolo: Building fast, distributed programs with partitioned tables. In: Proc. of the 9th USENIX Conf. on Operating
                     Systems Design and Implementation. 2010. 293–306.
                [45]    Carbone P, Fóra G, Ewen S, Haridi S, Tzoumas K. Lightweight asynchronous snapshots for distributed dataflows. arXiv Preprint
                     arXiv:1506.08603, 2015.
                [46]    Garg R, Kumar P. A review of checkpointing fault tolerance techniques in distributed mobile systems. Int’l Journal on Computer
                     Science and Engineering, 2010,2(4):1052–1063.
                [47]    Bi YH, Jiang SY, Wang ZG, Leng FL, Bao YB, Yu G, Qian L. A multi-level fault tolerance mechanism for disk-resident Pregel-
                     like systems. Journal of Computer Research and Development, 2016,53(11):2530–2541.
                [48]    Yan D, Cheng J, Chen H, Long C, Bangalore P. Lightweight fault tolerance in Pregel-like systems. In: Proc. of the 48th Int’l Conf.
                     on Parallel Processing. New York: Association for Computing Machinery, 2019. 1–10.
                [49]    Xue J, Yang Z, Qu Z, Hou S, Dai Y. Seraph: An efficient, low-cost system for concurrent graph processing. In: Proc. of the 23rd
                     Int’l Symp. on High-performance Parallel and Distributed Computing. 2014. 227–238.
                [50]    Xu C, Holzemer M, Kaul M, Soto J, Markl V. On fault tolerance for distributed iterative dataflow processing. IEEE Trans. on
                     Knowledge and Data Engineering, 2017,29(8):1709–1722.
                [51]    Xu C, Holzemer M, Kaul M, Markl V. Efficient fault-tolerance for iterative graph processing on distributed dataflow systems. In:
                     Proc. of the 32nd IEEE Int’l Conf. on Data Engineering (ICDE). 2016. 613–624.
                [52]    Vora K, Tian C, Gupta R, Hu Z. CoRAL: Confined recovery in distributed asynchronous graph processing. In: Proc. of the 22nd
                     Int’l Conf. on Architectural Support for Programming Languages and Operating Systems. 2017. 223–236.
                [53]    Elnozahy EN, Alvisi L, Wang YM, Johnson D. A survey of rollback-recovery protocols in message-passing systems. ACM
                     Computing Surveys, 2002,34(3):375–408.
                [54]    Lu W, Shen Y, Wang T, Zhang M, Jagadish HV, Du X. Fast failure recovery in vertex-centric distributed graph processing systems.
                     IEEE Trans. on Knowledge and Data Engineering, 2019,31(4):733–746.
                [55]    Shen Y, Chen G, Jagadish HV, Lu W, Ooi BC, Tudor BM. Fast failure recovery in distributed graph processing systems. Proc. of
                     the VLDB Endowment, 2014,8(4):437–448.
                [56]    Kaur J, Kinger S. Analysis of different techniques used for fault tolerance. Int’l Journal of Computer Science and Information
                     Technologies, 2014,5(3):4086–4090.
                [57]    Pundir M, Leslie LM, Gupta I, Campbell RH. Zorro: Zero-cost reactive failure recovery in distributed graph processing. In: Proc. of
                     the 6th ACM Symp. on Cloud Computing. New York: Association for Computing Machinery, 2015. 195–208.
                [58]    Wang P, Zhang K, Chen R, Chen H, Guan H. Replication-based fault-tolerance for large-scale graph processing. In: Proc. of the
                     44th Annual IEEE/IFIP Int’l Conf. on Dependable Systems and Networks. 2014. 562–573. [doi: 10.1109/DSN.2014.58]
                [59]    Chen R, Yao Y, Wang P, Zhang K, Guan H, Zang B, Chen H. Replication-based fault-tolerance for large-scale graph processing.
                     IEEE Trans. on Parallel and Distributed Systems, 2018,29(7):1621–1635.
                [60]    Presser D, Lung LC, Correia M. Greft: Arbitrary fault-tolerant distributed graph processing. In: Proc. of the 2015 IEEE Int’l
                     Congress on Big Data. New York, 2015. 452–459.
                [61]    Schelter S, Ewen S, Tzoumas K, Markl V. “All roads lead to Rome”: Optimistic recovery for distributed iterative data processing.
                     In: Proc. of the 22nd ACM Int’l Conf. on Information & Knowledge Management. 2013. 1919–1928.
                [62]    Marcotte P, Gregoire F, Petrillo F. Multiple fault-tolerance mechanisms in cloud systems: A systematic review. In: Proc. of the
                     2019 IEEE Int’l Symp. on Software Reliability Engineering Workshops (ISSREW). Berlin, 2019. 414–421.