Journal of Software (软件学报), 2021, Issue 7
Journal of Software (Ruan Jian Xue Bao) ISSN 1000-9825, CODEN RUXUEW E-mail: jos@iscas.ac.cn
Journal of Software, 2021, 32(7): 2078-2102 [doi: 10.13328/j.cnki.jos.006269] http://www.jos.org.cn
© Institute of Software, Chinese Academy of Sciences. All rights reserved. Tel: +86-10-62562563
A Survey of Fault-Tolerance Techniques for Distributed Graph Processing Jobs
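ZHANG Cheng-Bo¹,², LI Ying¹, JIA Tong³
1 (School of Software and Microelectronics, Peking University, Beijing 102600, China)
2 (National Engineering Research Center for Software Engineering, Peking University, Beijing 100871, China)
3 (School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China)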
Corresponding author: LI Ying, E-mail: li.ying@pku.edu.cn
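Abstract: As graph data grows ever larger and graph processing jobs grow ever more complex, distributing graph
processing has become an inevitable trend. However, graph processing jobs face severe reliability problems at run time,
caused by non-determinism from various sources inside and outside the distributed graph processing system. This survey
first analyzes the uncertainty factors in distributed graph processing frameworks and the robustness of different types of
graph processing jobs, and proposes a framework for evaluating fault-tolerance techniques for distributed graph
processing jobs along three dimensions: cost, efficiency, and quality. It then analyzes, evaluates, and compares four
fault-tolerance mechanisms for distributed graph processing (checkpoint-based, log-based, replication-based, and
algorithmic-compensation-based fault tolerance) in light of related work in China and abroad. Finally, it outlines
future research directions.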
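Keywords: graph data; faults and failures; distributed graph processing; fault-tolerance mechanisms; non-deterministic software systems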
Chinese Library Classification: TP311
Citation format (Chinese): 张程博,李影,贾统.面向分布式图计算作业的容错技术研究综述.软件学报,2021,32(7):2078-2102.
http://www.jos.org.cn/1000-9825/6269.htm
Citation format (English): Zhang CB, Li Y, Jia T. Survey of state-of-the-art fault tolerance for distributed graph processing jobs. Ruan Jian
Xue Bao/Journal of Software, 2021,32(7):2078-2102 (in Chinese). http://www.jos.org.cn/1000-9825/6269.htm
Survey of State-of-the-art Fault Tolerance for Distributed Graph Processing Jobs
ZHANG Cheng-Bo¹,², LI Ying¹, JIA Tong³
1 (School of Software and Microelectronics, Peking University, Beijing 102600, China)
2 (National Engineering Research Center for Software Engineering, Peking University, Beijing 100871, China)
3 (School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China)
Abstract: As graph data grows in scale and graph processing jobs grow in complexity, distributed graph processing has become
inevitable. However, graph processing jobs suffer from severe reliability problems caused by uncertainty originating both inside and
outside the distributed graph processing system. This study first analyzes the uncertainty factors in distributed graph processing
frameworks and the robustness of different types of graph processing jobs, and then proposes a framework for evaluating fault tolerance
in distributed graph processing in terms of cost, efficiency, and quality. It also analyzes, evaluates, and compares four
fault-tolerance mechanisms for distributed graph processing (checkpoint-based, log-based, replication-based, and
algorithmic-compensation-based fault tolerance) in light of related research. Finally, directions for future research are
discussed.
Key words: graph data; fault and failure; distributed graph processing; fault tolerance; non-deterministic software system
Foundation item: Key-Area Research and Development Program of Guangdong Province, China (2020B010164003)
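This paper was recommended by the guest editors of the special issue "Software Quality Assurance Methods and Techniques for
Non-determinism": Associate Professor CHEN Jun-Jie, Associate Professor TANG En-Yi, Associate Professor HE Xiao, and Professor MA Xiao-Xing.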
Received: 2020-09-15; Revised: 2020-10-26; Accepted: 2020-12-14; Published online by JOS: 2021-01-22