Journal of Software (软件学报), 2020, Issue 11

Journal of Software (软件学报) ISSN 1000-9825, CODEN RUXUEW                 E-mail: jos@iscas.ac.cn
                 Journal of Software,2020,31(11):3540−3558 [doi: 10.13328/j.cnki.jos.005843]   http://www.jos.org.cn
                 © Institute of Software, Chinese Academy of Sciences. All rights reserved.   Tel: +86-10-62562563


                 Sampling-based Collection and Updating of Online Big Graph Data ∗

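                 YIN Zi-Du¹,  YUE Kun¹,  ZHANG Bin-Bin¹,  LI Jin²
                 1 (School of Information Science and Engineering, Yunnan University, Kunming 650500, China)
                 2 (School of Software, Yunnan University, Kunming 650500, China)
                 Corresponding author: YUE Kun, E-mail: kyue@ynu.edu.cn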

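                 Abstract:  On the Internet, the large volume of unstructured data carried by Web pages, social media, and knowledge
                 bases can be represented as an online big graph (OBG). Acquiring OBG data, which comprises data collection and
                 updating, is an important foundation for big data analysis and knowledge engineering, but it faces challenges such as
                 large data volume, wide distribution, heterogeneity, and rapid change. Based on sampling techniques, this paper
                 proposes a parallel and adaptive method for collecting and updating OBG data. First, by combining the
                 branch-and-bound method with the quasi-Monte Carlo sampling technique, the HD-QMC algorithm is proposed to
                 collect OBG data adaptively. Then, so that the collected data reflect the dynamic changes of OBGs in practice, the
                 EPP algorithm is proposed, based on information entropy and the Poisson process, to update OBG data efficiently.
                 The effectiveness of the algorithms is analyzed theoretically, and the various kinds of acquired OBG data are
                 represented uniformly as RDF triples, providing a convenient, easy-to-use data foundation for OBG analysis and
                 related research. The collection and updating algorithms are implemented on Spark, and experimental results on
                 synthetic and real data show the effectiveness and efficiency of the method.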
                 Key words:  online big graph; data collection; data updating; parallel crawler; Spark
                 Chinese Library Classification (CLC) number: TP311
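The abstract names information entropy and the Poisson process as the basis of the updating step. The following is only a minimal sketch of how those two ingredients are commonly combined, not the EPP algorithm itself, whose details are not given on this page. It assumes that changes to a page arrive as a Poisson process with a known rate, so the probability of at least one change after an elapsed time t is 1 − exp(−rate · t), and it uses the binary entropy of that probability as an uncertainty score. The function names, the example pages, and the ranking rule are all hypothetical.

```python
import math


def change_probability(rate: float, elapsed: float) -> float:
    """P(at least one change within `elapsed`) for a Poisson process with the given rate."""
    return 1.0 - math.exp(-rate * elapsed)


def binary_entropy(p: float) -> float:
    """Shannon entropy, in bits, of a changed/unchanged outcome with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def rank_for_refresh(pages):
    """Order (name, rate, elapsed) records by descending uncertainty about their state.

    Assumption for illustration only: the page whose changed/unchanged state is
    least predictable is revisited first.
    """
    scored = [(binary_entropy(change_probability(rate, elapsed)), name)
              for name, rate, elapsed in pages]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical pages: (name, changes per day, days since the last visit)
pages = [("news", 2.0, 0.35), ("wiki", 0.1, 1.0), ("archive", 0.001, 1.0)]
order = rank_for_refresh(pages)
print(order)
```

Under this scoring, a page that has almost certainly changed receives a low entropy just like one that has almost certainly not, so a practical policy would likely combine the entropy with the change probability itself rather than rank on entropy alone.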

                 Chinese citation format:  尹子都,岳昆,张彬彬,李劲.基于采样的在线大图数据收集和更新.软件学报,2020,31(11):3540−3558.
                 http://www.jos.org.cn/1000-9825/5843.htm
                 English citation format: Yin ZD, Yue K, Zhang BB, Li J. Sampling-based collection and updating of online big graph data. Ruan Jian Xue
                 Bao/Journal of Software, 2020,31(11):3540−3558 (in Chinese). http://www.jos.org.cn/1000-9825/5843.htm

                 Sampling-based Collection and Updating of Online Big Graph Data
                 YIN Zi-Du¹,  YUE Kun¹,  ZHANG Bin-Bin¹,  LI Jin²
                 1 (School of Information Science and Engineering, Yunnan University, Kunming 650500, China)
                 2 (School of Software, Yunnan University, Kunming 650500, China)
                 Abstract:    The large volume of unstructured data carried by Web pages, social media, and knowledge bases on the Internet can be
                 represented as an online big graph (OBG). OBG data acquisition, which comprises data collection and updating, is the basis of massive
                 data analysis and knowledge engineering, but it is challenged by the large scale, wide distribution, heterogeneity, and fast-changing
                 nature of the data. In this study, a method for adaptive and parallel data collection and updating is proposed based on sampling
                 techniques. First, the HD-QMC algorithm is given for adaptive collection of OBG data by combining the branch-and-bound method
                 with the quasi-Monte Carlo sampling technique. Next, the EPP algorithm is given for efficient data updating based on entropy and the
                 Poisson process, so that the collected data reflect the dynamic changes of OBGs in real-world environments. Further, the effectiveness
                 of the proposed algorithms is analyzed theoretically, and the various kinds of collected OBG data are uniformly represented as triples
                 to provide an easy-to-use data foundation for OBG analysis and relevant studies. Finally, the proposed algorithms for data collection
                 and updating are

                   ∗  Foundation item:  National Natural Science Foundation of China (U1802271, 62002311); Science Foundation for Distinguished
                 Young Scholars of Yunnan Province (2019FJ011); Young Talent Support Program of Yunnan Province (C6193032); Donglu Scholars
                 Training Program of Yunnan University
                      Received 2018-10-25;  Revised 2018-12-08, 2019-01-16;  Accepted 2019-03-26