Page 250 - 《软件学报》2021年第11期
P. 250
软件学报 ISSN 1000-9825, CODEN RUXUEW E-mail: jos@iscas.ac.cn
Journal of Software,2021,32(11):3576−3595 [doi: 10.13328/j.cnki.jos.006097] http://www.jos.org.cn
©中国科学院软件研究所版权所有. Tel: +86-10-62562563
∗
面向大数据流的分布式索引构建
1
1
1
1
杨良怀 , 卢晨曦 , 范玉雷 , 朱镇洋 , 潘 建 2
1
(浙江工业大学 计算机学院,浙江 杭州 310023)
2
(浙江工业大学 之江学院,浙江 绍兴 312030)
通讯作者: 范玉雷, E-mail: fyl815@zjut.edu.cn
摘 要: 大数据流的高效存储与索引是当今数据领域的一大难点.面向带有时间属性的数据流,根据其时间属性,
将数据流划分为连续的时间窗口,提出了基于双层 B+树的分布式索引结构 WB-Index.下层 B+树索引基于窗口内流
数据构建,索引构建过程结合基于排序的批量构建技术,进一步对时间窗口分片,将数据流接收、分片数据排序以及
B+树构建并行化,提高了构建性能.上层 B+树索引基于各时间窗口构建,结合时间窗口时间戳的递增性和无限性,提
出了避免节点分裂的构建方法,减少了 B+树分裂移动开销,提高了空间利用率和更新效率.WB-Index 架构中,将流数
据和索引分离,同时利用内存缓存尽可能多的双层 B+索引和热点数据来提高查询性能.理论和实验结果表明,该分
布式索引架构能够支持高效的实时数据流写入以及流数据查询,能够很好地应用于具有时间属性的数据流场景.
关键词: 大数据;数据流;分布式索引;B+树
中图法分类号: TP311
中文引用格式: 杨良怀,卢晨曦,范玉雷,朱镇洋,潘建.面向大数据流的分布式索引构建.软件学报,2021,32(11):3576−3595.
http://www.jos.org.cn/1000-9825/6097.htm
英文引用格式: Yang LH, Lu CX, Fan YL, Zhu ZY, Pan J. Distributed index construction for big data streams. Ruan Jian Xue
Bao/Journal of Software, 2021,32(11):3576−3595 (in Chinese). http://www.jos.org.cn/1000-9825/6097.htm
Distributed Index Construction for Big Data Streams
1
1
1
1
YANG Liang-Huai , LU Chen-Xi , FAN Yu-Lei , ZHU Zhen-Yang , PAN Jian 2
1 (School of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China)
2 (Zhijiang College, Zhejiang University of Technology, Shaoxing 312030, China)
Abstract: Efficient storage and indexing of big data streams are challenging issues in the database field. By segmenting the temporal
data stream into continuous time windows, a distributed master-slave index structure is proposed based on double-layer B+ tree called
WB-Index. Lower B+ tree index is built on stream tuples in each time window. Upper B+ tree index is built on each successive time
window. Lower B+ tree index is constructed by combining both batch loading and parallel sorting techniques. The core idea of the
construction method is to slice the time window and isolate the parallelable operations from others in the time window. Sorting and data
stream receiving between slices work in parallel, while the B+ tree skeleton (a B+ tree without value) construction for the time window
and the merge-sorting operation are parallelized as well. These techniques effectively expedite the B+ tree construction. Due to the
monotonous increasement of timestamps of time windows, a split-less method for upper B+ tree index construction is adopted to avoid the
node splitting and memory movement overhead, and improve the space utilization and update efficiency. In WB-Index, data stream tuples
and index are separated, and index and hotspot data are cached as much as possible to improve query efficiency. Finally, theoretic analysis
and experiments have both demonstrated that WB-Index can support efficient real-time data stream writing and stream data querying.
Key words: big data; data stream; distributed index; B+ tree
∗ 基金项目: 国家重点研发计划(2020YFB1707700)
Foundation item: National Key Research and Development Program of China (2020YFB1707700)
收稿时间: 2019-10-29; 修改时间: 2019-12-25; 采用时间: 2020-02-21