Page 267 - 《软件学报》2021年第11期
P. 267
杨良怀 等:面向大数据流的分布式索引构建 3593
6 结束语
本文面向完整保存数据流的应用,结合数据流的特点,提出双层分布式索引:WB-Index.在保证索引构建效
率的同时,通过高效缓存保障查询效率;通过负载均衡策略和上层索引冗余机制,保证索引系统的可用性和稳定
性.针对分布式索引构建:采用时间窗口的方式切分数据流,基于窗口内流元组构建下层索引,采用高度并行化、
批量化的方式加速构建性能;上层索引基于所有时间窗口构建,考虑到时间窗口时间戳的递增性,提出避免分裂
的 B+树插入策略,减少 B+树分裂移动开销,提高空间利用率,并提出预分配策略提高数据插入的性能.通过实验
验证分布式索引的构建性能,相比起 CG-index 和 LSM 存储模型均有很大的性能优势,适用于高速数据流场景.
今后将进一步研究双层分布式索引的持久化方法.
致谢 感谢阿里巴巴数据库事业部为本文的研究提供了实验所需的软硬件资源,感谢彭立勋先生为本研究提
供诸多讨论和帮助.
References:
[1] Golab L, Özsu MT. Issues in data stream management. ACM SIGMOD Record, 2003,32(2):5−14.
[2] Xie J, Yang J. A survey of join processing in data streams. In: Data Streams: Models and Algorithms. 2007. 209−236.
[3] Acharya S, Gibbons PB, Poosala V, Ramaswamy S. Join synopses for approximate query answering. ACM SIGMOD Record, 1999,
28(2):275−286.
[4] Babcock B, Babu S, Datar M, Motwani R, Widom J. Models and issues in data streams. In: Proc. of the PODS Conf. 2002. 1−16.
[5] Arasu A, Babcock B, Babu S, Cieslewicz J, Datar M, Ito K, Motwani R, Srivastava U, Widom J. Stream: The Stanford data stream
management system. In: Proc. of the ACM SIGMOD Int’l Conf. on Management of Data. 2003.
[6] Carney D, Cetinternel U, Cherniack M, Convey C, Lee S, Seidman G, Stonebraker M, Tatbul N, Zdonik S. Monitoring streams: A
new class of dbms applications. In: Proc. of the VLDB Conf. 2002. 215−226.
[7] Krishnamurthy S, Chandrasekaran S, Cooper O, Deshpande A, Franklin MJ, Hellerstein JM, Hong W, Madden S, Reiss F, Shah
MA. TelegraphCQ: An architectural status report. Data Engineering Bulletin, 2003,26(1):11−18.
[8] Toshniwal A, Taneja S, Shukla A, Ramasamy K, Kulkarni S, Jackson J, Gade K, Fu MS, Donham J, Bhagat N, Mittal S, Ryaboy D.
Storm@twitter. In: Proc. of the VLDB Int’l Conf. on Management of Data. 2014. 147−156.
[9] Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K. Apache flink: Stream and batch processing in a single engine.
Data Engineering Bulletin, 2015,36(4):28−38.
[10] Zhang DD, Li JZ, Wang WP, Guo LJ. Algorithms for storing and aggregating historical streaming data. Ruan Jian Xue Bao/Journal
of Software, 2005,16(12):2089−2098 (in Chinese with English abstract). http://jos.org.cn/jos/article/abstract/20051207?st=article_
issue [doi: 10.1360/jos162089]
[11] Ge JW, Gong PQ, Liu ZH. Method of storing and indexing historical streaming data. Application Research of Computers, 2007,
24(6):104−106 (in Chinese with English abstract).
[12] Lahiri T, Neimat MA, Folkman S. Oracle TimesTen: An in-memory database for enterprise applications. Data Engineering Bulletin,
2013,36(2):6−13.
[13] Larson PÅ, Zwilling M, Farlee K. The Hekaton memory-optimized OLTP engine. Data Engineering Bulletin, 2013,36(2):34−40.
[14] Lindström J, Raatikka V, Ruuth J, Soini P, Vakkila K. IBM solidDB: In-memory database optimized for extreme speed and
availability. Data Engineering Bulletin, 2013,36(2):14−20.
[15] O’Neil P, Cheng E, Gawlick D, O’Neil E. The log-structured merge-tree (LSM-tree). Acta Informatica, 1996,33(4):351−385.
[16] Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Gruber RE, Chandra T, Fikes A, Gruber RE. Bigtable: A
distributed storage system for structured data. ACM Trans. on Computer Systems, 2008,26(2):1−26.
[17] Antaris S, Rafailidis D. In-memory stream indexing of massive and fast incoming multimedia content. IEEE Trans. on Big Data,
2018, 4(1):40−54.
[18] Ghemawat S, Gobioff H, Leung ST. The Google file system. In: Proc. of the ACM Symp. on Operating Systems Principles. ACM,
2003. 20−43.
[19] Lakshman A, Malik P. Cassandra: A decentralized structured storage system. ACM SIGOPS Operating Systems Review, 2010,
44(2):35−40.