Massive Data Load on Distributed Database Systems over HBase.

CCGrid(2017)

引用 22|浏览47
暂无评分
摘要
Big Data has become a pervasive technology to manage the ever-increasing volumes of data. Among Big Data solutions, scalable data stores play an important role, especially, key-value data stores due to their large scalability (thousands of nodes). The typical workflow for Big Data applications include two phases. The first one is to load the data into the data store typically as part of an ETL (Extract-Transform-Load) process. The second one is the processing of the data itself. BigTable and HBase are the preferred key-value solutions based on range-partitioned data stores. However, the loading phase is inefficient and creates a single node bottleneck. In this paper, we identify and quantify this bottleneck and propose a tool for parallel massive data loading that solves satisfactorily the bottleneck enabling all the parallelism and throughput of the underlying key-value data store during the loading phase as well. The proposed solution has been implemented as a tool for parallel massive data loading over HBase, the key-value data store of the Hadoop ecosystem.
更多
查看译文
关键词
HBase, MapReduce, HDFS, Split table, Massive Data load
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要