AIR: Adaptive Index Replacement in Hadoop

ICDE Workshops (2015)

Cited by 14
Abstract
The Hadoop Distributed File System (HDFS) has become the de facto standard for storing large datasets in data management systems such as Hadoop MapReduce, Hive, and Stratosphere. Although HDFS was originally designed to support scan-oriented operations, several techniques have recently been developed to allow for efficient indexing in HDFS. One of these techniques is aggressive indexing, i.e. HDFS replicas are indexed at upload time, before touching any disk, creating multiple clustered indexes almost for free along the way. A second technique is adaptive indexing, i.e. HDFS blocks are indexed only on demand, as a side effect of query processing. Although these techniques provide impressive speed-ups in query processing, they completely ignore the cost of storing a large number of replicas of a particular dataset. The HDFS variants of adaptive indexing were designed to leverage the natural redundancy that comes with HDFS, which typically stores every dataset three times anyway. However, it is questionable whether storing an unlimited number of replicas of a dataset is practical. Therefore, this paper is the first to analyze adaptive indexing under a space constraint, i.e. we assume that indexes are adaptively created and deleted. We coin this problem the Adaptive Index Replacement (AIR) problem. We present a new algorithm for the online AIR problem, LeastExpectedBenefit-K (LEB-K), and compare it with several existing state-of-the-art online index selection algorithms in a comprehensive study evaluating ten different algorithms. Our results show that LEB-2 is efficient and robust, making it a good choice in practice.
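The abstract only sketches the LEB-K policy at a high level. The following minimal Java sketch illustrates the general shape of such a space-constrained, query-driven index replacement scheme; it assumes the expected benefit of an index is approximated by its attribute's access frequency over a sliding window of the K most recent queries. The class and method names (LebIndexCache, onQuery) and this benefit estimator are illustrative assumptions, not the paper's exact formulation.

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: the paper's actual LEB-K benefit estimator may differ.
class LebIndexCache {
    private final int capacity;                                // space budget: max number of indexes kept
    private final int windowSize;                              // K: length of the query-history window
    private final Deque<String> history = new ArrayDeque<>();  // attributes of recent queries
    private final Set<String> indexes = new HashSet<>();       // attributes currently indexed

    LebIndexCache(int capacity, int windowSize) {
        this.capacity = capacity;
        this.windowSize = windowSize;
    }

    /** Record a query on `attribute` and adaptively create/evict indexes. */
    void onQuery(String attribute) {
        history.addLast(attribute);
        if (history.size() > windowSize) {
            history.removeFirst();                             // keep only the last K queries
        }
        if (indexes.contains(attribute)) {
            return;                                            // index already exists: a hit
        }
        if (indexes.size() >= capacity) {
            evictLeastExpectedBenefit();                       // make room under the space constraint
        }
        indexes.add(attribute);                                // build the index as a side effect of the query
    }

    /** Evict the index whose attribute appears least often in the recent window. */
    private void evictLeastExpectedBenefit() {
        Map<String, Integer> frequency = new HashMap<>();
        for (String a : history) {
            frequency.merge(a, 1, Integer::sum);
        }
        String victim = indexes.stream()
                .min(Comparator.comparingInt(a -> frequency.getOrDefault(a, 0)))
                .orElseThrow();                                // safe: only called when indexes is non-empty
        indexes.remove(victim);
    }
}
```

In this sketch, a driver would call `cache.onQuery(...)` once per incoming query predicate, so index creation and eviction are both driven purely by the observed workload; in a real HDFS setting the cached unit would be an indexed block replica rather than a single attribute name.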