Full-text indexing for optimizing selection operations in large-scale data analytics

HPDC (2011)

Abstract
MapReduce, especially the Hadoop open-source implementation, has recently emerged as a popular framework for large-scale data analytics. Given the explosion of unstructured data begotten by social media and other web-based applications, we take the position that any modern analytics platform must support operations on free-text fields as first-class citizens. Toward this end, this paper addresses one inefficient aspect of Hadoop-based processing: the need to perform a full scan of the entire dataset, even in cases where it is clearly not necessary to do so. We show that it is possible to leverage a full-text index to optimize selection operations on text fields within records. The idea is simple and intuitive: the full-text index informs the Hadoop execution engine which compressed data blocks contain query terms of interest, and only those data blocks are decompressed and scanned. Experiments with a proof of concept show moderate improvements in end-to-end query running times and substantial savings in terms of cumulative processing time at the worker nodes. We present an analytical model and discuss a number of interesting challenges: some operational, others research in nature.
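The block-skipping idea described in the abstract can be illustrated with a minimal single-process sketch. This is not the paper's Hadoop implementation: the function names (`build_blocks`, `build_index`, `select`), the whitespace tokenizer, the fixed number of records per block, and the use of `zlib` for compression are all assumptions made for illustration. It also assumes records contain no embedded newlines.

```python
import zlib
from collections import defaultdict


def tokenize(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_blocks(records, block_size):
    """Group records into blocks of block_size and compress each block."""
    blocks = []
    for start in range(0, len(records), block_size):
        chunk = records[start:start + block_size]
        blocks.append(zlib.compress("\n".join(chunk).encode("utf-8")))
    return blocks


def build_index(blocks):
    """Build a term -> set of block ids mapping (a block-level inverted index)."""
    index = defaultdict(set)
    for block_id, blob in enumerate(blocks):
        for line in zlib.decompress(blob).decode("utf-8").split("\n"):
            for term in tokenize(line):
                index[term].add(block_id)
    return index


def select(blocks, index, term):
    """Return records containing term, decompressing only the indexed blocks.

    Also returns how many blocks were decompressed, so the saving over a
    full scan is visible.
    """
    term = term.lower()
    candidate_ids = sorted(index.get(term, ()))
    matches = []
    for block_id in candidate_ids:
        text = zlib.decompress(blocks[block_id]).decode("utf-8")
        for line in text.split("\n"):
            if term in tokenize(line):
                matches.append(line)
    return matches, len(candidate_ids)
```

In this sketch the index only narrows the set of blocks; each selected block is still scanned record by record, which mirrors the abstract's description that only the relevant blocks are "decompressed and scanned". A query term that appears in few blocks skips most of the decompression work, while a term that appears in every block gains nothing over a full scan.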
Keywords
modern analytics platform, cumulative processing time, hadoop open-source implementation, unstructured data begotten, full-text index, hadoop-based processing, large-scale data analytics, optimizing selection operation, end-to-end query, data block, full-text indexing, hadoop execution engine, design, web based applications, proof of concept, social media, cumulant, algorithms