RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems

Data Engineering (2011)

Abstract
MapReduce-based data warehouse systems play an important role in supporting big data analytics, helping typical Web service providers and social network sites (e.g., Facebook) quickly understand the dynamics of user behavior trends and user needs. In such a system, the data placement structure is a critical factor that can fundamentally affect warehouse performance. Based on our observations and analysis of Facebook production systems, we have characterized four requirements for the data placement structure: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to highly dynamic workload patterns. We have examined three commonly accepted data placement structures in conventional databases, namely row-stores, column-stores, and hybrid-stores, in the context of large data analysis using MapReduce. We show that they are not well suited to big data processing in distributed systems. In this paper, we present a big data placement structure called RCFile (Record Columnar File) and its implementation in the Hadoop system. Through extensive experiments, we show the effectiveness of RCFile in satisfying the four requirements. RCFile has been chosen as the default option in the Facebook data warehouse system. It has also been adopted by Hive and Pig, the two most widely used data analysis systems, developed at Facebook and Yahoo!, respectively.
Keywords
web service providers,mapreduce-based warehouse system,facebook production systems,yahoo!,mapreduce-based data warehouse system,distributed systems,big data,web services,data warehouses,storage space utilization,large data analysis,social network sites,record columnar file,data structures,hadoop system,data analysis,data processing,mapreduce-based warehouse systems,data analytics,data analysis system,data placement structure,fast data loading,big data placement structure,user behavior trends,big data analytics,facebook data warehouse system,fast query processing,space-efficient data placement structure,social networking (online),accepted data placement structure,rcfile,query processing,data warehouse,satisfiability,data storage,production system,distributed system,data handling,information management,web service,data compression