Compressing Intermediate Keys between Mappers and Reducers in SciHadoop

High Performance Computing, Networking, Storage and Analysis (2012)

Cited by 15
Abstract
In Hadoop, mappers send data to reducers in the form of key/value pairs. The default design of Hadoop's process for transmitting this intermediate data can cause very high overhead, especially for scientific data containing multiple variables in a multi-dimensional space. For example, for a 3D scalar field of a variable “windspeed1”, the size of the keys was 6.75 times the size of the values. Much of the disk and network bandwidth spent “shuffling” this intermediate data is consumed by repeatedly transmitting the variable name for each value. This significant waste of resources stems from an assumption fundamental to Hadoop's design: that all key/value pairs are independent. This assumption is inadequate for scientific data, which is often organized in regular grids, a structure that can be described in a small, constant size. Earlier we presented SciHadoop, a slightly modified version of Hadoop designed for processing scientific data. We reported on experiments with SciHadoop which confirm that the size of intermediate data has a significant impact on overall performance. Here we show preliminary designs of multiple lossless approaches to compressing intermediate data, one of which reduces the original key/value ratio by up to five orders of magnitude.
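To illustrate the overhead the abstract describes, the sketch below compares a naive per-point key, which repeats the variable name and a 3D coordinate for every single value, with a block key that describes a contiguous grid region once. This is a minimal, hypothetical illustration: the class and method names are invented, and SciHadoop's actual key encoding and compression schemes differ (the paper's 6.75x figure depends on its real serialization format).

```java
import java.nio.charset.StandardCharsets;

/**
 * Illustrative sketch (not SciHadoop's actual classes): compares the bytes
 * spent on per-point intermediate keys, which repeat the variable name for
 * every value, against a single constant-size key describing a regular grid
 * block whose values travel as one packed array.
 */
public class KeyOverheadSketch {

    // Naive intermediate keys: variable name plus a 3D coordinate per value.
    static long perPointKeyBytes(String variable, long numPoints) {
        long nameBytes = variable.getBytes(StandardCharsets.UTF_8).length;
        long coordBytes = 3L * Integer.BYTES;          // (x, y, z) per point
        return numPoints * (nameBytes + coordBytes);   // name repeated per point
    }

    // Block key: variable name, corner, and shape stored once for the region.
    static long blockKeyBytes(String variable) {
        long nameBytes = variable.getBytes(StandardCharsets.UTF_8).length;
        return nameBytes
                + 3L * Integer.BYTES   // corner coordinate of the block
                + 3L * Integer.BYTES;  // block shape (extent in each dimension)
    }

    public static void main(String[] args) {
        String variable = "windspeed1";
        long numPoints = 64L * 64 * 64;                // one 64^3 block of floats
        long valueBytes = numPoints * Float.BYTES;

        long naiveKeys = perPointKeyBytes(variable, numPoints);
        long blockKey = blockKeyBytes(variable);

        System.out.printf("values:         %,d bytes%n", valueBytes);
        System.out.printf("per-point keys: %,d bytes (%.2fx the values)%n",
                naiveKeys, (double) naiveKeys / valueBytes);
        System.out.printf("block key:      %,d bytes%n", blockKey);
    }
}
```

Under these assumptions the per-point keys are several times larger than the values themselves, while the block key is a small constant independent of the number of points, which is the intuition behind compressing intermediate keys for regular-grid scientific data.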
Keywords
original key, multiple lossless approach, compressing intermediate keys, multiple variable, hadoop mapper, value ratio, constant size, scientific data, value pair, intermediate data, default design, data compression, public domain software, distributed processing, data handling