Ameliorating data compression and query performance through cracked Parquet.
ACM Conference on Management of Data (2022)
Abstract
In this paper, we propose to exploit synergy effects between partitioning and compression for Dremel-encoded nested data that serves as the data storage for Spark-style processing jobs. The encoding proposed with Dremel has found widespread use in the form of open approaches like Apache Parquet, which can be used with a multitude of storage engines and processing frameworks, such as Apache Spark. The encoding stores the presence of objects in additional columns, which are compressed using run-length encoding. By partitioning the data, we can decrease the number of runs while at the same time using the partitions for data skipping. In our test setup, these effects achieve a compression ratio of 1.37 while also reducing the query runtime by a factor of 1.22.
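The interplay between partitioning and run-length encoding mentioned in the abstract can be illustrated with a toy sketch. It is not the paper's method: the presence information is simplified to a single bit per record, and the data, the `run_length_encode` helper, and the partition names are all hypothetical. It only shows why grouping records by presence reduces the number of runs.

```python
from itertools import groupby


def run_length_encode(levels):
    """Return (value, run_length) pairs for a sequence of levels."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(levels)]


# Hypothetical presence column: 1 = optional nested object present, 0 = absent.
presence = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]

# Without partitioning, the interleaved values produce many short runs.
unpartitioned_runs = run_length_encode(presence)

# Assumed partitioning criterion: split the records by presence of the object.
partitions = {
    "absent": [p for p in presence if p == 0],
    "present": [p for p in presence if p == 1],
}

# Each partition is encoded on its own and collapses into a single run.
partitioned_runs = {
    name: run_length_encode(values) for name, values in partitions.items()
}
total_partitioned_runs = sum(len(runs) for runs in partitioned_runs.values())

print(len(unpartitioned_runs))   # number of runs without partitioning
print(total_partitioned_runs)    # number of runs summed over both partitions
```

In this sketch, a query that only touches records where the object is present could also skip the "absent" partition entirely, which is the data-skipping side of the same partitioning.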
Keywords
disaggregated systems, compression, partitioning, data skipping, Dremel-encoding