Ameliorating data compression and query performance through cracked Parquet.

ACM Conference on Management of Data (2022)

Abstract
In this paper, we propose to exploit synergy effects between partitioning and compression for Dremel-encoded nested data serving as the data storage for Spark-style processing jobs. The encoding proposed with Dremel has found widespread use in the form of open approaches like Apache Parquet, which can be used with a multitude of storage engines and processing frameworks, like Apache Spark. The encoding stores the presence of objects in additional columns that are compressed using run-length encoding. Using partitioning, we can decrease the number of runs while at the same time using the partitions for data skipping. These effects can achieve a compression ratio of 1.37 while also reducing the query runtime by a factor of 1.22 in our test setup.
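
The link between partitioning and the number of runs can be illustrated with a toy sketch. This is not the paper's implementation: the level values, the single optional field, and the helper name `run_length_encode` are all assumptions made for illustration, and plain run-length encoding stands in for whatever encoding a real Parquet writer applies.

```python
from itertools import groupby


def run_length_encode(levels):
    """Return (value, run_length) pairs for a sequence of levels."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(levels)]


# Hypothetical presence column for one optional field:
# 1 = object present, 0 = object absent (values are made up).
definition_levels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1]

# Unpartitioned: records are kept in arrival order, so runs are short.
unpartitioned_runs = run_length_encode(definition_levels)

# Partitioned on presence: each partition holds records with the same level,
# so each partition collapses into a single run.
partitions = {}
for level in definition_levels:
    partitions.setdefault(level, []).append(level)

partitioned_runs = {
    level: run_length_encode(values) for level, values in partitions.items()
}
total_partitioned_runs = sum(len(runs) for runs in partitioned_runs.values())

# Data skipping: a query that only touches records where the field is present
# needs to read partition 1 only and can skip partition 0 entirely.
partitions_to_read = [level for level in partitions if level == 1]

print(len(unpartitioned_runs), total_partitioned_runs, partitions_to_read)
```

In this toy input the interleaved column needs many short runs, while the partitioned layout needs one run per partition, and the same partitions also tell a reader which data it can skip.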
Keywords
disaggregated systems, compression, partitioning, data skipping, Dremel-encoding