Column Cache: Buffer Cache for Columnar Storage on HDFS

2018 IEEE International Conference on Big Data (Big Data) (2018)

Abstract
Columnar storage is a data source for data analytics in distributed computing frameworks. For portability and scalability, columnar storage is built on top of existing distributed file systems, using columnar data representations such as Parquet, RCFile, and ORC. However, these representations fail to pass high-level information (e.g., the columnar format) down to low-level disk buffer management in the operating system. As a result, data analytics workloads suffer from redundant memory buffers with expensive garbage collection, unnecessary disk readahead, and cache pollution in the operating system buffer cache.

We propose column cache, which unifies and restructures the buffers and caches of multiple software layers, from columnar storage down to the operating system. Column cache leverages high-level information such as file formats and query plans to enable adaptive disk reads and cache eviction policies. We have developed a column cache prototype for Apache Parquet and observed that it reduced redundant resource utilization in Apache Spark. Specifically, with our prototype, Spark showed a maximum speedup of 1.28x on TPC-DS workloads while increasing Linux page cache size by 18%, reducing total disk reads by 43%, and reducing garbage collection time in the Java virtual machine by 76%.
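The idea of using file-format and query-plan information to drive reads and eviction can be illustrated with a small sketch. This is not the authors' prototype: the class `ColumnAwareCache`, the `footer` mapping (column name to byte offset and length), and the `still_needed` argument standing in for query-plan knowledge are all hypothetical names chosen for illustration. The sketch reads exactly the byte range of a requested column chunk (no readahead past it) and, when space is short, evicts columns the plan no longer needs before falling back to least-recently-used order.

```python
import io
from collections import OrderedDict


class ColumnAwareCache:
    """Toy cache keyed by column chunk instead of fixed-size pages (illustrative only)."""

    def __init__(self, fileobj, footer, capacity_bytes):
        self.fileobj = fileobj
        self.footer = footer          # hypothetical: column name -> (offset, length)
        self.capacity = capacity_bytes
        self.chunks = OrderedDict()   # column name -> bytes, oldest first
        self.bytes_read = 0           # bytes actually fetched from the file

    def _used(self):
        return sum(len(buf) for buf in self.chunks.values())

    def _evict(self, incoming, still_needed):
        # Prefer victims the plan no longer needs; keep LRU order within each group.
        unneeded = [c for c in self.chunks if c not in still_needed]
        needed = [c for c in self.chunks if c in still_needed]
        for victim in unneeded + needed:
            if self._used() + incoming <= self.capacity:
                break
            del self.chunks[victim]

    def read_column(self, name, still_needed=()):
        if name in self.chunks:
            self.chunks.move_to_end(name)   # cache hit: refresh recency
            return self.chunks[name]
        offset, length = self.footer[name]
        self.fileobj.seek(offset)
        data = self.fileobj.read(length)    # read only this column's byte range
        self.bytes_read += length
        self._evict(length, still_needed)
        self.chunks[name] = data
        return data


# A 12-byte "file" holding three 4-byte column chunks.
data = b"AAAABBBBCCCC"
footer = {"a": (0, 4), "b": (4, 4), "c": (8, 4)}
cache = ColumnAwareCache(io.BytesIO(data), footer, capacity_bytes=8)

# First query touches only columns a and c, so column b is never read.
cache.read_column("a", still_needed={"a"})
cache.read_column("c", still_needed={"a"})
bytes_after_q1 = cache.bytes_read

# Second query reads b while the plan still needs a: c is evicted, not a.
cache.read_column("b", still_needed={"a"})
print(sorted(cache.chunks), cache.bytes_read)
```

In this toy run, plain LRU would have evicted column `a` (the oldest entry) to make room for `b`; the plan hint keeps `a` resident and drops `c` instead, which is the kind of decision a format- and plan-aware cache can make and a generic page cache cannot.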
Keywords
low-level disk buffer management, data analytics workloads, redundant memory buffers, cache pollution, operating system buffer cache, cache eviction policies, column cache prototype, Linux page cache size, distributed computing frameworks, columnar data representations, columnar formats, high-level information, distributed file systems, reduced redundant resource utilization, columnar storage buffer cache, Java virtual machine