Predicting Deduplication Performance: An Analytical Model and Empirical Evaluation

Owen Randall, Paul Lu

2022 IEEE International Conference on Big Data (Big Data), 2022

Abstract
Deduplication is a technique to find and eliminate redundant blocks of data for efficient data backups, efficient versioning, reduced data transfers, and reduced data-storage overheads. For large datasets, especially those with incremental updates over time (e.g., instrumentation data) and subsetting (e.g., for auxiliary experiments), deduplication makes data management faster and more efficient. The primary parameter of deduplication systems is the expected chunk size, and while many existing systems use accepted default values (e.g., 4 KB or 8 KB chunks), our experiments find that these values are suboptimal for finding duplicate data. Suboptimal deduplication and data management make it harder for researchers to manipulate, share, and experiment with large datasets.

We present the design, implementation, and an empirical validation of our analytical model that predicts the performance of deduplication parameters (i.e., the ability to find duplicate data) on any given dataset. The empirical evaluation includes workloads based on source code (i.e., the Linux kernel, Kubernetes, TensorFlow), an open-research dataset (i.e., CORD-19), and Wikipedia. Our experiments show that our model finds deduplication parameters that reduce storage requirements by up to an additional 30.72% compared to a commonly used baseline. Our model is up to 19.8x faster than a scan-based parameter search, and the resulting deduplicated datasets are all within 5.14% of the deduplicated sizes found via the scan-based search.
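For context, the sketch below illustrates the role of the expected-chunk-size parameter that the abstract refers to. It is a minimal, hypothetical example of content-defined chunking with a simplified Gear-style rolling hash, not the authors' implementation or model; the names (`chunk`, `dedup_ratio`) and the `mask_bits`/`window` parameters are assumptions for illustration. Cutting a chunk whenever the low `mask_bits` bits of the hash are zero gives an expected chunk size of roughly 2**mask_bits bytes, which is the knob whose default (e.g., 4 KB or 8 KB) the paper argues is often suboptimal.

```python
import hashlib

def chunk(data: bytes, mask_bits: int = 13, window: int = 48):
    """Yield content-defined chunks with an expected size of ~2**mask_bits bytes.

    Simplified Gear-style rolling hash; production systems typically use
    Rabin fingerprints or Gear hashes with a random byte table.
    """
    mask = (1 << mask_bits) - 1
    h = 0
    start = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF  # rolling-hash update
        # Cut when the hash hits the boundary condition, after a minimum length.
        if i - start + 1 >= window and (h & mask) == 0:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]  # final partial chunk

def dedup_ratio(data: bytes, mask_bits: int) -> float:
    """Fraction of bytes belonging to chunks already seen (duplicate data found)."""
    seen, dup_bytes = set(), 0
    for c in chunk(data, mask_bits):
        digest = hashlib.sha256(c).digest()
        if digest in seen:
            dup_bytes += len(c)
        else:
            seen.add(digest)
    return dup_bytes / max(len(data), 1)
```

A scan-based search would evaluate `dedup_ratio` over many candidate `mask_bits` values and keep the best; the paper's analytical model instead predicts which setting will find the most duplicate data without deduplicating the dataset at every candidate setting.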
Keywords
deduplication performance