Similarity and Locality Based Indexing for High Performance Data Deduplication
IEEE Trans. Computers (2015)
Abstract
Data deduplication has gained increasing attention and popularity as a space-efficient approach in backup storage systems. One of the main challenges for centralized data deduplication is the scalability of fingerprint-index search. In this paper, we propose SiLo, a near-exact and scalable deduplication system that exploits the similarity and locality of data streams in a complementary way to achieve high duplicate elimination, high throughput, and well-balanced load at extremely low RAM overhead. The main idea behind SiLo is to expose and exploit more similarity by grouping strongly correlated small files into a segment and by segmenting large files, and to leverage the locality in the data stream by grouping contiguous segments into blocks, which captures similar and duplicate data missed by the probabilistic similarity detection. SiLo also employs a locality-based stateless routing algorithm to parallelize and distribute data blocks to multiple backup nodes. By judiciously enhancing similarity through the exploitation of locality and vice versa, SiLo is able to significantly reduce RAM usage for index lookup, achieve near-exact duplicate-elimination efficiency, maintain a high deduplication throughput, and obtain load balance among backup nodes.
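The interplay of similarity and locality described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how a segment's similarity is detected, so the sketch assumes a representative fingerprint equal to the smallest SHA-1 chunk fingerprint in the segment. The class name `SiLoSketch`, its methods, the block size, and the unbounded cache are all hypothetical, and the routing of blocks to backup nodes is left out.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Chunk fingerprint (SHA-1 is an assumption made for this sketch)."""
    return hashlib.sha1(chunk).hexdigest()


class SiLoSketch:
    """Toy index: one RAM entry per segment, locality recovered via blocks."""

    def __init__(self, segments_per_block: int = 4):
        self.segments_per_block = segments_per_block
        # Representative fingerprint -> block (a list of segments).
        # This is the only structure assumed to live permanently in RAM.
        self.similarity_index = {}
        # Fingerprints of blocks loaded after a similarity hit, plus
        # fingerprints written during the current stream.
        self.cache = set()
        self.current_block = []
        self.stored_chunks = 0
        self.duplicate_chunks = 0

    def start_stream(self) -> None:
        """Begin a new backup stream: drop the cache, open a fresh block."""
        self.cache = set()
        self.current_block = []

    def add_segment(self, chunks) -> int:
        """Deduplicate one segment; return how many of its chunks were duplicates."""
        fps = [fingerprint(c) for c in chunks]
        rep = min(fps)  # assumed representative fingerprint of the segment

        # Similarity: a hit on the representative fingerprint finds a block.
        block = self.similarity_index.get(rep)
        if block is not None:
            # Locality: load the fingerprints of the WHOLE block, so that
            # neighbouring segments can be matched even if their own
            # representative fingerprints miss the index.
            for segment in block:
                self.cache.update(segment)

        duplicates = 0
        for fp in fps:
            if fp in self.cache:
                duplicates += 1
            else:
                self.stored_chunks += 1
                self.cache.add(fp)
        self.duplicate_chunks += duplicates

        # Record the segment in the current block and index it by its
        # representative fingerprint (the first mapping wins).
        self.current_block.append(fps)
        self.similarity_index.setdefault(rep, self.current_block)
        if len(self.current_block) == self.segments_per_block:
            self.current_block = []
        return duplicates
```

In this sketch, a second backup whose first segment is unchanged hits the similarity index and pulls in the whole block. A following segment whose representative chunk has been modified misses the index, yet its unchanged chunks are still found in the cached block. This is the sense in which locality captures duplicates that probabilistic similarity detection alone would miss.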
Keywords
ram overhead,data stream locality,ram usage reduction,data block parallelization,strongly-correlated small-file grouping,contiguous segment grouping,locality based stateless routing algorithm,index structure,data deduplication,data stream similarity,silo,high-performance data deduplication,space-efficient approach,similarity based indexing,index-lookup,resource allocation,duplicate elimination,database indexing,fingerprint-index search scalability,near-exact efficiency,locality leveraging,deduplication throughput,large-file segmentation,performance evaluation,centralized data deduplication,near-exact-scalable deduplication system,meta data,storage system,probabilistic similarity detection,locality based indexing,backup nodes,load balancing,data block distribution,backup storage systems,indexing,scalability,servers,throughput,probabilistic logic