Cluster-Based Delta Compression of a Collection of Files

WISE (2002)

Abstract
Delta compression techniques are commonly used to succinctly represent an updated version of a file with respect to an earlier one. In this paper, we study the use of delta compression in a somewhat different scenario, where we wish to compress a large collection of (more or less) related files by performing a sequence of pairwise delta compressions. The problem of finding an optimal delta encoding for a collection of files by taking pairwise deltas can be reduced to the problem of computing a branching of maximum weight in a weighted directed graph, but this solution is inefficient and thus does not scale to larger file collections. This motivates us to propose a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings. To demonstrate the efficacy of our approach, we present experimental results on collections of web pages. Our experiments show that cluster-based delta compression of collections provides significant improvements in compression ratio as compared to individually compressing each file or using tar+gzip, at a moderate cost in efficiency.
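The graph formulation in the abstract can be illustrated with a small sketch. This is not the paper's method: it assumes zlib with a preset dictionary as a stand-in for a real delta compressor (zlib only uses the last 32 KB of the reference), and it builds the branching with a simple greedy heuristic rather than an optimal maximum-weight branching algorithm or any clustering step. The function names `compressed_size` and `greedy_branching` are hypothetical.

```python
import zlib
from typing import Dict, Optional


def compressed_size(data: bytes, reference: bytes = b"") -> int:
    """Deflate size of `data`, optionally primed with `reference`.

    Assumption: a zlib preset dictionary approximates a delta compressor.
    """
    if reference:
        comp = zlib.compressobj(level=9, zdict=reference)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(data) + comp.flush())


def greedy_branching(files: Dict[str, bytes]) -> Dict[str, Optional[str]]:
    """Pick at most one reference file per file, without creating cycles.

    Edge (ref -> tgt) is weighted by the bytes saved when tgt is encoded
    relative to ref instead of on its own. Edges are taken in order of
    decreasing saving; this is a heuristic, not an optimal branching.
    """
    names = list(files)
    alone = {n: compressed_size(files[n]) for n in names}

    # All pairwise deltas: quadratic in the number of files, which is the
    # cost that pruning the graph is meant to avoid.
    edges = []
    for ref in names:
        for tgt in names:
            if ref != tgt:
                saving = alone[tgt] - compressed_size(files[tgt], files[ref])
                if saving > 0:
                    edges.append((saving, ref, tgt))
    edges.sort(reverse=True)

    parent: Dict[str, Optional[str]] = {n: None for n in names}
    for _saving, ref, tgt in edges:
        if parent[tgt] is not None:
            continue  # tgt already has a reference file
        # Walk up from ref; reaching tgt would mean the edge closes a cycle.
        node: Optional[str] = ref
        while node is not None and node != tgt:
            node = parent[node]
        if node is None:
            parent[tgt] = ref
    return parent
```

A file mapped to `None` is compressed on its own; every other file is encoded against its assigned reference. Because the sketch computes a delta for every ordered pair of files, its cost grows quadratically with the collection size, which is the scaling problem the abstract describes.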
Keywords
Web sites, data compression, directed graphs, file organisation, very large databases, Web pages, cluster-based delta compression, compression ratio, experimental results, file collection, large data set, optimal delta encoding, text clustering techniques, weighted directed graph