Identifying File Similarity in Large Data Sets by Modulo File Length.

ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2014, PT II(2014)

引用 3|浏览16
暂无评分
摘要
Identifying file similarity is very important for data management. Sampling files is a simple and effective approach to identify the file similarity. However, the traditional sampling algorithm(TSA) is very sensitive to file modification. For example, a single bit shift would result in a failure of similarity detection. Many research efforts have been invested in solving/alleviating this problem. This paper proposes a Position-Aware Sampling(PAS) algorithm to identify file similarity in large data sets by modulo file length. This method is very effective in dealing with file modification when performing similarity detection. Comprehensive experimental results demonstrate that PAS significantly outperforms a well-known similarity detection algorithm called simhash in terms of precision and recall. Furthermore, the time overhead, CPU and memory occupation of PAS are much less than that of simhash.
更多
查看译文
关键词
file similarity,large data sets,position shifted,simhash
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要