Accurate discovery of co-derivative documents via duplicate text detection

Information Systems (2006)

Abstract
Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other, or some portion of both must be derived from a third document. An existing technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present SPEX, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe DECO, a prototype package that combines the SPEX algorithm with other optimisations and compressed indexing to produce a flexible and scalable co-derivative discovery system. Our experiments with multi-gigabyte document collections demonstrate the effectiveness of the approach.
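The general idea of fingerprinting described in the abstract, matching documents on the hash values of selected chunks, can be illustrated with a minimal sketch. This is not the SPEX algorithm or the DECO package, whose details are not given here; it is a generic illustration under assumptions of our own: chunks are overlapping runs of three words, every chunk is selected, and the names `chunks`, `chunk_hash`, and `shared_chunk_counts` are hypothetical.

```python
import hashlib
from collections import defaultdict


def chunks(text, size=3):
    """Yield every overlapping run of `size` words (an assumed chunk definition)."""
    words = text.lower().split()
    for i in range(len(words) - size + 1):
        yield " ".join(words[i:i + size])


def chunk_hash(chunk):
    """Reduce a chunk to a 64-bit integer taken from its SHA-1 digest."""
    digest = hashlib.sha1(chunk.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shared_chunk_counts(docs, size=3):
    """Map each document pair to the number of distinct chunk hashes they share."""
    # Inverted index: chunk hash -> set of documents containing that chunk.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for chunk in chunks(text, size):
            postings[chunk_hash(chunk)].add(doc_id)

    # Only hashes that occur in two or more documents contribute to a pair.
    counts = defaultdict(int)
    for doc_ids in postings.values():
        ordered = sorted(doc_ids)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                counts[(first, second)] += 1
    return dict(counts)


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "a quick brown fox jumps over a sleeping cat",
    "c": "completely unrelated text about compressed indexing here",
}
result = shared_chunk_counts(docs)
print(result)
```

In this sketch, chunks that occur in only one document never contribute to a pair count, yet they still occupy space in the index. That is the kind of unhelpful information the abstract says plain fingerprinting struggles to isolate, and it motivates extracting only duplicated chunks.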
Keywords
SPEX algorithm, document fingerprinting, existing technique, hash value, co-derivative cluster, multi-gigabyte document collection, duplicate text detection, scalable co-derivative discovery system, fingerprinting, duplicate detection, accurate discovery, selected document subsequence, novel hash-based algorithm, document collection, hashing, co-derivative document, indexation