Categorization and Similarity Analysis: Implementation and Evaluation

mag(2014)

Abstract
Executive Summary: This report covers the implementation of software that aims to identify document versions and semantically related documents, a task that grows in importance as the amount of digital information increases. Key criteria were that the software be fast and require limited disk space. Previous research determined that the Simhash algorithm was the most appropriate for this application, so this method was implemented. Each component was given a well-defined structure with fixed inputs and outputs, resulting in a software system whose parts can be interchanged if required. The software was tested on three document corpora to identify the strengths and weaknesses of the calculations used. Initial modifications were made to parameters such as the shingle size and the hash length to ensure that hash values were unique and that unrelated phrases were not hashed to similar values. Not surprisingly, longer hash values gave more accuracy, and run-time did not increase significantly. Increasing the shingle size also gave a better reflection of the uniqueness of each input phrase. The naive implementation performed moderately on a custom-made document version corpus, but the similarity values were low for documents with only a few word changes. A similarity measure based on the Jaccard Index was more accurate. The software identified most document versions correctly and only had issues with the merging and separating of paragraphs. A theoretical solution to this issue was described. Testing for semantically similar documents was limited compared with the testing for versions, as finding document versions was identified as the focus. Initial testing showed that hashing the extracted entities for each paragraph returned values that carried limited information about that paragraph. Future work should analyse the entities at document level rather than paragraph level.
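To make the parameters mentioned in the summary concrete, the following is a minimal sketch of Simhash over word shingles, alongside a Jaccard Index computed on the same shingle sets. It is not the report's implementation: the function names, the whitespace tokenizer, the use of SHA-256 as the underlying hash, the defaults `k=3` and `bits=64`, and the choice to compute Jaccard over shingle sets are all assumptions made for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (assumed whitespace tokenization)."""
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(text, k=3, bits=64):
    """Simhash fingerprint: every shingle votes +1 or -1 on each bit position."""
    if not 0 < bits <= 256:
        raise ValueError("bits must be in 1..256 when truncating SHA-256")
    counts = [0] * bits
    mask = (1 << bits) - 1
    for sh in shingles(text, k):
        digest = hashlib.sha256(sh.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask  # keep the low `bits` bits
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit of the fingerprint is set when the majority of shingles set it.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming_similarity(a, b, bits=64):
    """Fraction of bit positions on which two fingerprints agree."""
    return 1.0 - bin(a ^ b).count("1") / bits


def jaccard(text_a, text_b, k=3):
    """Jaccard Index |A & B| / |A | B| over the two shingle sets."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

In this sketch, `k` corresponds to the shingle size and `bits` to the hash length discussed above. Comparing two paragraphs would use `hamming_similarity(simhash(p1), simhash(p2))` for the fingerprint-based score and `jaccard(p1, p2)` for the set-based score.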
Keywords
computer science,working paper