Using Document Classification To Improve The Performance Of A Plagiarism Checker: A Case For Thai Language Documents

2017 21st International Computer Science and Engineering Conference (ICSEC 2017)

Abstract
Plagiarism checking can become prohibitively expensive when the document database to be checked against is large. To improve the checker's performance, we propose a method that organizes the document database into categories based on a label assigned to each document. The label is derived from simple heuristics applied to the first few pages of the document. A document to be checked for possible plagiarism is assigned a set of probability values reflecting the likelihood that it belongs to each category. The checker then examines only the relevant categories, obviating the need to check against the whole database. This paper focuses primarily on documents in the Thai language. We use datasets containing theses and journals from Kasetsart University from 1998 to 2010. We found that, for the given datasets, searching only 5 out of 20 classes maintains the same accuracy as searching all 20 classes.
Keywords
Thai document classification, plagiarism checking, machine learning techniques
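
The category-restricted search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dictionary-based database layout, and the example category names are all hypothetical, and the classifier that produces the per-category probabilities is assumed to exist elsewhere. The default of 5 categories mirrors the 5-out-of-20 figure reported for the authors' datasets and may not suit other collections.

```python
from typing import Dict, List


def select_categories(probs: Dict[str, float], k: int = 5) -> List[str]:
    """Return the k most likely categories for a query document.

    `probs` maps a category name to the (assumed, externally computed)
    probability that the query document belongs to that category.
    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(probs.items(), key=lambda item: (-item[1], item[0]))
    return [category for category, _ in ranked[:k]]


def candidate_documents(
    database: Dict[str, List[str]],
    probs: Dict[str, float],
    k: int = 5,
) -> List[str]:
    """Collect only the documents stored under the top-k categories.

    `database` is a hypothetical layout: category name -> document ids.
    The plagiarism checker would then compare the query document against
    this reduced list instead of the whole database.
    """
    selected = select_categories(probs, k)
    return [doc for category in selected for doc in database.get(category, [])]


# Hypothetical data for illustration only.
example_probs = {
    "agriculture": 0.50,
    "engineering": 0.30,
    "economics": 0.15,
    "education": 0.05,
}
example_database = {
    "agriculture": ["a1", "a2"],
    "engineering": ["e1"],
    "economics": ["c1", "c2", "c3"],
    "education": ["d1"],
}

top_two = select_categories(example_probs, k=2)
reduced = candidate_documents(example_database, example_probs, k=2)
```

With `k=2`, only the documents filed under the two most probable categories are passed on to the checker, so the comparison cost scales with the size of those categories rather than with the whole database.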