Domain concept handling in automated text categorization

Industrial Electronics and Applications (2010)

Abstract
Single-term document representations, e.g. bag-of-words, have been widely accepted in the machine learning, information retrieval and text mining communities. One notable limitation of such methods is that they ignore the rich information carried by the semantic relations among terms. This paper reports our approach to concept handling in document representation and its effect on text categorization performance. We introduce a Frequent word Sequence algorithm that generates concept-centered phrases to render domain knowledge concepts. Our experimental study on a domain-centered corpus shows that a consistent performance improvement can be achieved when concept-centered phrases are included alongside single-term features in document representations. We also observed that no universally suitable support threshold exists, and that removing concept-irrelevant sequences may further improve performance at a lower support level.
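To make the idea of support-thresholded word sequences concrete, the following is a minimal sketch, not the Frequent word Sequence algorithm of the paper, whose details the abstract does not give. It assumes that a "sequence" is a contiguous run of 2 to `max_len` tokens and that "support" is the number of documents containing the sequence. The function names `frequent_word_sequences` and `represent`, the toy corpus, and the threshold value are all hypothetical.

```python
from collections import Counter


def frequent_word_sequences(docs, min_support, max_len=3):
    """Return contiguous word sequences (length 2..max_len) that occur in
    at least `min_support` documents. Illustrative assumption only."""
    doc_freq = Counter()
    for tokens in docs:
        seen = set()
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
        # Count each sequence at most once per document (document support).
        doc_freq.update(seen)
    return {seq for seq, df in doc_freq.items() if df >= min_support}


def represent(tokens, phrases, max_len=3):
    """Single-term counts plus counts of the frequent sequences found."""
    features = Counter(tokens)
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            seq = tuple(tokens[i:i + n])
            if seq in phrases:
                features[" ".join(seq)] += 1
    return features


# Toy corpus (hypothetical), already tokenized.
docs = [
    "support vector machines classify text".split(),
    "we train support vector machines".split(),
    "text mining uses vector models".split(),
]
phrases = frequent_word_sequences(docs, min_support=2)
features = represent(docs[0], phrases)
```

In this sketch, lowering `min_support` admits more sequences, including ones unrelated to any domain concept, which is consistent with the observation that filtering concept-irrelevant sequences may matter more at lower support levels.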
Keywords
data mining, information retrieval, learning (artificial intelligence), text analysis, automated text categorization, bag-of-words, document representations, domain concept handling, domain knowledge, machine learning, text mining, domain concept representation, information management, text categorization, mining industry, support vector machines, indexing