Enhancing Binary Classification By Modeling Uncertain Boundary In Three-Way Decisions (Extended Abstract)

2018 IEEE 34TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE)(2018)

Abstract
Text classification techniques play a crucial role in identifying relevant texts in large data sets, for example texts related to online crimes such as cyberbullying, terrorist recruiting, propaganda, or attack planning. Supervised deep learning has brought breakthroughs in processing multimedia data; however, there has been no practical way to exploit this opportunity for text classification, because acquiring and maintaining massive numbers of training examples is too expensive when the number of categories is large (e.g., the Yahoo! taxonomy contains nearly 300,000 categories, and the Library of Congress Subject Headings (LCSH) contain 394,070 subjects). Therefore, the question of how to learn effectively from a sparse or small set of training examples is crucial for the true success of text classification. Semi-supervised approaches proposed for this challenge usually use two or more existing classifiers to extend a small training set. However, the extracted pseudo training samples are uncertain because they are labeled by a machine rather than by people. In addition, the massive volume and high variability of text data create challenging issues such as scalability and complicated relations between words. There are two fundamental issues with regard to the performance of existing classifiers: overlook and overload. Overlook means that some objects relevant to a class have been omitted, whereas overload means that some objects assigned to a class are actually not relevant to that class. The two issues are even more serious in the following two cases: (1) large uncertain boundary: the decision boundary between two classes includes many mixed examples (e.g., relevant and nonrelevant documents together), and (2) unbalanced classes: one class (e.g., information about terrorist attacks) is much smaller than another class (e.g., normal descriptions).
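As a small illustration of the two issues, the sketch below counts overlooked and overloaded objects for one class. It assumes that overlook corresponds to false negatives and overload to false positives; this mapping, the function name `overlook_overload`, and the toy labels are assumptions made for illustration and are not taken from the paper.

```python
def overlook_overload(y_true, y_pred):
    """Count overlooked and overloaded objects for the positive class.

    Assumption: labels are 1 (relevant) and 0 (nonrelevant).
    overlook = relevant objects that were not assigned to the class
    overload = objects assigned to the class that are not relevant
    """
    pairs = list(zip(y_true, y_pred))
    overlook = sum(1 for t, p in pairs if t == 1 and p == 0)
    overload = sum(1 for t, p in pairs if t == 0 and p == 1)
    return overlook, overload


# Toy, made-up labels with an unbalanced class (3 relevant, 5 nonrelevant).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]
print(overlook_overload(y_true, y_pred))  # → (2, 1)
```

With a small positive class, even a few overlooked objects remove a large share of the relevant documents, which is one way to read why unbalanced classes aggravate the problem.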
We propose a three-way decision model [1] that deals with the uncertain boundary to improve text classification performance, based on rough set techniques and a centroid solution. It aims to understand the uncertain boundary by partitioning the training samples into three regions (the positive, boundary, and negative regions) using two main boundary vectors created from the labeled positive and negative training subsets, respectively, and then resolves the objects in the boundary region using two derived boundary vectors produced according to the structure of the boundary region. Four decision rules are derived from the training process and applied to incoming documents for more precise classification. Experimental results on the standard data sets RCV1 and Reuters-21578 show that the use of boundary vectors is very effective and efficient for dealing with uncertainties of the decision boundary, and that the proposed model significantly improves the performance of binary text classification in terms of F1 measure and AUC compared with six other popular baseline models.
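The abstract does not specify how the boundary vectors are built, so the following is only a minimal sketch of the general idea of a centroid-based three-way partition, not the authors' method. It assumes documents are already represented as term-weight vectors, uses plain class centroids in place of the paper's boundary vectors, and uses a hypothetical `margin` threshold on the cosine-similarity difference to decide the boundary region. The step that resolves the boundary region with derived boundary vectors is not sketched.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def three_way_partition(docs, pos_centroid, neg_centroid, margin=0.1):
    """Assign each document index to the positive, boundary, or negative region.

    Assumption: a document is uncertain when its similarities to the two
    centroids differ by no more than `margin` (a hypothetical parameter).
    """
    regions = {"positive": [], "boundary": [], "negative": []}
    for index, doc in enumerate(docs):
        diff = cosine(doc, pos_centroid) - cosine(doc, neg_centroid)
        if diff > margin:
            regions["positive"].append(index)
        elif diff < -margin:
            regions["negative"].append(index)
        else:
            regions["boundary"].append(index)
    return regions


# Toy two-term vectors, invented for illustration only.
pos_centroid = centroid([[1.0, 0.0], [0.9, 0.1]])
neg_centroid = centroid([[0.0, 1.0], [0.1, 0.9]])
docs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(three_way_partition(docs, pos_centroid, neg_centroid))
# → {'positive': [0], 'boundary': [2], 'negative': [1]}
```

The third document is equally similar to both centroids, so it falls in the boundary region; in the proposed model such objects would be handled by the derived boundary vectors and the four decision rules rather than forced into one class.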
Keywords
Uncertain decision boundary,Text classification,Three way decision,Rough set,Decision rule