Athena: Mining-Based Interactive Management Of Text Databases
EDBT '00: Proceedings of the 7th International Conference on Extending Database Technology: Advances in Database Technology (2000)
Abstract
We describe Athena: a system for creating, exploiting, and maintaining a hierarchy of textual documents through interactive mining-based operations. Requirements of any such system include speed and minimal end-user effort. Athena satisfies these requirements through linear-time classification and clustering engines which are applied interactively to speed the development of accurate models.
Naive Bayes classifiers are recognized to be among the best for classifying text. We show that our specialization of the Naive Bayes classifier is considerably more accurate (7 to 29% absolute increase in accuracy) than a standard implementation. Our enhancements include using Lidstone's law of succession instead of Laplace's law, under-weighting long documents, and over-weighting author and subject.
We also present a new interactive clustering algorithm, C-Evolve, for topic discovery. C-Evolve first finds highly accurate cluster digests (partial clusters), gets user feedback to merge and correct these digests, and then uses the classification algorithm to complete the partitioning of the data. By allowing this interactivity in the clustering process, C-Evolve achieves considerably higher clustering accuracy (10 to 20% absolute increase in our experiments) than the popular K-Means and agglomerative clustering methods.
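To make the smoothing enhancement concrete, the following is a minimal sketch of Lidstone's law of succession applied to per-class word probabilities in a multinomial Naive Bayes model. It is not Athena's implementation: the function name, the toy data, and the value lam=0.1 are assumptions chosen for illustration, and the abstract does not state which value the authors use. The estimate is P(w | c) = (n_wc + lam) / (n_c + lam * |V|), where n_wc is the count of word w in class c, n_c is the total word count in class c, and |V| is the vocabulary size.

```python
import math
from collections import Counter


def lidstone_log_probs(tokens, vocabulary, lam=0.1):
    """Return log P(word | class) for each vocabulary word.

    tokens:     all word occurrences observed in one class's training documents
    vocabulary: the set of words the classifier knows about
    lam:        Lidstone smoothing constant (hypothetical default, not from the paper)
    """
    # Count only in-vocabulary words so the probabilities sum to one.
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values())
    denom = total + lam * len(vocabulary)
    return {w: math.log((counts[w] + lam) / denom) for w in vocabulary}


# Toy example: one class with three observed tokens and a three-word vocabulary.
vocab = {"a", "b", "c"}
class_tokens = ["a", "a", "b"]

lidstone = lidstone_log_probs(class_tokens, vocab, lam=0.1)
laplace = lidstone_log_probs(class_tokens, vocab, lam=1.0)
```

Setting lam=1.0 recovers Laplace's law. A smaller lam moves less probability mass to words never seen in a class, which matters for text because most of the vocabulary is unseen in any one class.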
Keywords
Concept Drift, Agglomerative Cluster, True Cluster, Classification, Text Cluster