Learning Document Labels from Enriched Click Graphs

Data Mining Workshops (2010)

Abstract
Document classification plays an increasingly important role in extracting and organizing knowledge. However, Web document classification is hindered by the huge number of Web documents and the limited human judgment available for labeling training data. To obtain sufficient training data in a cost-efficient way, we propose a semi-supervised learning approach that predicts a document's class label by mining the click graph. To overcome the sparseness of the click graph, we enrich it with hyperlinks between Web documents, and we add content-based constraints to regularize the graph. The resulting graph unifies three data sources: click-through data, hyperlinks, and content relevance. Starting from a very small seed set of manually labeled documents, we automatically discover a large number of relevant documents by applying a Markov random walk model to the enriched click graph. The top pages with high confidence scores are added to the current training data for classifier training. We investigate various combinations of the three sources and conduct extensive experiments on six typical Web classification tasks. The experimental results show that the click graph enriched with hyperlink and content information significantly improves classification quality across multiple tasks with only minimal human labeling cost.
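The pipeline described in the abstract (merge three edge sources into one graph, walk from a small seed set, keep the top-scoring pages) can be illustrated with a minimal sketch. The abstract does not specify the exact walk model, edge weights, or stopping rule, so everything below is an assumption made for illustration: the walk is a random walk with restart, and the names `build_graph`, `random_walk_scores`, `top_candidates`, the weights `w_click`, `w_link`, `w_content`, and the toy data are all hypothetical.

```python
from collections import defaultdict


def build_graph(clicks, hyperlinks, content_sims,
                w_click=1.0, w_link=0.5, w_content=0.5):
    """Merge click, hyperlink, and content edges into one undirected
    weighted graph. The three weights are illustrative assumptions."""
    graph = defaultdict(lambda: defaultdict(float))

    def add_edge(u, v, weight):
        graph[u][v] += weight
        graph[v][u] += weight

    # Click graph: bipartite edges between queries and clicked documents.
    for query, doc, count in clicks:
        add_edge(("q", query), ("d", doc), w_click * count)
    # Enrichment 1: hyperlinks between documents.
    for src, dst in hyperlinks:
        add_edge(("d", src), ("d", dst), w_link)
    # Enrichment 2: content similarity between documents.
    for doc_a, doc_b, sim in content_sims:
        add_edge(("d", doc_a), ("d", doc_b), w_content * sim)
    return graph


def random_walk_scores(graph, seeds, restart=0.15, iterations=50):
    """Random walk with restart from the seed documents (an assumed
    stand-in for the paper's Markov random walk model)."""
    seed_nodes = [("d", s) for s in seeds]
    restart_mass = {n: 1.0 / len(seed_nodes) for n in seed_nodes}
    scores = dict(restart_mass)
    for _ in range(iterations):
        nxt = defaultdict(float)
        for node, mass in scores.items():
            neighbors = graph.get(node, {})
            total = sum(neighbors.values())
            if total == 0:
                continue  # mass on an isolated node is dropped in this sketch
            for nbr, weight in neighbors.items():
                nxt[nbr] += (1 - restart) * mass * weight / total
        for node, mass in restart_mass.items():
            nxt[node] += restart * mass
        scores = dict(nxt)
    return scores


def top_candidates(scores, seeds, k):
    """Return the k highest-scoring unlabeled documents."""
    docs = [(node[1], score) for node, score in scores.items()
            if node[0] == "d" and node[1] not in seeds]
    docs.sort(key=lambda item: (-item[1], item[0]))
    return [doc for doc, _ in docs[:k]]


# Toy data (hypothetical): one labeled "travel" seed page.
clicks = [("cheap flights", "travel1", 5),
          ("cheap flights", "travel2", 3),
          ("python tutorial", "code1", 4)]
hyperlinks = [("travel2", "travel3")]
content_sims = [("travel1", "travel3", 0.8)]
seeds = {"travel1"}

graph = build_graph(clicks, hyperlinks, content_sims)
scores = random_walk_scores(graph, seeds)
picked = top_candidates(scores, seeds, k=2)
```

In this toy graph, `code1` shares no query, link, or content edge with the seed, so the walk never reaches it and it cannot be selected. The pages in `picked` would then be added to the training set with the seed's label before the classifier is retrained.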
Keywords
training data, current training data, web document, click graph, learning document labels, click-through data, web document classification task, resulting graph, sufficient training data, data source, enriched click graph, bipartite graph, internet, semi-supervised learning, graph theory, learning artificial intelligence, markov processes, data mining, random walk, web pages, hyperlinks, cost efficiency, data models