Identifying Web Pages with Major Contents based on Search Engine Suggests and Topic Modeling

semanticscholar(2017)

Abstract
This paper addresses the problem of identifying irrelevant items in a small set of similar documents using Web search engine suggests. Specifically, we collect large volumes of Web pages through Web search engines and inspect the page contents using topic models. Within each cluster of pages that share a topic indicated by the topic model, our technique discovers the potential content organization of the cluster and identifies pages that are out of focus for that topic. The metrics in our approach consist mainly of search engine suggest frequency and inter-document similarity measures. The intuition is that Web pages collected via the same search queries are more likely to share similar contents. We verify this intuition by implementing a subtopic-based document selection framework and evaluating it quantitatively against human-labeled data sets. Our evaluation results show that suggest frequency analysis combined with inter-document similarity measures is effective at filtering off-topic documents in small data sets.
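
The abstract names two signals, suggest frequency and inter-document similarity, but does not say how they are computed or combined. The following is only a minimal sketch under stated assumptions, not the paper's method: it treats suggest frequency as the fraction of suggest phrases that appear in a page, uses bag-of-words cosine similarity as the inter-document measure, and mixes the two with a hypothetical weight `alpha` and cutoff `threshold`. All function and parameter names are illustrative.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_pages(pages, suggests, alpha=0.5):
    """Score every page in one topic cluster.

    pages:    dict mapping page id -> page text (assumed already extracted)
    suggests: list of search engine suggest phrases for the cluster's query
    alpha:    hypothetical weight between the two signals (an assumption)
    """
    bags = {pid: Counter(text.lower().split()) for pid, text in pages.items()}
    scores = {}
    for pid, bag in bags.items():
        # Mean similarity of this page to every other page in the cluster.
        sims = [cosine(bag, other) for oid, other in bags.items() if oid != pid]
        mean_sim = sum(sims) / len(sims) if sims else 0.0
        # Fraction of suggest phrases that occur verbatim in the page text.
        text = pages[pid].lower()
        hits = sum(1 for s in suggests if s.lower() in text)
        suggest_freq = hits / len(suggests) if suggests else 0.0
        scores[pid] = alpha * suggest_freq + (1 - alpha) * mean_sim
    return scores


def off_topic(pages, suggests, threshold=0.2, alpha=0.5):
    """Return the sorted ids of pages whose score falls below the cutoff."""
    scores = score_pages(pages, suggests, alpha)
    return sorted(pid for pid, s in scores.items() if s < threshold)
```

In this sketch a page is flagged when it both shares few terms with the rest of its cluster and matches few suggest phrases. A real pipeline would first group pages with a topic model and would likely use a stronger text representation than raw term counts.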