Discovering Informative Contents of Web Pages.

WAIM (2014)

Abstract
The World Wide Web has become a huge information repository. However, besides informative content, Web pages also contain redundant content, which is considered harmful to Web mining and search systems. In this paper, we propose a new approach to discovering informative content in a set of Web pages from a single Web site. Our method works as follows. First, we build a newly designed Site Style Tree (SST) that captures the common presentation styles and the actual contents of the pages in the given Web site. The tree structure, which differs from the one proposed previously, is built by aligning the pages of the site. Then, for each node of the SST, informative content is identified using an entropy measure and a threshold. The proposed approach is evaluated on two mining tasks, Web page clustering and classification. The experimental results show a significant improvement over previous template detection approaches.
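The abstract does not give the entropy formula or the threshold value, so the following is only a minimal sketch of the general idea under stated assumptions: content that repeats unchanged at the same aligned position across pages is treated as template material, while content that varies is treated as informative. The function names, the normalized Shannon entropy, the 0.5 threshold, and the sample node paths are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of the observed values, scaled to the range [0, 1].

    Assumption: a node's variability is measured over the text found at
    that node on each page. The paper's actual measure may differ.
    """
    n = len(values)
    if n <= 1:
        return 0.0
    counts = Counter(values)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n)  # log(n) is the maximum entropy for n samples


def informative_nodes(node_contents, threshold=0.5):
    """Return the nodes whose content varies enough across pages.

    node_contents maps a (hypothetical) aligned node path to the list of
    texts seen at that node, one entry per page. The threshold is an
    illustrative value, not one reported by the authors.
    """
    return [
        node
        for node, contents in node_contents.items()
        if normalized_entropy(contents) >= threshold
    ]


# Hypothetical data: three aligned nodes observed on four pages of one site.
pages = {
    "html/body/div.nav": ["Home | About"] * 4,
    "html/body/div.main": ["Story A", "Story B", "Story C", "Story D"],
    "html/body/div.footer": ["(c) Example"] * 4,
}

print(informative_nodes(pages))  # ['html/body/div.main']
```

In this toy input the navigation bar and footer are identical on every page, so their entropy is zero and they fall below the threshold, while the main block differs on every page and is kept.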
Keywords
Template Detection, Information Extraction, Entropy