Automatic Identification of Informative Sections of Web Pages

IEEE Transactions on Knowledge and Data Engineering (2005)

Citations: 160
Abstract
Web pages, especially dynamically generated ones, contain several items that cannot be classified as the "primary content," e.g., navigation sidebars, advertisements, and copyright notices. Most clients and end-users search for the primary content and largely do not seek the noninformative content. A tool that helps an end-user or application search and process information from Web pages automatically must separate the "primary content sections" from the other content sections. We call these sections "Web page blocks" or just "blocks." First, a tool must segment the Web pages into Web page blocks and, second, the tool must separate the primary content blocks from the noninformative content blocks. In this paper, we formally define Web page blocks and devise a new algorithm to partition an HTML page into constituent Web page blocks. We then propose four new algorithms: ContentExtractor, FeatureExtractor, K-FeatureExtractor, and L-Extractor. These algorithms identify primary content blocks by 1) looking for blocks that do not occur a large number of times across Web pages, 2) looking for blocks with desired features, and 3) using classifiers trained with block features, respectively. Operating on several thousand Web pages obtained from various Web sites, our algorithms outperform several existing algorithms with respect to runtime and/or accuracy. Furthermore, we show that a Web cache system that applies our algorithms to remove noninformative content blocks and to identify similar blocks across Web pages can achieve significant storage savings.
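The first idea in the abstract, treating blocks that recur across many pages as noninformative, can be illustrated with a minimal sketch. This is not the paper's ContentExtractor: it assumes the pages are already segmented into text blocks, compares blocks by exact match after whitespace and case normalization, and uses a hypothetical `max_fraction` threshold. The function names and sample data are likewise assumptions made for illustration.

```python
from __future__ import annotations

from collections import Counter


def normalize(block: str) -> str:
    """Collapse whitespace and case so trivially different copies compare equal."""
    return " ".join(block.lower().split())


def primary_blocks(pages: list[list[str]], max_fraction: float = 0.5) -> list[list[str]]:
    """Keep only blocks that occur in at most max_fraction of the pages.

    Assumption: each page is a list of block texts, already segmented.
    """
    # Count the number of pages containing each normalized block (once per page).
    page_counts: Counter[str] = Counter()
    for blocks in pages:
        page_counts.update({normalize(b) for b in blocks})

    limit = max_fraction * len(pages)
    return [
        [b for b in blocks if page_counts[normalize(b)] <= limit]
        for blocks in pages
    ]


# Hypothetical sample: three pages sharing a navigation bar and a copyright notice.
pages = [
    ["Home | News | Sports", "Storm hits coast", "Copyright 2005 Example Corp"],
    ["Home | News | Sports", "Team wins final", "Copyright 2005 Example Corp"],
    ["Home | News | Sports", "Markets rally", "Copyright 2005 Example Corp"],
]
result = primary_blocks(pages)
```

Exact matching keeps the sketch short, but it misses near-duplicate blocks such as a footer with a changing date. A similarity measure over block features would be needed to catch those.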
Keywords
noninformative content block, primary content, web page block, new algorithm, automatic identification, informative sections, web pages, web page, various web site, web cache system, primary content block, constituent web page block, thousand web page, text mining, feature extraction, search engines, classification, text analysis, content management, data mining, web mining, internet, indexing terms