Automatic Web Content Extraction By Combination Of Learning And Grouping

WWW '15: 24th International World Wide Web Conference, Florence, Italy, May 2015

Abstract
Web pages contain not only the actual content but also other elements such as branding banners, navigational elements, advertisements, and copyright notices. This noisy content is typically unrelated to the main subject of the page. Identifying the actual content, or clipping web pages, has many applications, such as high-quality web printing, e-reading on mobile devices, and data mining. Although many existing methods attempt to address this task, most of them either work only on certain types of web pages, e.g. article pages, or require a different model for each website. We formulate actual content identification as a DOM tree node selection problem. We derive multiple features from DOM tree node properties to train a machine learning model, and then select candidate nodes based on the learned model. Based on the observation that the actual content is usually located in a spatially continuous block, we develop a grouping technique that further filters out noisy nodes and adds back nodes the model missed. We conduct extensive experiments on a real dataset and demonstrate that our solution produces high-quality output and outperforms several baseline methods.
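
The two-stage idea in the abstract (select candidate nodes, then group them into one continuous block) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Node` fields, the thresholds, and the hand-written rule in `is_candidate` stand in for the paper's features and trained model, which the abstract does not specify, and document order is used as a rough proxy for spatial continuity.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    """A flattened DOM node. Both fields are hypothetical features."""
    text_len: int    # characters of visible text in the node
    link_chars: int  # characters of that text inside <a> elements


def is_candidate(node: Node, min_len: int = 40,
                 max_link_density: float = 0.5) -> bool:
    """Stand-in for the learned model: a fixed rule, not a trained classifier."""
    if node.text_len == 0:
        return False
    link_density = node.link_chars / node.text_len
    return node.text_len >= min_len and link_density <= max_link_density


def group(flags: list[bool], max_gap: int = 1) -> list[bool]:
    """Bridge short gaps between candidates, then keep the longest run."""
    # Step 1: fill gaps of at most max_gap nodes (adds back missed nodes).
    filled = flags[:]
    last = None
    for i, flag in enumerate(flags):
        if flag:
            if last is not None and 0 < i - last - 1 <= max_gap:
                for j in range(last + 1, i):
                    filled[j] = True
            last = i

    # Step 2: keep only the longest contiguous run (drops isolated noise).
    best_start, best_len, start = 0, 0, None
    for i, flag in enumerate(filled + [False]):  # sentinel closes the last run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return [best_start <= i < best_start + best_len for i in range(len(flags))]
```

For a page flattened into nodes in document order, `group([is_candidate(n) for n in nodes])` returns one boolean per node marking the selected block. A short caption between two long paragraphs would be pulled in by the gap filling, while a lone text-heavy node far from the main block would be dropped.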
Keywords
Content Extraction, Information Extraction, Web Page Segmentation, Noise Removal, Information Retrieval, Web Mining