A hybrid approach for content extraction with text density and visual importance of DOM nodes

Knowledge and Information Systems(2013)

引用 26|浏览25
暂无评分
摘要
Additional contents in web pages, such as navigation panels, advertisements, copyrights and disclaimer notices, are typically not related to the main subject and may hamper the performance of Web data mining. They are traditionally taken as noises and need to be removed properly. To achieve this, two intuitive and crucial kinds of information—the textual information and the visual information of web pages—is considered in this paper. Accordingly, Text Density and Visual Importance are defined for the Document Object Model (DOM) nodes of a web page. Furthermore, a content extraction method with these measured values is proposed. It is a fast, accurate and general method for extracting content from diverse web pages. And with the employment of DOM nodes, the original structure of the web page can be preserved. Evaluated with the CleanEval benchmark and with randomly selected pages from well-known Web sites, where various web domains and styles are tested, the effect of the method is demonstrated. The average F1-scores with our method were 8.7 % higher than the best scores among several alternative methods.
更多
查看译文
关键词
Content extraction,Text density,Visual importance
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要