Document Structure Meets Page Layout: Loopy Random Fields For Web News Content Extraction

DOCENG(2010)

引用 10|浏览18
暂无评分
摘要
Web content extraction is concerned with the automatic identification of semantically interesting web page regions. To generalize to pages from unknown sites, it is crucial to exploit not only the local characteristics of a particular web page region, but also the rich interdependencies that exist between the regions and their latent semantics. We therefore propose a loopy conditional random field which combines semantic intra-page dependencies derived from both document structure and page layout, uses a realistic set of local and relational features and is efficiently learnt in the tree-based reparameterization framework. The results of our empirical analysis on a corpus of real-world news web pages from 177 distinct sites with multiple annotations on DOM node level demonstrate that our combination of document structure and layout-driven interdependencies leads to a significant error reduction on the semantically interesting regions of a web page.
更多
查看译文
关键词
Web Content Extraction,Loopy Conditional Random Fields,Tree-based Reparameterization,News600 Data Set
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要