Archiving Data Objects using Web Feeds

msra(2010)

Abstract
In this paper, we show how Web feeds can be used to archive Web pages that contain temporal data objects, such as blog posts or news items. We use RSS or Atom feeds to extract these Web objects and to detect change in the context of an incremental crawl. We first present some statistics on Web feeds, obtained by studying the evolution of a collection of feeds over a period of time and observing their temporal aspects. To detect change on crawled Web pages that have an associated Web feed, we present an algorithm that extracts the information of interest (the data object), with the aim of analyzing changes effectively without being misled by changes in the surrounding boilerplate. Our algorithm applies a bottom-up strategy on the HTML DOM tree and uses n-grams extracted from the title and the description of a feed item to match conceptual leaf nodes in the HTML page. These conceptual nodes are then clustered according to their lowest block-level common ancestor. The resulting block-level nodes correspond to semantic zones in the Web page, and by taking the one that is the most semantically dense, the algorithm identifies the node that acts as a wrapper for the article. We then extract the textual content and the references of the article from this node and encapsulate the result in a timely unit. Experiments are done

* This research was funded by the European Research Council grant Webdam FP7-ICT-226513.
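The matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Node`, `TreeBuilder`, `ngrams`, `block_ancestor`, and `find_wrapper`, the `BLOCK_TAGS` set, the choice of word 3-grams, and the density score (number of matching n-grams per block) are all assumptions made for illustration. It also simplifies the clustering by assigning each matching text leaf to its own lowest block-level ancestor.

```python
import re
from html.parser import HTMLParser

# Assumption: a small, hand-picked set of block-level and void tags.
BLOCK_TAGS = {"div", "article", "section", "main", "td", "li", "body"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """Minimal DOM node; text leaves use the tag '#text'."""
    def __init__(self, tag, parent=None, attrs=None):
        self.tag = tag
        self.parent = parent
        self.attrs = dict(attrs or [])
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple tree and records non-empty text leaves."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.leaves = []

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current, attrs)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the matching open tag; ignore unmatched closes.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            leaf = Node("#text", self.current)
            leaf.text = data
            self.current.children.append(leaf)
            self.leaves.append(leaf)


def ngrams(text, n=3):
    """Set of lowercase word n-grams in text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def block_ancestor(leaf):
    """Lowest block-level ancestor of a text leaf (the root if none)."""
    node = leaf.parent
    while node.parent is not None and node.tag not in BLOCK_TAGS:
        node = node.parent
    return node


def find_wrapper(html, title, description, n=3):
    """Return the block node with the most feed n-gram matches, or None."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    feed_grams = ngrams(title, n) | ngrams(description, n)
    scores = {}
    for leaf in builder.leaves:
        hits = len(ngrams(leaf.text, n) & feed_grams)
        if hits:
            block = block_ancestor(leaf)
            scores[block] = scores.get(block, 0) + hits
    return max(scores, key=scores.get) if scores else None
```

Calling `find_wrapper(page_html, item_title, item_description)` with the title and description of a feed item returns the candidate wrapper node, from which text and links could then be collected. Text split across inline tags produces separate leaves here, so a fuller version would likely merge inline runs before computing n-grams.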