FOREST: Focused Object Retrieval by Exploiting Significant Tag Paths

WebDB'15: Proceedings of the 18th International Workshop on Web and Databases (2015)

Abstract
Content-intensive websites, such as blogs and news sites, present pages containing Web articles that are automatically generated by content management systems. Identifying and extracting their main content is critical in many applications, such as indexing and classification. We present a novel unsupervised approach for extracting Web articles from dynamically generated Web pages. Our system, called Forest, combines structural and information-based features to target the main content generated by a Web source and published in its associated Web pages. We extensively evaluate Forest against various baselines and datasets, and report improved results over state-of-the-art content extraction techniques.
Keywords
focused object retrieval, significant tag paths, forest