Main Content Detection in HTML Journal Articles

Proceedings of the ACM Symposium on Document Engineering (DocEng 2018)

Abstract
Web content extraction algorithms have been shown to improve the performance of web content analysis tasks. This is because noisy web page content, such as advertisements and navigation links, can significantly degrade performance. This paper presents a novel and effective layout analysis algorithm for main content detection in HTML journal articles. The algorithm first segments a web page based on rendered line breaks, then based on its column structure, and finally identifies the column that contains the most paragraph text. On a test set of 359 manually labeled HTML journal articles, the proposed layout analysis algorithm was found to significantly outperform an alternative semantic markup algorithm based on HTML 5 semantic tags. The precision, recall, and F-score of the layout analysis algorithm were measured to be 0.96, 0.99, and 0.98 respectively.
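
The three steps named in the abstract (segment on rendered line breaks, segment into columns, pick the column with the most paragraph text) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Block` data model, the fixed whitespace `gap` threshold, and the use of character counts as the measure of "most paragraph text" are all assumptions made for illustration, and the sketch presumes that a browser or rendering engine has already supplied bounding boxes for each text block.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A rendered text block with its bounding box (hypothetical model)."""
    x: float
    y: float
    width: float
    height: float
    text: str
    is_paragraph: bool = True  # assumed to be decided upstream, e.g. from <p> tags


def split_into_bands(blocks: List[Block], gap: float = 20.0) -> List[List[Block]]:
    """Group blocks into horizontal bands separated by vertical whitespace."""
    bands: List[List[Block]] = []
    bottom = None  # lowest edge of the band currently being built
    for b in sorted(blocks, key=lambda blk: blk.y):
        if bottom is None or b.y - bottom > gap:
            bands.append([b])
            bottom = b.y + b.height
        else:
            bands[-1].append(b)
            bottom = max(bottom, b.y + b.height)
    return bands


def split_into_columns(band: List[Block], gap: float = 20.0) -> List[List[Block]]:
    """Group the blocks of one band into columns separated by horizontal whitespace."""
    columns: List[List[Block]] = []
    right = None  # rightmost edge of the column currently being built
    for b in sorted(band, key=lambda blk: blk.x):
        if right is None or b.x - right > gap:
            columns.append([b])
            right = b.x + b.width
        else:
            columns[-1].append(b)
            right = max(right, b.x + b.width)
    return columns


def paragraph_chars(column: List[Block]) -> int:
    """Amount of paragraph text in a column, measured in characters (an assumption)."""
    return sum(len(b.text) for b in column if b.is_paragraph)


def detect_main_content(blocks: List[Block]) -> List[Block]:
    """Return the column holding the most paragraph text, or [] if there are no blocks."""
    candidates = [
        column
        for band in split_into_bands(blocks)
        for column in split_into_columns(band)
    ]
    if not candidates:
        return []
    return max(candidates, key=paragraph_chars)
```

On a page with a full-width header, a left navigation column, a central article column, and a right advertisement column, this sketch would be expected to return the blocks of the central column, provided the whitespace between regions exceeds the assumed `gap`.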
Keywords
web page segmentation, web content extraction, HTML 5