Learning to Extract Content from News Webpages

Bradford (2009)

Abstract
We consider the problem of content extraction from online news Web pages. To explore to what extent the syntactic markup and the visual structure of a Web page facilitate the extraction of its content, we compare two state-of-the-art classifiers as first instantiations of a general framework that allows for proper model comparison. To this end, we introduce the publicly available NEWS600 corpus, a set of 604 real-world news Web pages annotated with 30 semantic labels. An empirical analysis of the two models on this dataset shows that the inclusion of structural information is indeed advantageous.
Keywords
dataset shows, semantic label, proper model comparison, empirical analysis, real world news webpages, extract content, general framework, online news webpages, news webpages, state-of-the-art classifier, news600 corpus, content extraction, conditional random field, content management, web content mining, scattering, speech, classification, feature extraction, accuracy, data mining, information analysis, support vector machines, information retrieval, navigation, conditional random fields, random processes, labeling, probability density function