Distilling Informative Content from HTML News Pages

Web Intelligence and Intelligent Agent Technologies, 2009. WI-IAT '09. IEEE/WIC/ACM International Joint Conferences (2009)

Abstract
The Web as a whole suffers from information overload, and so do its component molecules, the Web documents it contains. In particular, HTML news pages have become aggregates of cornucopian information blocks, such as advertisements, link lists, disclaimers and terms of use, or comments from readers. As a result, only a small fraction of a page's textual content is dedicated to the actual news article. This amalgamation of relevant content with page clutter poses considerable problems for applications that make use of such news information, such as search engines. We present an approach for separating relevant from irrelevant textual content in an automated fashion. Our system extracts linguistic and structural features from merged text segments and then applies various classifiers. We have conducted empirical analyses to compare our approach's classification performance with a human gold standard as well as two benchmark systems.
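
The abstract does not spell out the features or classifiers used, so the following is only a rough illustration of the general idea of scoring text segments by structural and linguistic cues. Everything in it is an assumption made for illustration: the block-level segmentation, the feature names (`n_words`, `link_density`, `avg_sentence_len`, `ends_with_punct`), and the hand-set thresholds in `is_article_text`, which stand in for the trained classifiers the paper refers to. The segment-merging step mentioned in the abstract is omitted.

```python
import re
from html.parser import HTMLParser

# Assumed set of tags that delimit text segments (not taken from the paper).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "blockquote"}


class SegmentParser(HTMLParser):
    """Split a page into text segments at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.segments = []   # list of (text, characters found inside <a> tags)
        self._buf = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0       # depth inside <script>/<style>

    def _flush(self):
        text = " ".join("".join(self._buf).split())
        if text:
            self.segments.append((text, self._link_chars))
        self._buf, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._buf.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def features(text, link_chars):
    """Compute a few illustrative structural and linguistic features."""
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "n_words": len(words),
        "link_density": min(1.0, link_chars / max(1, len(text))),
        "avg_sentence_len": len(words) / max(1, len(sentences)),
        "ends_with_punct": text.endswith((".", "!", "?", '"')),
    }


def is_article_text(f):
    """Hand-set thresholds standing in for a trained classifier."""
    return f["n_words"] >= 10 and f["link_density"] < 0.3 and f["ends_with_punct"]


def extract_article(html):
    """Return the segments that the rule above labels as article text."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    return [t for t, lc in parser.segments if is_article_text(features(t, lc))]
```

In this sketch, short link-heavy segments such as navigation lists and terse blocks such as "Terms of use" are discarded, while longer punctuated paragraphs are kept. A real system would replace the thresholds with a classifier trained on labelled segments.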
Keywords
html news pages,web document,particular html news page,news information,cornucopian information block,textual content,distilling informative content,irrelevant textual content,actual news article,automated fashion,relevant content,information overload,search engines,data mining,information content,text segmentation,gold standard,search engine,intelligent agent,html,gold