Language engineering techniques for web archiving

msra

Abstract
Advanced information processing can enable automatic location of content on the Web and decisions about its suitability for archiving, and can thus dramatically improve the accuracy and efficiency of building large-scale Web archives. This paper presents preliminary results from a research project (WATSON) that aims to adapt various Language Engineering technologies to facilitate large-scale Web archiving as well as the mining of Web archives. The former includes pre-filtering and categorization of sites to define, based on criteria, a focus subset of the Web to be continuously crawled, and site categorization to facilitate manual selection of important deep Web sites. Pre-filtering of commercial Web sites achieves 100% precision at 70% recall. A workstation prototype that aggregates useful information for professionals is presented. The latter encompasses collection mining, with emphasis on the study of content evolution and the analysis of political discourse. We present results obtained on the 2002 French election collection made by BnF.