A Novel Method for Bilingual Web Page Mining Via Search Engines

Journal of Chinese Information Processing (2011)

Abstract
We present a new approach for acquiring bilingual web pages from the result pages of search engines. The approach consists of two challenging tasks. The first task is to automatically detect the web records embedded in the result pages by applying a clustering method to a sample page. Identifying these records makes it possible to generate highly effective features for the second task, high-quality bilingual web page acquisition, which is treated as a classification problem. One advantage of the approach is that it is independent of both the search engine and the domain. The evaluation uses 2,516 records extracted automatically from six search engines and annotated manually, on which the approach achieves a precision of 81.3% and a recall of 94.93%. These results indicate that the approach is effective.
Keywords
parallel corpora, bilingual web pages, web mining
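
The two-stage idea in the abstract (detect records, then classify each one) can be illustrated with a minimal sketch. The abstract does not specify the clustering method, the features, or the language pair, so everything below is an assumption made for illustration: records are grouped by an exact tag path (a crude stand-in for clustering), the classifier is a single threshold on the mix of CJK and Latin characters (a Chinese-English pair is assumed), and the names `group_records`, `script_mix`, `looks_bilingual`, and `precision_recall` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a Chinese-English pair, detected by character ranges.
CJK = re.compile(r"[\u4e00-\u9fff]")
LATIN = re.compile(r"[A-Za-z]")


def group_records(nodes):
    """Group (tag_path, text) pairs by tag path and return the texts of
    the largest group, taken here as the result records of the page."""
    groups = defaultdict(list)
    for tag_path, text in nodes:
        groups[tag_path].append(text)
    return max(groups.values(), key=len)


def script_mix(text):
    """Share of the minority script among CJK and Latin characters,
    from 0.0 (one script only) to 0.5 (evenly mixed)."""
    cjk = len(CJK.findall(text))
    latin = len(LATIN.findall(text))
    total = cjk + latin
    return min(cjk, latin) / total if total else 0.0


def looks_bilingual(text, threshold=0.2):
    """Hypothetical one-feature classifier: a record counts as bilingual
    when both scripts are reasonably represented."""
    return script_mix(text) >= threshold


def precision_recall(predicted, gold):
    """Precision and recall of boolean predictions against gold labels."""
    pairs = list(zip(predicted, gold))
    tp = sum(p and g for p, g in pairs)
    fp = sum(p and not g for p, g in pairs)
    fn = sum(g and not p for p, g in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy page: two repeated list items (records) and one navigation node.
    page = [
        ("html/body/div/li", "机器翻译 machine"),
        ("html/body/div/li", "English only text"),
        ("html/body/div/span", "nav"),
    ]
    records = group_records(page)
    print([looks_bilingual(r) for r in records])
```

A real system would replace the exact tag-path grouping with a similarity-based clustering of page structure and the threshold rule with a trained classifier over several features. The evaluation step stays the same: compare the predictions with manually annotated labels and report precision and recall.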