Parallel sentences mining from the web

Journal of Computational Information Systems (2009)

Abstract
Parallel sentences can benefit many NLP applications (e.g., machine translation and cross-language information retrieval). In this paper, candidate bilingual web pages are retrieved by submitting sentence pairs to a search engine and are then validated by surface patterns. We propose an algorithm for extracting candidate bilingual resources and filtering out useless bilingual web pages. The sentence pairs contained in the candidate bilingual web pages are verified by a maximum entropy classifier that combines length, word-overlap, alignment, and text-location features. Training sets and mining seeds are acquired automatically. Experiments show satisfactory parallel resource mining performance. 1553-9105/ Copyright © 2009 Binary Information Press.
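The verification step can be pictured with a small sketch. A binary maximum entropy classifier is equivalent to logistic regression, so the code below trains one by stochastic gradient ascent on two of the four feature types named in the abstract: length and word overlap. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the toy lexicon, the training pairs, and the exact feature definitions. The alignment and text-location features are omitted.

```python
import math

def features(src_tokens, tgt_tokens, lexicon):
    """Return [bias, length_ratio, word_overlap] for one candidate pair.

    Hypothetical feature definitions, not the paper's:
    length_ratio = shorter length / longer length,
    word_overlap = share of source words with a lexicon translation
                   that appears in the target sentence.
    """
    longer = max(len(src_tokens), len(tgt_tokens), 1)
    length_ratio = min(len(src_tokens), len(tgt_tokens)) / longer
    tgt_set = set(tgt_tokens)
    hits = sum(1 for w in src_tokens if lexicon.get(w, set()) & tgt_set)
    word_overlap = hits / max(len(src_tokens), 1)
    return [1.0, length_ratio, word_overlap]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, x):
    """Probability that the pair described by feature vector x is parallel."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

def train(samples, labels, epochs=500, lr=0.5):
    """Fit logistic-regression weights by stochastic gradient ascent."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights

# Toy English-Chinese lexicon and training pairs (illustrative only).
lexicon = {"cat": {"猫"}, "dog": {"狗"}, "eats": {"吃"}, "fish": {"鱼"}}
pairs = [
    (["cat", "eats", "fish"], ["猫", "吃", "鱼"], 1),  # parallel
    (["dog", "eats"], ["狗", "吃"], 1),                # parallel
    (["cat", "eats", "fish"], ["狗"], 0),              # not parallel
    (["dog"], ["猫", "吃", "鱼"], 0),                  # not parallel
]
samples = [features(s, t, lexicon) for s, t, _ in pairs]
labels = [y for _, _, y in pairs]
weights = train(samples, labels)

# Score two unseen candidate pairs.
p_parallel = predict(weights, features(["cat", "eats"], ["猫", "吃"], lexicon))
p_unrelated = predict(weights, features(["fish"], ["狗", "吃", "猫"], lexicon))
```

In this sketch a pair would be accepted when its score exceeds 0.5; a real system would tune that threshold on held-out data and add the remaining feature types.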
Keywords
bilingual content extraction, bilingual web page selection, parallel sentence verification