JHU/APL Experiments in Tokenization and Non-Word Translation

Comparative Evaluation of Multilingual Information Access Systems (2003)

Abstract
In the past we have conducted experiments that investigate the benefits and peculiarities attendant to alternative methods for tokenization, particularly overlapping character n-grams. This year we continued this line of work and report new findings reaffirming that the judicious use of n-grams can lead to performance surpassing that of word-based tokenization. In particular we examined: the relative performance of n-grams and a popular suffix stemmer; a novel form of n-gram indexing that approximates stemming and achieves fast run-time performance; various lengths of n-grams; and the use of n-grams for robust translation of queries using an aligned parallel text. For the CLEF 2003 evaluation we submitted monolingual and bilingual runs for all languages and language pairs and multilingual runs using English as a source language. Our key findings are that shorter n-grams (n=4 and n=5) outperform a popular stemmer in non-Romance languages, that direct translation of n-grams is feasible using an aligned corpus, that translated 5-grams yield superior performance to words, stems, or 4-grams, and that a combination of indexing methods is best of all.
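
For readers unfamiliar with the term, overlapping character n-gram tokenization can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `char_ngrams`, the lowercasing, the whitespace collapsing, the single-space padding at both ends, and the choice to let n-grams span word boundaries are all assumptions made for illustration.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of `text`.

    Hypothetical helper for illustration only; the normalization below
    is an assumption, not something the abstract specifies.
    """
    # Assumption: lowercase the text and collapse runs of whitespace.
    normalized = " ".join(text.lower().split())
    if not normalized:
        return []
    # Assumption: pad with one space on each side so that word-initial
    # and word-final n-grams are distinguishable from word-internal ones.
    padded = f" {normalized} "
    # Slide a window of width n one character at a time. If the padded
    # text is shorter than n, the range is empty and no n-grams result.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    for n in (4, 5):
        print(n, char_ngrams("juggling", n))
```

Because consecutive n-grams share n-1 characters, morphological variants of a word produce many of the same tokens, which is one way to see why n-grams might substitute for a stemmer.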
Keywords
indexation