We described how to compute statistical significance intervals for metrics such as BLEU for small test sets, using bootstrap resampling methods
Statistical Significance Tests for Machine Translation Evaluation
EMNLP, pp. 388–395 (2004)
If two translation systems differ in performance on a test set, can we trust that this indicates a difference in true system quality? To answer this question, we describe bootstrap resampling methods to compute statistical significance of test results, and validate them on the concrete example of the BLEU score. Even for small ...
- The field of machine translation has been changed by the emergence both of effective statistical methods to automatically train machine translation systems from translated text sources and of reliable automatic evaluation methods.
- Machine translation systems can be built and evaluated from black-box tools and parallel corpora, with no human involvement at all.
- The evaluation of machine translation systems has changed dramatically in the last few years.
- It is desirable to have output that conveys the meaning of the input and is fluent text that can be read
- These two goals, adequacy and fluency, are the main criteria in machine translation evaluation
- Instead of reporting human judgment of translation quality, researchers rely on automatic measures, most notably the BLEU score, which measures n-gram overlap with reference translations
- Since it has been shown that the BLEU score correlates with human judgment, an improvement in BLEU is taken as evidence for improvement in translation quality
- Phrase-based machine translation systems make use of a language model trained for the target language and a translation model trained from a parallel corpus
- If, say, one system outperforms the other in 95% of the bootstrap samples, we conclude that it is better with 95% statistical significance
- The Spanish system is better by up to 4%.
- The purpose of experimental testing is to assess the true translation quality of a system on text from a certain domain.
- In practice, the authors can only ever measure the performance of the system on a specific sample.
- From this test result, the authors would like to infer what the true translation performance is
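The paired bootstrap comparison described in the bullets above can be made concrete. The sketch below is an illustration of the method as the paper describes it, not the authors' code: `paired_bootstrap`, `exact_match`, and all parameter names are invented here, and a trivial exact-match rate stands in for BLEU.

```python
import random

def paired_bootstrap(metric, sys_a, sys_b, refs, samples=1000, seed=0):
    """Estimate how often system A beats system B on bootstrap-resampled
    test sets. `metric` maps (hypotheses, references) -> a corpus score."""
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(samples):
        # Draw a new test set of size n with replacement; crucially, the
        # SAME sentence indices are used for both systems (paired test).
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = metric([sys_a[i] for i in idx], [refs[i] for i in idx])
        score_b = metric([sys_b[i] for i in idx], [refs[i] for i in idx])
        if score_a > score_b:
            wins += 1
    return wins / samples

# Toy stand-in for BLEU: fraction of hypotheses identical to the reference.
def exact_match(hyps, refs):
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)
```

If `paired_bootstrap` returns at least 0.95, one would conclude, as in the bullet above, that system A is better with 95% statistical significance.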
- Summary and Outlook
- Having a trusted experimental framework is essential for drawing conclusions on the effects of system changes.
- One important element of a solid experimental framework is a statistical significance test that allows us to judge whether a change in score, resulting from a change in the system, truly reflects a change in overall translation quality.
- The authors applied bootstrap resampling to machine translation evaluation and described methods to compute statistical significance intervals and levels for machine translation evaluation metrics.
- The authors described how to compute statistical significance intervals for metrics such as BLEU for small test sets, using bootstrap resampling methods.
- The authors provided empirical evidence that the computed intervals are accurate
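The significance intervals mentioned in the summary can be sketched the same way: resample the test set with replacement, recompute the metric each time, and take the middle portion of the resulting scores (the percentile method of Efron and Tibshirani, 1994). Function and parameter names below are again invented for illustration, and any corpus-level metric such as BLEU can be plugged in.

```python
import random

def bootstrap_interval(metric, hyps, refs, level=0.95, samples=1000, seed=0):
    """Percentile bootstrap confidence interval for a corpus-level metric.
    Returns (lower, upper) score bounds at the given confidence level."""
    rng = random.Random(seed)
    n = len(refs)
    scores = []
    for _ in range(samples):
        # Resample the test set with replacement and re-score it.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([hyps[i] for i in idx], [refs[i] for i in idx]))
    scores.sort()
    cut = int(((1 - level) / 2) * samples)  # e.g. 2.5% in each tail at 95%
    return scores[cut], scores[samples - 1 - cut]
```

The true corpus score should fall inside the returned interval at roughly the requested confidence level, which is the property the authors validate empirically.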
- Table 1: Translation quality of three systems, measured by the BLEU score and n-gram precision. Three different systems are compared here (we will get later into the nature of these systems). While the unigram precision of the three systems hovers around 60%, the difference in 4-gram precision is much larger. The Finnish system has only roughly half (7.8%) the 4-gram precision of the Spanish system (14.7%). This is the cause of the relatively large distance in overall BLEU (28.9% vs. 20.2%). Higher n-grams (and we could go beyond 4) not only measure syntactic cohesion and semantic adequacy of the output, but also give discriminatory power to the metric
- Table 2: Values for Ø for different sizes and significance levels (Formula 5)
- Table 3: The table displays how often a conclusion with 95% statistical significance is made for different system comparisons and different sample sizes. 12%/1% means 12% correct and 1% wrong conclusions. 30,000 test sentences are divided into 300, 100, 50, and 10 samples, each the size of 100, 300, 600, and 3000 sentences respectively
- Table 4: Validation of the statistical significance estimations: number of conclusions drawn at a certain level and accuracy of the conclusions
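The Table 1 caption turns on the fact that BLEU is, up to the brevity penalty, the geometric mean of the n-gram precisions, so a gap in 4-gram precision propagates strongly into the overall score. The sketch below illustrates that arithmetic; the bigram and trigram precisions are made-up round numbers, and only the unigram (around 60%) and 4-gram (14.7%) values echo the caption.

```python
import math

def bleu_from_precisions(precisions, brevity_penalty=1.0):
    """BLEU score as the brevity penalty times the geometric mean of the
    n-gram precisions (Papineni et al., 2002)."""
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty * math.exp(log_mean)

# Hypothetical precisions p1..p4: p2 and p3 are invented for illustration;
# p1 and p4 echo the Spanish system in Table 1.
score = bleu_from_precisions([0.60, 0.35, 0.22, 0.147])
```

With these assumed intermediate values the score comes out around 0.29; since the metric is a geometric mean, a low 4-gram term pulls the overall score down disproportionately, which is the discriminatory power the caption refers to.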
- Brown, P., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D., Mercer, R. L., and Rossin, P. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2):76–85.
- Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., and Mercer, R. L. (1993). The mathematics of statistical machine translation. Computational Linguistics, 19(2):263–313.
- Efron, B. and Tibshirani, R. J. (1994). An Introduction to the Bootstrap. CRC Press.
- Germann, U. (2003). Greedy decoding for statistical machine translation in almost linear time. In Proceedings of HLT-NAACL.
- Koehn, P. (2002). Europarl: A multilingual corpus for evaluation of machine translation. Unpublished, http://www.isi.edu/koehn/europarl/.
- Koehn, P. (2004). Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Proceedings of AMTA.
- Koehn, P., Och, F. J., and Marcu, D. (2003). Statistical phrase based translation. In Proceedings of HLT-NAACL.
- Kumar, S. and Byrne, W. (2004). Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of HLT-NAACL.
- Melamed, I. D., Green, R., and Turian, J. P. (2003). Precision and recall of machine translation. In Proceedings of HLT-NAACL.
- Och, F. J. (2002). Statistical Machine Translation: From Single-Word Models to Alignment Templates. PhD thesis, RWTH Aachen, Germany.
- Och, F. J. (2003). Minimum error rate training for statistical machine translation. In Proceedings of ACL.
- Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL.
- Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (2002). Numerical Recipes in C++. Cambridge University Press.
- Tillmann, C. (2003). A projection extension algorithm for statistical machine translation. In Collins, M. and Steedman, M., editors, Proceedings of EMNLP, pages 1–8.
- Vogel, S., Zhang, Y., Huang, F., Tribble, A., Venugopal, A., Zhao, B., and Waibel, A. (2003). The CMU statistical machine translation system. In Proceedings of MT Summit IX.
- Zens, R., Och, F. J., and Ney, H. (2002). Phrase-based statistical machine translation. In Proceedings of the German Conference on Artificial Intelligence (KI 2002).