Statistical Significance Tests for Machine Translation Evaluation

EMNLP, pp.388-395, (2004)

Cited by: 1060 | Views: 157

Abstract

If two translation systems differ in performance on a test set, can we trust that this indicates a difference in true system quality? To answer this question, we describe bootstrap resampling methods to compute statistical significance of test results, and validate them on the concrete example of the BLEU score. Even for small ...

Introduction
  • The field of machine translation has been changed by the emergence both of effective statistical methods to automatically train machine translation systems from translated text sources and of reliable automatic evaluation methods.
  • Machine translation systems can be built and evaluated from black box tools and parallel corpora, with no human involvement at all.
  • The evaluation of machine translation systems has changed dramatically in the last few years.
  • It is desirable to have fluent output that can be read
  • These two goals, adequacy and fluency, are the main criteria in machine translation evaluation
Highlights
  • The field of machine translation has been changed by the emergence both of effective statistical methods to automatically train machine translation systems from translated text sources and of reliable automatic evaluation methods
  • Instead of reporting human judgment of translation quality, researchers rely on automatic measures, most notably the BLEU score, which measures n-gram overlap with reference translations
  • Since it has been shown that the BLEU score correlates with human judgment, an improvement in BLEU is taken as evidence for improvement in translation quality
  • Phrase-based machine translation systems make use of a language model trained for the target language and a translation model trained from a parallel corpus
  • If one system outperforms the other on 95% of the resampled test sets, we conclude that it is better with 95% statistical significance
  • We described how to compute statistical significance intervals for metrics such as BLEU for small test sets, using bootstrap resampling methods
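The comparison rule in the highlights above (one system winning on 95% of resampled test sets) can be sketched as a paired bootstrap. This is a minimal illustration, not the authors' implementation: the names `corpus_score` and `paired_bootstrap`, the default of 1000 resamples, and the toy metric (matched tokens over output tokens, standing in for BLEU) are all assumptions made for the example.

```python
import random

def corpus_score(stats):
    """Toy corpus-level metric: matched tokens / output tokens.

    An assumed stand-in for BLEU, which is likewise computed from
    counts summed over all sentences of the test set.
    """
    matched = sum(m for m, _ in stats)
    total = sum(t for _, t in stats)
    return matched / total if total else 0.0

def paired_bootstrap(stats_a, stats_b, n_resamples=1000, seed=0):
    """Return the fraction of resampled test sets on which A beats B.

    stats_a[i] and stats_b[i] must describe the same test sentence i,
    so each resampled test set is scored for both systems (paired).
    """
    assert len(stats_a) == len(stats_b)
    rng = random.Random(seed)
    n = len(stats_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw n sentence indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = corpus_score([stats_a[i] for i in idx])
        score_b = corpus_score([stats_b[i] for i in idx])
        if score_a > score_b:
            wins += 1
    return wins / n_resamples

# Hypothetical per-sentence (matched, total) counts for two systems.
stats_a = [(8, 10)] * 50
stats_b = [(5, 10)] * 50
win_rate = paired_bootstrap(stats_a, stats_b)
```

Under this sketch, a `win_rate` of at least 0.95 would be read as "A is better than B with 95% statistical significance". The sentence indices are drawn once per resample and shared by both systems, which is what makes the test paired.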
Results
  • The Spanish system is better by up to 4%.
  • The purpose of experimental testing is to assess the true translation quality of a system on text from a certain domain.
  • The authors can only ever measure the performance of the system on a specific sample.
  • From this test result, the authors would like to conclude what the true translation performance is
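One way to go from a single test sample to a statement about true performance is a bootstrap confidence interval. The sketch below is an illustration under stated assumptions: the function names, the toy metric (matched tokens over output tokens, in place of BLEU), the 1000 resamples, and the per-sentence counts are hypothetical and not taken from the paper.

```python
import random

def precision(stats):
    """Toy corpus-level metric (assumed stand-in for BLEU)."""
    matched = sum(m for m, _ in stats)
    total = sum(t for _, t in stats)
    return matched / total if total else 0.0

def bootstrap_interval(stats, score_fn, n_resamples=1000, level=0.95, seed=0):
    """Estimate a confidence interval for score_fn on the test set.

    Resamples the test sentences with replacement, scores every
    resampled test set, and cuts an equal share of scores from each tail.
    """
    rng = random.Random(seed)
    n = len(stats)
    scores = []
    for _ in range(n_resamples):
        sample = [stats[rng.randrange(n)] for _ in range(n)]
        scores.append(score_fn(sample))
    scores.sort()
    drop = int(n_resamples * (1 - level) / 2)  # scores cut from each tail
    return scores[drop], scores[n_resamples - 1 - drop]

# Hypothetical per-sentence (matched, total) counts for one system.
data_rng = random.Random(1)
stats = [(data_rng.randint(3, 9), 10) for _ in range(200)]
low, high = bootstrap_interval(stats, precision)
```

With the defaults, the 25 lowest and 25 highest of the 1000 resampled scores are discarded, and the remaining range `[low, high]` is read as the 95% interval for the metric.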
Conclusion
  • Having a trusted experimental framework is essential for drawing conclusions on the effects of system changes.
  • One important element of a solid experimental framework is a statistical significance test that allows researchers to judge whether a change in score that comes from a change in the system truly reflects a change in overall translation quality.
  • The authors applied bootstrap resampling to machine translation evaluation and described methods to compute statistical significance intervals and levels for machine translation evaluation metrics.
  • The authors described how to compute statistical significance intervals for metrics such as BLEU for small test sets, using bootstrap resampling methods.
  • The authors provided empirical evidence that the computed intervals are accurate
Tables
  • Table 1: Translation quality of three systems, measured by the BLEU score and n-gram precision. Six different systems are compared here (we will get later into the nature of these systems). While the unigram precision of the three systems hovers around 60%, the difference in 4-gram precision is much larger. The Finnish system has only roughly half (7.8%) of the 4-gram precision of the Spanish system (14.7%). This is the cause of the relatively large distance in overall BLEU (28.9% vs. 20.2%). Higher n-grams (and we could go beyond 4) measure not only syntactic cohesion and semantic adequacy of the output, but also give discriminatory power to the metric
  • Table 2: Values for t for different sizes and significance levels (Formula 5)
  • Table 3: The table displays how often a conclusion with 95% statistical significance is made for different system comparisons and different sample sizes. 12%/1% means 12% correct and 1% wrong conclusions. 30,000 test sentences are divided into 300, 100, 50, and 10 samples, each the size of 100, 300, 600, and 3000 sentences respectively
  • Table 4: Validation of the statistical significance estimations: Number of conclusions drawn at a certain level and accuracy of the conclusions
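The sample counts quoted in the Table 3 caption follow from dividing the 30,000 test sentences into equal-sized pieces. The short sketch below checks that arithmetic; the helper name `split_into_samples` and the choice of consecutive, non-overlapping samples are assumptions for illustration, since the caption does not say how the sentences were assigned to samples.

```python
def split_into_samples(n_sentences, sample_size):
    """Return (start, end) index ranges of consecutive, non-overlapping samples.

    Assumption: samples are contiguous blocks of the test set.
    """
    last_start = n_sentences - sample_size
    return [(s, s + sample_size) for s in range(0, last_start + 1, sample_size)]

# Number of samples for each sample size named in the caption.
counts = {size: len(split_into_samples(30000, size))
          for size in (100, 300, 600, 3000)}
```

Here `counts` maps each sample size to the number of samples it yields: 300, 100, 50, and 10, matching the caption.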