compare-mt: A Tool for Holistic Comparison of Language Generation Systems

arXiv: Computation and Language, 2019.

Keywords:
salient pattern, Low Resource Languages for Emergent Incidents, diagnostic evaluation, machine translation system, statistical machine translation output

Abstract:

In this paper, we describe compare-mt, a tool for holistic analysis and comparison of the results of systems for language generation tasks such as machine translation. The main goal of the tool is to give the user a high-level and coherent view of the salient differences between systems that can then be used to guide further analysis or system improvement.

Introduction
  • Tasks involving the generation of natural language are ubiquitous in NLP, including machine translation (MT; Koehn, 2010), language generation from structured data (Reiter and Dale, 2000), summarization (Mani, 1999), dialog response generation (Oh and Rudnicky, 2000), and image captioning (Mitchell et al., 2012).
  • Unlike tasks that involve predicting a single label, such as text classification, natural language texts are nuanced, and there is no clear yes/no distinction about whether an output is correct.
  • If a developer has a hypothesis about what phenomena their method should be helping with, they can write scripts to test these assumptions automatically.
  • However, this requires deep intuition in advance about what changes to expect, which cannot be taken for granted in beginning researchers.
Highlights
  • Tasks involving the generation of natural language are ubiquitous in NLP, including machine translation (MT; Koehn, 2010), language generation from structured data (Reiter and Dale, 2000), summarization (Mani, 1999), dialog response generation (Oh and Rudnicky, 2000), and image captioning (Mitchell et al., 2012)
  • As useful as these metrics are, they are often opaque: if we see, for example, that a machine translation model has achieved a gain of one BLEU point, this does not tell us what characteristics of the output have changed
  • Manual inspection of individual examples can be informative, but finding salient patterns for unusual phenomena requires perusing a large number of examples
  • We presented an open-source tool for holistic analysis of the results of machine translation or other language generation systems
  • One concrete future plan includes better integration with example-by-example analysis, but many more improvements will be made as the need arises
Conclusion
  • The authors presented an open-source tool for holistic analysis of the results of machine translation or other language generation systems.
  • It makes it possible to discover salient patterns that may help guide further analysis.
  • Compare-mt is evolving, and the authors plan to add more functionality as it becomes necessary to further understand cutting-edge techniques for MT.
  • One concrete future plan includes better integration with example-by-example analysis, but many more improvements will be made as the need arises
Tables
  • Table 1: Aggregate score analysis with scores, confidence intervals, and pairwise significance tests (see the bootstrap sketch after this list)
  • Table 2: Examples discovered by n-gram analysis. The n-gram analysis compares the n-grams matched by each system and tries to find n-grams that each system is better at producing than the other (Akabe et al., 2014). Specifically, it counts the number of times each system matches each n-gram x, defined as m1(x) and m2(x) respectively, and calculates a smoothed probability of an n-gram match coming from one system or the other (see the second sketch after this list)
  • Table 3: Sentence-by-sentence examples
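The pairwise significance tests behind Table 1 are standardly computed with paired bootstrap resampling (Koehn, 2004). Below is a minimal Python sketch of that procedure; it assumes per-sentence quality scores averaged into a corpus score, which is a simplification (corpus BLEU is not a mean of sentence scores), and the function name is illustrative rather than compare-mt's actual API.

```python
import random

def paired_bootstrap(scores1, scores2, n_samples=1000, seed=0):
    """Paired bootstrap resampling (Koehn, 2004).

    scores1/scores2: per-sentence scores of two systems on the same
    test set. Repeatedly resample the test set with replacement and
    count how often each system comes out ahead.
    """
    assert len(scores1) == len(scores2)
    rng = random.Random(seed)
    n = len(scores1)
    wins1 = wins2 = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resampled sentence indices
        s1 = sum(scores1[i] for i in idx) / n
        s2 = sum(scores2[i] for i in idx) / n
        if s1 > s2:
            wins1 += 1
        elif s2 > s1:
            wins2 += 1
    return wins1 / n_samples, wins2 / n_samples

# Hypothetical usage: system 1 is significantly better at p < 0.05
# if it wins in more than 95% of the resamples.
```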
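For the n-gram analysis of Table 2, the caption gives the match counts m1(x) and m2(x) but not the smoothing. The sketch below assumes additive smoothing of the form p(sys1 | x) = (m1(x) + alpha) / (m1(x) + m2(x) + 2*alpha); the exact smoothing used by compare-mt and Akabe et al. (2014) may differ, and all function names here are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def matched_counts(ref, hyp, n):
    """Clipped counts of hypothesis n-grams that also appear in the reference."""
    ref_counts = Counter(ngrams(ref, n))
    return Counter({g: min(c, ref_counts[g])
                    for g, c in Counter(ngrams(hyp, n)).items()
                    if g in ref_counts})

def characteristic_ngrams(refs, hyps1, hyps2, n=2, alpha=1.0, top_k=10):
    """Rank n-grams by the smoothed probability that a match came from
    system 1: (m1(x) + alpha) / (m1(x) + m2(x) + 2 * alpha).
    High values mark n-grams that system 1 is better at producing."""
    m1, m2 = Counter(), Counter()
    for ref, h1, h2 in zip(refs, hyps1, hyps2):
        m1.update(matched_counts(ref, h1, n))
        m2.update(matched_counts(ref, h2, n))
    score = {g: (m1[g] + alpha) / (m1[g] + m2[g] + 2 * alpha)
             for g in set(m1) | set(m2)}
    return sorted(score.items(), key=lambda kv: -kv[1])[:top_k]
```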
Funding
  • This work is sponsored in part by the Defense Advanced Research Projects Agency Information Innovation Office (I2O) program Low Resource Languages for Emergent Incidents (LORELEI) under Contract No. HR0011-15-C0114.
References
  • Koichi Akabe, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Discriminative language models as a tool for machine translation error analysis. In Proc. COLING, pages 1124–1132.
  • Wilker Aziz, Sheila Castilho, and Lucia Specia. 2012. PET: a tool for post-editing and assessing machine translation. In Proc. LREC, pages 3982–3987.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. ICLR.
  • Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. 2018. Evaluating discourse phenomena in neural machine translation. In Proc. NAACL, pages 1304–1313.
  • Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus phrase-based machine translation quality: a case study. In Proc. EMNLP, pages 257–267.
  • Alexandra Birch, Miles Osborne, and Phil Blunsom. 2010. Metrics for MT evaluation: evaluating reordering. Machine Translation, 24(1):15–26.
  • Konstantinos Chatzitheodorou and Stamatis Chatzistamatis. 2013. COSTA MT evaluation tool: An open toolkit for human machine translation evaluation. The Prague Bulletin of Mathematical Linguistics, 100(1):83–89.
  • David Chiang, Adam Lopez, Nitin Madnani, Christof Monz, Philip Resnik, and Michael Subotin. 2005. The Hiero machine translation system: Extensions, evaluation, and analysis. In Proc. EMNLP, pages 779–786.
  • Steve DeNeefe, Kevin Knight, and Hayward H. Chan. 2005. Interactively exploring a machine translation model. In Proc. ACL, pages 97–100.
  • Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proc. WMT, pages 85–91.
  • Yanzhuo Ding, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Visualizing and understanding neural machine translation. In Proc. ACL, pages 1150–1159.
  • Ahmed El Kholy and Nizar Habash. 2011. Automatic error analysis for morphologically rich languages. In Proc. MT Summit, pages 225–232.
  • Christian Federmann. 2012. Appraise: an open-source toolkit for manual evaluation of MT output. The Prague Bulletin of Mathematical Linguistics, 98(1):25–35.
  • Mark Fishel, Rico Sennrich, Maja Popović, and Ondřej Bojar. 2012. TerrorCat: a translation error categorization-based MT quality metric. In Proc. WMT, pages 64–70.
  • Mary Flanagan. 1994. Error classification for MT evaluation. In Proc. AMTA, pages 65–72.
  • Meritxell González, Jesús Giménez, and Lluís Màrquez. 2012. A graphical interface for MT evaluation and error analysis. In Proc. ACL System Demonstrations, pages 139–144.
  • Pierre Isabelle, Colin Cherry, and George Foster. 2017. A challenge set approach to evaluating machine translation. In Proc. EMNLP, pages 2476–2486.
  • Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic evaluation of translation quality for distant language pairs. In Proc. EMNLP, pages 944–952.
  • Ondřej Klejch, Eleftherios Avramidis, Aljoscha Burchardt, and Martin Popel. 2015. MT-ComparEval: Graphical evaluation interface for machine translation development. The Prague Bulletin of Mathematical Linguistics, 104(1):63–74.
  • Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proc. EMNLP, pages 388–395.
  • Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press.
  • Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proc. HLT, pages 48–54.
  • Sachin Kumar and Yulia Tsvetkov. 2019. Von Mises-Fisher loss for training sequence to sequence models with continuous outputs. In Proc. ICLR.
  • Jaesong Lee, Joong-Hwi Shin, and Jun-Seok Kim. 2017. Interactive visualization and manipulation of attention-based neural machine translation. In Proc. EMNLP System Demonstrations, pages 121–126.
  • Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81.
  • Adam Lopez and Philip Resnik. 2005. Pattern visualization for machine translation output. In Proc. HLT/EMNLP Interactive Demonstrations, pages 12–13.
  • Nitin Madnani. 2011. iBLEU: Interactively debugging and scoring statistical machine translation systems. In Proc. IEEE International Conference on Semantic Computing, pages 213–214.
  • Inderjeet Mani. 1999. Advances in Automatic Text Summarization. MIT Press.
  • Paul Michel and Graham Neubig. 2018. MTNT: A testbed for machine translation of noisy text. In Proc. EMNLP.
  • Margaret Mitchell, Jesse Dodge, Amit Goyal, Kota Yamaguchi, Karl Stratos, Xufeng Han, Alyssa Mensch, Alex Berg, Tamara Berg, and Hal Daumé III. 2012. Midge: Generating image descriptions from computer vision detections. In Proc. EACL, pages 747–756.
  • Saif M. Mohammad, Mohammad Salameh, and Svetlana Kiritchenko. 2016. How translation alters sentiment. Journal of Artificial Intelligence Research, 55:95–130.
  • Mathias Müller, Annette Rios, Elena Voita, and Rico Sennrich. 2018. A large-scale test set for the evaluation of context-aware pronoun translation in neural machine translation. In Proc. WMT, pages 61–72.
  • Masaki Murata, Kiyotaka Uchimoto, Qing Ma, Toshiyuki Kanamaru, and Hitoshi Isahara. 2005. Analysis of machine translation systems' errors in tense, aspect, and modality. In Proc. PACLIC.
  • Sudip Kumar Naskar, Antonio Toral, Federico Gaspari, and Andy Way. 2011. A framework for diagnostic evaluation of MT based on linguistic checkpoints. In Proc. MT Summit, pages 529–536.
  • Graham Neubig and Junjie Hu. 2018. Rapid adaptation of neural machine translation to new languages. In Proc. EMNLP.
  • Alice H. Oh and Alexander I. Rudnicky. 2000. Stochastic language generation for spoken dialogue systems. In Proc. ANLP/NAACL Workshop on Conversational Systems, pages 27–32.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. ACL, pages 311–318.
  • Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proc. WMT, pages 392–395.
  • Maja Popović, Adrià de Gispert, Deepa Gupta, Patrik Lambert, Hermann Ney, José B. Mariño, Marcello Federico, and Rafael Banchs. 2006. Morpho-syntactic information for automatic error analysis of statistical machine translation output. In Proc. WMT, pages 1–6.
  • Maja Popović and Hermann Ney. 2007. Word error rates: Decomposition over POS classes and applications for error analysis. In Proc. WMT, pages 48–55.
  • Maja Popović and Hermann Ney. 2011. Towards automatic error analysis of machine translation output. Computational Linguistics, 37(4):657–688.
  • Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proc. NAACL.
  • Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press.
  • Devendra Sachan and Graham Neubig. 2018. Parameter sharing methods for multilingual self-attentional translation models. In Proc. WMT.
  • Rico Sennrich. 2017. How grammatical is character-level neural machine translation? Assessing MT quality with contrastive translation pairs. In Proc. EACL, pages 376–382.
  • Sara Stymne. 2011. BLAST: A tool for error analysis of machine translation output. In Proc. ACL-HLT System Demonstrations, pages 56–61.
  • David Vilar, Jia Xu, Luis Fernando d'Haro, and Hermann Ney. 2006. Error analysis of statistical machine translation output. In Proc. LREC, pages 697–702.
  • Xinyi Wang, Hieu Pham, Pengcheng Yin, and Graham Neubig. 2018. A tree-based decoder for neural machine translation. In Proc. EMNLP.
  • Jonathan Weese and Chris Callison-Burch. 2010. Visualizing data structures in parsing-based machine translation. The Prague Bulletin of Mathematical Linguistics, 93:127–136.
  • Daniel Zeman, Mark Fishel, Jan Berka, and Ondřej Bojar. 2011. Addicter: What is wrong with my translations? The Prague Bulletin of Mathematical Linguistics, 96(1):79–88.
  • Ming Zhou, Bo Wang, Shujie Liu, Mu Li, Dongdong Zhang, and Tiejun Zhao. 2008. Diagnostic evaluation of machine translation systems using automatically constructed linguistic check-points. In Proc. COLING, pages 1121–1128.