XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

ICML 2020, pp. 4411–4421.

Keywords:
machine learning, cross-lingual transfer, cross-lingual generalization, ICML, masked language modelling

Abstract:

Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. […]

Introduction
  • In natural language processing (NLP), there is a pressing need to build systems that serve all of the world’s approximately 6,900 languages, overcoming language barriers and enabling universal information access for the world’s citizens (Ruder et al., 2019; Aharoni et al., 2019; Arivazhagan et al., 2019b).
  • Many languages share similarities in syntax or vocabulary, and multilingual learning approaches that train on multiple languages while leveraging the shared structure of the input space have begun to show promise as a way to alleviate data sparsity.
  • Early work in this direction focused on single tasks, such as grammar induction (Snyder et al., 2009), part-of-speech (POS) tagging (Täckström et al., 2013), parsing (McDonald et al., 2011), and text classification (Klementiev et al., 2012).
  • Although such representations are intended to be general-purpose, their evaluation has often been performed on a very limited and often disparate set of tasks, typically translation (Glavaš et al., 2019; Lample & Conneau, 2019) and classification (Schwenk & Li, 2018; Conneau et al., 2018b), and on typologically similar languages (Conneau et al., 2018a).
Highlights
  • In natural language processing (NLP), there is a pressing need to build systems that serve all of the world’s approximately 6,900 languages, overcoming language barriers and enabling universal information access for the world’s citizens (Ruder et al., 2019; Aharoni et al., 2019; Arivazhagan et al., 2019b).
  • Over the last few years, there has been a move towards general-purpose multilingual representations that are applicable to many tasks, both at the word level (Mikolov et al., 2013; Faruqui & Dyer, 2014; Artetxe et al., 2017) and the full-sentence level (Devlin et al., 2019; Lample & Conneau, 2019).
  • Although such representations are intended to be general-purpose, their evaluation has often been performed on a very limited and often disparate set of tasks, typically translation (Glavaš et al., 2019; Lample & Conneau, 2019) and classification (Schwenk & Li, 2018; Conneau et al., 2018b), and on typologically similar languages (Conneau et al., 2018a). To address this problem and incentivize research on truly general-purpose cross-lingual representation and transfer learning, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark.
  • Training efficiency: Tasks should be trainable on a single GPU in less than a day. This is to make the benchmark accessible, in particular to practitioners working with low-resource languages under resource constraints.
  • While XTREME is still inherently limited by the data coverage of its constituent tasks for many low-resource languages, it provides significantly broader coverage and more fine-grained analysis tools to encourage research on the cross-lingual generalization ability of models.
Methods
  • Task diversity: Tasks should require multilingual models to transfer their meaning representations at different levels, e.g., words, phrases, and sentences.
  • Training efficiency: Tasks should be trainable on a single GPU in less than a day. This is to make the benchmark accessible, in particular to practitioners working with low-resource languages under resource constraints.
Results
  • Overall results: The authors show the main results in Table 2.
  • XLM-R is the best-performing zero-shot transfer model and generally improves upon mBERT significantly.
  • The improvement is smaller for the structured prediction tasks.
  • MMTE achieves performance competitive with mBERT on most tasks, with stronger results on XNLI, POS, and BUCC.
Conclusion
  • As the authors highlight in the analysis, a model’s cross-lingual transfer performance varies significantly both between tasks and between languages.
  • While XTREME is still inherently limited by the data coverage of its constituent tasks for many low-resource languages, it provides significantly broader coverage and more fine-grained analysis tools to encourage research on the cross-lingual generalization ability of models.
  • In future work, the authors plan to package demonstration code for fine-tuning models on XTREME tasks and to provide analysis tools, both to be released upon publication.
Tables
  • Table 1: Characteristics of the datasets in XTREME for the zero-shot transfer setting. For tasks that have training and dev sets in other languages, we only report the English numbers. We report the number of test examples per target language and the nature of the test sets (whether they are translations of English data or independently annotated). The number in brackets is the size of the intersection with our selected languages. For NER and POS, sizes are in sentences. Struct. pred.: structured prediction. Sent. retrieval: sentence retrieval.
  • Table 2: Overall results of baselines across all XTREME tasks. Translation-based baselines are not meaningful for sentence retrieval. We provide in-language baselines where target-language training data is available. Note that for the QA tasks, translate-test performance is not directly comparable to the other scores, as a small number of test questions were discarded and alignment is measured on the English data.
  • Table 3: The cross-lingual transfer gap (lower is better) of different models on XTREME tasks. The transfer gap is the difference between performance on the English test set and the average performance on the other languages (a worked form of this computation follows this list). A transfer gap of 0 indicates perfect cross-lingual transfer. For the QA datasets, we only show EM scores. The average gaps are computed over the sentence classification and QA tasks.
  • Table 4: Accuracy of mBERT on POS tag trigrams and 4-grams in the target-language dev data that appeared and did not appear in the English training data (see the overlap-analysis sketch following this list). We show the performance on English, the average across all other languages, and their difference.
  • Table 5: Statistics about the languages in the cross-lingual benchmark. Languages belong to 12 language families and two isolates, with Indo-European (IE) having the most members. Diacritics / special characters: the language adds diacritics (additional symbols to letters). Compounding: the language makes extensive use of word compounds. Bound words / clitics: function words attach to other words. Inflection: words are inflected to represent grammatical meaning (e.g., case marking). Derivation: a single token can represent entire phrases or sentences.
  • Table 6: Hyper-parameters of baseline and state-of-the-art models. We do not use XLM-15 and XLM-R-Base in our experiments.
  • Table 7: Comparison of F1 and EM scores of the mBERT and translate-train (mBERT) baselines on XQuAD test sets translated by professional translators (gold) and automatically translated test sets (auto).
  • Table 8: Comparison of accuracy scores of the mBERT baseline on XNLI test sets translated by professional translators (gold) and automatically translated test sets (auto) in 14 languages. BLEU and chrF scores measure the translation quality between the gold and automatically translated test sets.
  • Table 9: Pearson correlation coefficients (ρ) between zero-shot transfer performance and Wikipedia size across datasets and models.
  • Table 10: Accuracy of mBERT on the target-language dev data on POS tag trigrams and 4-grams that appeared and did not appear in the English training data. We show the average performance across all non-English languages and, on the bottom, the difference of that average from the English performance.
  • Table 11: Comparison of accuracies for entities in the target-language NER dev data that were seen in the English NER training data (a); were not seen in the English NER training data (b); consist of at most two tokens (c); consist only of Latin characters (d); and occur at least twice in the dev data (e). We only show languages where the sets (a–e) contain at least 100 entities each. We show the difference between (a) and (b) and the minimum difference between (a) and (c–e).
  • Table 12: XNLI accuracy scores for each language (en, ar, bg, de, el, es, fr, hi, ru, sw, th, tr, ur, vi, zh, and the average).
  • Table 13: Tatoeba results (accuracy) for each language.
  • Table 14: Three types of sentence embeddings from mBERT for the BUCC task: (1) the CLS token embedding in the last layer; (2) average word embeddings in the middle layers, i.e., layers 6–8; (3) the concatenation of average word embeddings from four consecutive layers, i.e., layers 1–4 (bottom), layers 5–8 (middle), and layers 9–12 (top). A sketch of extracting such embeddings follows this list.
  • Table 15: PAWS-X accuracy scores for each language.
  • Table 16: BUCC results (F1 scores) for each language.
  • Table 17: XQuAD results (F1 / EM) for each language.
  • Table 18: TyDiQA-GoldP results (F1 / EM) for each language.
  • Table 19: MLQA results (F1 / EM) for each language.
  • Table 20: POS results (accuracy) for each language.
  • Table 21: NER results (F1 score) for each language.
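
The transfer gap reported in Table 3 reduces to a one-line formula. A minimal worked form (the symbols s_en, s_l, and L are our notation, not taken from the paper):

```latex
% Transfer gap of a model on one task: the English test score s_en minus
% the average score over the set L of non-English target languages.
\mathrm{gap} = s_{\mathrm{en}} - \frac{1}{|L|} \sum_{l \in L} s_{l}
% Illustrative numbers only: with s_en = 85.0 and an average
% target-language score of 66.3, the gap is 85.0 - 66.3 = 18.7;
% a gap of 0 would indicate perfect cross-lingual transfer.
```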
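Tables 4 and 10 contrast accuracy on POS tag n-grams that do or do not occur in the English training data. The following is a minimal sketch of that style of overlap analysis; the function names, the data layout (lists of per-sentence tag lists), and the exact-match scoring of n-grams are our illustrative assumptions, not the paper's released code:

```python
from collections import Counter

def tag_ngrams(tag_sequences, n):
    """Collect the set of POS tag n-grams over a list of tagged sentences."""
    grams = set()
    for tags in tag_sequences:
        grams.update(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
    return grams

def accuracy_by_overlap(en_train_tags, tgt_gold_tags, tgt_pred_tags, n=3):
    """Split target-language n-gram accuracy by overlap with English training data.

    An n-gram counts as correct only if all n predicted tags match the gold tags.
    """
    seen = tag_ngrams(en_train_tags, n)
    correct, total = Counter(), Counter()
    for gold, pred in zip(tgt_gold_tags, tgt_pred_tags):
        for i in range(len(gold) - n + 1):
            bucket = "seen" if tuple(gold[i:i + n]) in seen else "unseen"
            total[bucket] += 1
            correct[bucket] += int(gold[i:i + n] == pred[i:i + n])
    return {b: correct[b] / total[b] for b in total}

# Toy usage: one seen trigram (predicted correctly), one unseen (missed).
en_train = [["DET", "NOUN", "VERB", "DET", "NOUN"]]
tgt_gold = [["DET", "NOUN", "VERB", "ADJ"]]
tgt_pred = [["DET", "NOUN", "VERB", "NOUN"]]
print(accuracy_by_overlap(en_train, tgt_gold, tgt_pred, n=3))
# -> {'seen': 1.0, 'unseen': 0.0}
```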
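Table 14's three sentence-representation variants can be reproduced in spirit with the HuggingFace transformers library. This is a sketch under stated assumptions: the layer groupings follow the table, but the pooling details (averaging over all sub-word positions, special tokens included) are our simplifications and may differ from the paper's exact setup:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)
model.eval()

@torch.no_grad()
def sentence_embedding(sentence: str, strategy: str = "middle") -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    # hidden_states is a 13-tuple: the input embeddings, then layers 1-12.
    hidden = model(**inputs).hidden_states
    if strategy == "cls":
        # (1) CLS token embedding from the last layer.
        return hidden[-1][0, 0]
    if strategy == "middle":
        # (2) Word embeddings averaged over layers 6-8, then over positions.
        return torch.stack(hidden[6:9]).mean(dim=0)[0].mean(dim=0)
    # (3) Concatenation of averaged embeddings from layers 1-4, 5-8, 9-12.
    groups = (hidden[1:5], hidden[5:9], hidden[9:13])
    return torch.cat([torch.stack(g).mean(dim=0)[0].mean(dim=0) for g in groups])

# Such vectors would feed a nearest-neighbour search over candidate pairs
# for BUCC-style parallel sentence mining.
vec = sentence_embedding("XTREME covers 40 typologically diverse languages.")
print(vec.shape)  # torch.Size([768])
```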
Funding
  • JH and GN are sponsored by the Air Force Research Laboratory under agreement number FA8750-19-2-0200.
References
  • Agić, Ž. and Schluter, N. Baselines and test data for cross-lingual inference. In Proceedings of LREC 2018, 2018.
  • Aharoni, R., Johnson, M., and Firat, O. Massively multilingual neural machine translation. arXiv preprint arXiv:1903.00089, 2019.
  • Arivazhagan, N., Bapna, A., Firat, O., Lepikhin, D., Johnson, M., Krikun, M., Chen, M. X., Cao, Y., Foster, G., Cherry, C., Macherey, W., Chen, Z., and Wu, Y. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019, 2019a.
  • Arivazhagan, N., Bapna, A., Firat, O., Lepikhin, D., Johnson, M., Krikun, M., Chen, M. X., Cao, Y., Foster, G., Cherry, C., Macherey, W., Chen, Z., and Wu, Y. Massively multilingual neural machine translation in the wild: Findings and challenges. CoRR, abs/1907.05019, 2019b. URL http://arxiv.org/abs/1907.05019.
  • Artetxe, M. and Schwenk, H. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 2019.
  • Artetxe, M., Labaka, G., and Agirre, E. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of ACL 2017, pp. 451–462, 2017.
  • Artetxe, M., Labaka, G., and Agirre, E. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of ACL 2018, pp. 789–798, 2018.
  • Artetxe, M., Ruder, S., and Yogatama, D. On the cross-lingual transferability of monolingual representations. arXiv preprint arXiv:1910.11856, 2019.
  • Barnes, J., Klinger, R., and Schulte im Walde, S. Bilingual sentiment embeddings: Joint projection of sentiment across languages. In Proceedings of ACL 2018, pp. 2483–2493, 2018.
  • Clark, J. H., Choi, E., Collins, M., Garrette, D., Kwiatkowski, T., Nikolaev, V., and Palomaki, J. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 2020.
  • Conneau, A., Lample, G., Ranzato, M., Denoyer, L., and Jégou, H. Word translation without parallel data. In Proceedings of ICLR 2018, 2018a.
  • Conneau, A., Rinott, R., Lample, G., Williams, A., Bowman, S., Schwenk, H., and Stoyanov, V. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of EMNLP 2018, pp. 2475–2485, 2018b.
  • Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.
  • Czarnowska, P., Ruder, S., Grave, E., Cotterell, R., and Copestake, A. Don’t forget the long tail! A comprehensive analysis of morphological generalization in bilingual lexicon induction. In Proceedings of EMNLP 2019, pp. 973–982, 2019.
  • Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL 2019, 2019.
  • Eisenschlos, J., Ruder, S., Czapla, P., Kardas, M., Gugger, S., and Howard, J. MultiFiT: Efficient multi-lingual language model fine-tuning. In Proceedings of EMNLP 2019, 2019.
  • Eriguchi, A., Johnson, M., Firat, O., Kazawa, H., and Macherey, W. Zero-shot cross-lingual classification using multilingual neural machine translation. arXiv preprint arXiv:1809.04686, 2018.
  • Faruqui, M. and Dyer, C. Improving vector space word representations using multilingual correlation. In Proceedings of EACL 2014, pp. 462–471, 2014.
  • Glavaš, G., Litschko, R., Ruder, S., and Vulić, I. How to (properly) evaluate cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions. In Proceedings of ACL 2019, 2019.
  • Gouws, S., Bengio, Y., and Corrado, G. BilBOWA: Fast bilingual distributed representations without word alignments. In Proceedings of ICML 2015, pp. 748–756, 2015.
  • Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S. R., and Smith, N. A. Annotation artifacts in natural language inference data. In Proceedings of NAACL-HLT 2018, 2018.
  • Guzmán, F., Chen, P.-J., Ott, M., Pino, J., Lample, G., Koehn, P., Chaudhary, V., and Ranzato, M. The FLoRes evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Proceedings of EMNLP 2019, pp. 6100–6113, 2019.
  • Howard, J. and Ruder, S. Universal language model fine-tuning for text classification. In Proceedings of ACL 2018, pp. 328–339, 2018.
  • Hsu, T.-y., Liu, C.-l., and Lee, H.-y. Zero-shot reading comprehension by cross-lingual transfer learning with multi-lingual language representation model. In Proceedings of EMNLP 2019, pp. 5935–5942, 2019.
  • Kementchedjhieva, Y., Hartmann, M., and Søgaard, A. Lost in evaluation: Misleading benchmarks for bilingual dictionary induction. In Proceedings of EMNLP 2019, pp. 3327–3332, 2019.
  • Klementiev, A., Titov, I., and Bhattarai, B. Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012, 2012.
  • Koppel, M. and Ordan, N. Translationese and its dialects. In Proceedings of ACL-HLT 2011, pp. 1318–1326, 2011.
  • Lample, G. and Conneau, A. Cross-lingual language model pretraining. In Proceedings of NeurIPS 2019, 2019.
  • Lee, K., Yoon, K., Park, S., and Hwang, S. W. Semi-supervised training data generation for multilingual question answering. In Proceedings of LREC 2018, pp. 2758–2762, 2018.
  • Lewis, P., Oğuz, B., Rinott, R., Riedel, S., and Schwenk, H. MLQA: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475, 2019.
  • Lin, Y.-H., Chen, C.-Y., Lee, J., Li, Z., Zhang, Y., Xia, M., Rijhwani, S., He, J., Zhang, Z., Ma, X., Anastasopoulos, A., Littell, P., and Neubig, G. Choosing transfer languages for cross-lingual learning. In Proceedings of ACL 2019, 2019.
  • Luong, T., Pham, H., and Manning, C. D. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pp. 151–159, 2015.
  • McCann, B., Bradbury, J., Xiong, C., and Socher, R. Learned in translation: Contextualized word vectors. In Proceedings of NIPS 2017, pp. 6294–6305, 2017.
  • McDonald, R., Petrov, S., and Hall, K. Multi-source transfer of delexicalized dependency parsers. In Proceedings of EMNLP 2011, pp. 62–72, 2011.
  • Mikolov, T., Le, Q. V., and Sutskever, I. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013.
  • Mohammad, S. M., Salameh, M., and Kiritchenko, S. How translation alters sentiment. Journal of Artificial Intelligence Research, 55:95–130, 2016.
  • Nangia, N. and Bowman, S. R. Human vs. Muppet: A conservative estimate of human performance on the GLUE benchmark. In Proceedings of ACL 2019, pp. 4566–4575, 2019.
  • Nivre, J., Abrams, M., Agić, Ž., Ahrenberg, L., Antonsen, L., Aranzabe, M. J., Arutie, G., Asahara, M., Ateyah, L., Attia, M., et al. Universal Dependencies 2.2, 2018.
  • Pan, X., Zhang, B., May, J., Nothman, J., Knight, K., and Ji, H. Cross-lingual name tagging and linking for 282 languages. In Proceedings of ACL 2017, pp. 1946–1958, 2017.
  • Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In Proceedings of NAACL 2018, pp. 2227–2237, 2018.
  • Pires, T., Schlinger, E., and Garrette, D. How multilingual is multilingual BERT? In Proceedings of ACL 2019, 2019.
  • Popović, M. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pp. 392–395, 2015. URL https://www.aclweb.org/anthology/W15-3049.
  • Rahimi, A., Li, Y., and Cohn, T. Massively multilingual transfer for NER. In Proceedings of ACL 2019, 2019.
  • Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP 2016, 2016.
  • Ruder, S., Vulić, I., and Søgaard, A. A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research, 65:569–631, 2019.
  • Schuster, T., Ram, O., Barzilay, R., and Globerson, A. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of NAACL 2019, 2019.
  • Schwenk, H. and Li, X. A corpus for multilingual document classification in eight languages. In Proceedings of LREC 2018, 2018.
  • Siddhant, A., Johnson, M., Tsai, H., Arivazhagan, N., Riesa, J., Bapna, A., Firat, O., and Raman, K. Evaluating the cross-lingual effectiveness of massively multilingual neural machine translation. arXiv preprint arXiv:1909.00437, 2019.
  • Smith, L., Giorgi, S., Solanki, R., Eichstaedt, J., Schwartz, H. A., Abdul-Mageed, M., Buffone, A., and Ungar, L. Does ‘well-being’ translate on Twitter? In Proceedings of EMNLP 2016, pp. 2042–2047, 2016.
  • Snyder, B., Naseem, T., and Barzilay, R. Unsupervised multilingual grammar induction. In Proceedings of ACL 2009, pp. 73–81, 2009.
  • Täckström, O., Das, D., Petrov, S., McDonald, R., and Nivre, J. Token and type constraints for cross-lingual part-of-speech tagging. Transactions of the Association for Computational Linguistics, 2013.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Proceedings of NIPS 2017, 2017.
  • Vulić, I., Glavaš, G., Reichart, R., and Korhonen, A. Do we really need fully unsupervised cross-lingual embeddings? In Proceedings of EMNLP 2019, 2019.
  • Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Proceedings of NeurIPS 2019, 2019a.
  • Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR 2019, 2019b.
  • Williams, A., Nangia, N., and Bowman, S. R. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT 2018, 2018.
  • Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. HuggingFace’s Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
  • Wu, S. and Dredze, M. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of EMNLP 2019, 2019.
  • Yang, Y., Zhang, Y., Tar, C., and Baldridge, J. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of EMNLP-IJCNLP 2019, pp. 3685–3690, 2019. URL https://www.aclweb.org/anthology/D19-1382.
  • Zhang, M., Liu, Y., Luan, H., and Sun, M. Earth mover’s distance minimization for unsupervised bilingual lexicon induction. In Proceedings of EMNLP 2017, pp. 1934–1945, 2017.
  • Zhang, Y., Baldridge, J., and He, L. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of NAACL 2019, pp. 1298–1308, 2019.
  • Zweigenbaum, P., Sharoff, S., and Rapp, R. Overview of the second BUCC shared task: Spotting parallel sentences in comparable corpora. In Proceedings of the 10th Workshop on Building and Using Comparable Corpora, pp. 60–67, 2017.
  • Zweigenbaum, P., Sharoff, S., and Rapp, R. Overview of the third BUCC shared task: Spotting parallel sentences in comparable corpora. In Proceedings of the 11th Workshop on Building and Using Comparable Corpora, pp. 39–42, 2018.