CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP

IJCAI 2020, pp. 3853–3860.

Keywords:
multi-lingual, target language, sequence labeling, zero-shot cross-lingual, code-switching

Abstract:

Multi-lingual contextualized embeddings, such as multilingual-BERT (mBERT), have shown success in a variety of zero-shot cross-lingual tasks. However, these models are limited by having inconsistent contextualized representations of subwords across different languages. Existing work addresses this issue by bilingual projection and fine-…
Introduction
  • Neural network models for NLP rely on the availability of labeled data for effective training [Yin et al., 2019].
  • mBERT follows the same model architecture and training procedure as BERT [Devlin et al., 2019].
  • It adopts a 12-layer Transformer but, instead of being trained only on monolingual English data, is trained on the Wikipedia pages of 104 languages with a shared WordPiece vocabulary, which allows the model to share embeddings across languages.
  • The authors feed the final hidden states of the input tokens into a softmax layer to classify the tokens.
  • The authors use the hidden state corresponding to the first sub-token as the input for classifying a word (a minimal sketch follows this list).
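The word-level classification described in the last two bullets can be illustrated with a short sketch. This is not the authors' code: it is a minimal illustration assuming the HuggingFace transformers library, PyTorch, and a hypothetical five-label tag set.

```python
# Minimal sketch (not the authors' code) of word-level classification with mBERT,
# using the hidden state of each word's first sub-token, as described above.
# Assumes the HuggingFace `transformers` package and PyTorch; the 5-label tag
# set is a hypothetical placeholder.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
encoder = BertModel.from_pretrained("bert-base-multilingual-cased")
classifier = torch.nn.Linear(encoder.config.hidden_size, 5)  # 5 = hypothetical tag count

words = ["I", "like", "Singapore"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**enc).last_hidden_state  # (1, seq_len, hidden_size)

# Keep only the position of the first sub-token of every word.
first_idx, seen = [], set()
for pos, word_id in enumerate(enc.word_ids(batch_index=0)):
    if word_id is not None and word_id not in seen:
        seen.add(word_id)
        first_idx.append(pos)

word_states = hidden[0, first_idx]               # (num_words, hidden_size)
probs = classifier(word_states).softmax(dim=-1)  # one label distribution per word
print(probs.argmax(dim=-1))                      # predicted label id per word
```

Each word's label is predicted from the hidden state of its first sub-token only; the remaining sub-tokens are ignored at classification time.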
Highlights
  • Neural network models for NLP rely on the availability of labeled data for effective training [Yin et al., 2019]
  • We proposed an augmentation framework to generate multilingual code-switching data to fine-tune mBERT for aligning representations from source and multiple target languages
Methods
  • To verify the effectiveness of the proposed dynamic augmentation mechanism, the authors compare it with a static augmentation method, in which Algorithm 1 is applied once to obtain augmented multilingual code-switched training data that is reused for all batches.
  • The authors find that the dynamic method outperforms the static method on all tasks.
  • The authors attribute this to the fact that the dynamic mechanism generates more varied code-switched multilingual data during batch training, whereas the static method augments the original training data only once.
  • Dynamic sampling allows the model to align word representations across multiple languages more closely (a sketch of the idea follows this list).
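The dynamic augmentation described above can be sketched as follows. This is an illustration of the general recipe rather than the paper's exact Algorithm 1: for every batch, words are randomly chosen and replaced with translations drawn from bilingual dictionaries of randomly selected target languages, so each pass over the data produces different code-switched sentences. The tiny dictionaries and the replacement ratio below are hypothetical placeholders; the actual framework relies on real bilingual dictionaries.

```python
import random

# Hypothetical toy bilingual dictionaries (source word -> target-language word);
# a real setup would use full bilingual dictionaries instead.
BILINGUAL_DICTS = {
    "de": {"i": "ich", "like": "mag", "music": "Musik"},
    "es": {"i": "yo", "like": "gusto", "music": "música"},
}

def code_switch(tokens, ratio=0.5, dicts=BILINGUAL_DICTS):
    """Randomly replace a fraction of words with translations from randomly
    chosen target languages, producing a multilingual code-switched sentence."""
    switched = []
    for tok in tokens:
        lang = random.choice(list(dicts))           # pick a target language
        translation = dicts[lang].get(tok.lower())  # look up a translation
        if translation is not None and random.random() < ratio:
            switched.append(translation)
        else:
            switched.append(tok)
    return switched

# Dynamic augmentation: re-sample inside the training loop, once per batch.
batch = [["I", "like", "music"], ["You", "like", "music"]]
for epoch in range(2):
    augmented = [code_switch(sent) for sent in batch]
    print(epoch, augmented)   # different code-switched mixtures each epoch
```

Re-sampling inside the training loop corresponds to the dynamic setting; calling code_switch once before training and reusing its output for all batches corresponds to the static setting that the dynamic method is compared against.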
Results
  • The authors perform a t-test for all experiments to measure whether the results of the proposed model are significantly better than the baselines (an illustrative sketch follows this list).
  • The authors observe that: 1) mBERT achieves strong performance on all zero-shot cross-lingual tasks, which demonstrates that mBERT is a surprisingly effective cross-lingual model for a wide range of NLP tasks.
  • This is consistent with the observation of Wu and Dredze [2019].
  • Note that the authors did not reproduce the XNLI results of the original paper because the exact best hyper-parameters are not available (as mentioned in several GitHub issues), so they ran the open-source code to obtain the results.
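As an illustration only, a paired t-test over matched scores could look like the sketch below; the exact test setup used by the authors is not specified here, and the score lists are hypothetical placeholders.

```python
from scipy import stats

# Hypothetical per-language accuracies for the baseline and the proposed model.
baseline_scores = [70.1, 62.3, 58.9, 61.4, 55.0]
proposed_scores = [74.8, 67.0, 64.2, 66.1, 60.3]

# Paired t-test: are the proposed model's scores significantly higher?
t_stat, p_value = stats.ttest_rel(proposed_scores, baseline_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 -> significant difference
```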
Conclusion
  • The authors proposed an augmentation framework to generate multilingual code-switching data to fine-tune mBERT for aligning representations from source and multiple target languages.
  • The authors' method is flexible and can be used to fine-tune all base encoder models.
  • Future work includes applying CoSDA-ML to multi-lingual language modeling, so that a more general version of the multi-lingual contextual embedding can be investigated.
Tables
  • Table1: Natural Language Inference experiments
  • Table2: Sentiment classification experiments
  • Table3: Document classification experiments
  • Table4: Dialog State Tracking experiments
  • Table5: Slot filling and Intent detection experiments
Related work
  • Zero-shot Cross-lingual Transfer. The main strands of prior work focused on learning cross-lingual word embeddings. Ruder et al. [2017] surveyed methods [Klementiev et al., 2012; Kocisky et al., 2014; Guo et al., 2016] for learning cross-lingual word embeddings through either joint training or post-training mappings of monolingual embeddings. Xing et al. [2015], Lample et al. [2018], and Chen and Cardie [2018] proposed to take pre-trained monolingual word embeddings of different languages as input and align them into a shared semantic space. Our work follows the recent line of cross-lingual contextualized embedding methods [Huang et al., 2019; Devlin et al., 2019; Wu and Dredze, 2019; Conneau and Lample, 2019; Artetxe et al., 2019], which are trained with masked language modeling or other auxiliary pre-training tasks that encourage representations in the source and target language spaces to move closer, achieving state-of-the-art performance on a variety of zero-shot cross-lingual NLP tasks. We propose a data augmentation framework that dynamically constructs multi-lingual code-switching data for training, which implicitly encourages the model to align similar words in different languages into the same space.
Funding
  • This work was supported by the National Natural Science Foundation of China (NSFC) via grants 61976072, 61632011, and 61772153.
References
  • [Artetxe and Schwenk, 2018] Mikel Artetxe and Holger Schwenk. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. arXiv preprint arXiv:1812.10464, 2018.
  • [Artetxe et al., 2019] Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. On the cross-lingual transferability of monolingual representations. arXiv preprint arXiv:1910.11856, 2019.
  • [Barnes et al., 2018] Jeremy Barnes, Roman Klinger, and Sabine Schulte im Walde. Bilingual sentiment embeddings: Joint projection of sentiment across languages. In Proc. of ACL, pages 2483–2493, Melbourne, Australia, July 2018. Association for Computational Linguistics.
  • [Chen and Cardie, 2018] Xilun Chen and Claire Cardie. Unsupervised multilingual word embeddings. arXiv preprint arXiv:1808.08933, 2018.
  • [Chen et al., 2018] Wenhu Chen, Jianshu Chen, Yu Su, Xin Wang, Dong Yu, Xifeng Yan, and William Yang Wang. XL-NBT: A cross-lingual neural belief tracking framework. In Proc. of EMNLP, October–November 2018.
  • [Conneau and Lample, 2019] Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7057–7067, 2019.
  • [Conneau et al., 2018] Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. XNLI: Evaluating cross-lingual sentence representations. In Proc. of EMNLP, 2018.
  • [Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL, 2019.
  • [Guo et al., 2016] Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. A representation learning framework for multi-source transfer parsing. In Proc. of AAAI, 2016.
  • [Huang et al., 2019] Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and Ming Zhou. Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks. In Proc. of EMNLP, November 2019.
  • [Klementiev et al., 2012] Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. Inducing crosslingual distributed representations of words. In Proc. of COLING, 2012.
  • [Kocisky et al., 2014] Tomas Kocisky, Karl Moritz Hermann, and Phil Blunsom. Learning bilingual word representations by marginalizing alignments. In Proc. of ACL, June 2014.
  • [Lample et al., 2018] Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. Word translation without parallel data. In International Conference on Learning Representations, 2018.
  • [Liu et al., 2019a] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • [Liu et al., 2019b] Zihan Liu, Genta Indra Winata, Zhaojiang Lin, Peng Xu, and Pascale Fung. Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems, 2019.
  • [Mrksic et al., 2017] Nikola Mrksic, Ivan Vulic, Diarmuid O Seaghdha, Ira Leviant, Roi Reichart, Milica Gasic, Anna Korhonen, and Steve Young. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association for Computational Linguistics, 5:309–324, 2017.
  • [Ruder et al., 2017] Sebastian Ruder, Ivan Vulic, and Anders Søgaard. A survey of cross-lingual word embedding models. arXiv preprint arXiv:1706.04902, 2017.
  • [Schuster et al., 2019a] Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. Cross-lingual transfer learning for multilingual task oriented dialog. In Proc. of NAACL, June 2019.
  • [Schuster et al., 2019b] Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proc. of NAACL, June 2019.
  • [Schwenk and Li, 2018] Holger Schwenk and Xian Li. A corpus for multilingual document classification in eight languages. In Proceedings of the 11th Language Resources and Evaluation Conference, May 2018.
  • [Wang et al., 2019] Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, and Ting Liu. Cross-lingual BERT transformation for zero-shot dependency parsing. In Proc. of EMNLP, November 2019.
  • [Wu and Dredze, 2019] Shijie Wu and Mark Dredze. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proc. of EMNLP, 2019.
  • [Xing et al., 2015] Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. Normalized word embedding and orthogonal transform for bilingual word translation. In Proc. of NAACL, 2015.
  • [Yin et al., 2019] Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Dialog state tracking with reinforced data augmentation. arXiv preprint arXiv:1908.07795, 2019.
  • [Yu et al., 2018] Katherine Yu, Haoran Li, and Barlas Oguz. Multilingual seq2seq training with similarity loss for cross-lingual document classification. In Proceedings of The Third Workshop on Representation Learning for NLP, July 2018.
  • [Zhang et al., 2019] Meishan Zhang, Yue Zhang, and Guohong Fu. Cross-lingual dependency parsing using code-mixed treebank. In Proc. of EMNLP, pages 997–1006, Hong Kong, China, November 2019. Association for Computational Linguistics.