Multi-lingual Common Semantic Space Construction via Cluster-consistent Word Embedding

    EMNLP, pp. 250-260, 2018.

    Keywords:
    cluster consistent, cross lingual, Linguistic Data Consortium, multilingual correlation, bilingual word embedding

    Abstract:

    We construct a multilingual common semantic space based on distributional semantics, where words from multiple languages are projected into a shared space to enable knowledge and resource transfer across languages. Beyond word alignment, we introduce multiple cluster-level alignments and enforce the word clusters to be consistently distributed across multiple languages.

    Introduction
    • More than 3,000 languages have electronic records; e.g., at least a portion of the Christian Bible has been translated into 2,508 different languages.
    • Previous multilingual embedding methods align the semantic distributions of words from multiple languages within the common semantic space.
    • Though several recent attempts (Artetxe et al., 2017, 2018; Conneau et al., 2017) have shown that it is possible to extract multilingual word embeddings from a pair of potentially unaligned corpora in multiple languages, the authors claim that it is necessary to impose more constraints to preserve linguistic properties and facilitate downstream NLP tasks such as cross-lingual IE and MT.
    • The authors design a new algorithm, called cluster-consistent multilingual word embedding, that extracts multilingual word embedding vectors which preserve the natural clustering structures of words across multiple languages.
    Highlights
    • More than 3,000 languages have electronic records; e.g., at least a portion of the Christian Bible has been translated into 2,508 different languages.
    • Though several recent attempts (Artetxe et al., 2017, 2018; Conneau et al., 2017) have shown that it is possible to extract multilingual word embeddings from a pair of potentially unaligned corpora in multiple languages, we claim that it is necessary to impose more constraints to preserve linguistic properties and facilitate downstream natural language processing tasks such as cross-lingual information extraction and machine translation.
    • We evaluate our approach on monolingual and multilingual QVEC (Tsvetkov et al., 2015) tasks, which measure the quality of word embeddings based on the alignment of the embeddings to linguistic feature vectors extracted from manually crafted linguistic resources, as well as through an extrinsic evaluation on name tagging for low-resource languages.
    • We briefly describe the basic model for learning the common semantic space: correlational neural networks (CorrNets) (Chandar et al., 2016; Rajendran et al., 2015); a minimal sketch of the CorrNet objective appears after this list.
    • In order to ensure the consistency of the neighborhoods within the common semantic space and make the cross-lingual mapping locally smooth, we propose to augment each monolingual word representation with its top-N nearest neighboring words from the original monolingual semantic space.
    • We construct a common semantic space for multiple languages based on a cluster-consistent correlational neural network.
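    For concreteness, the CorrNet objective can be sketched in a few lines. The following is a minimal two-view version in PyTorch; the single linear encoder/decoder per view, the sigmoid activation, and the weight lam are illustrative assumptions, not the paper's exact configuration.

        import torch
        import torch.nn as nn

        class CorrNet(nn.Module):
            """Two-view correlational network in the spirit of Chandar et al. (2016):
            each view is encoded into a shared hidden space, and training jointly
            minimizes reconstruction error and maximizes cross-view correlation."""

            def __init__(self, dim_x=300, dim_y=300, dim_h=128):
                super().__init__()
                self.enc_x, self.enc_y = nn.Linear(dim_x, dim_h), nn.Linear(dim_y, dim_h)
                self.dec_x, self.dec_y = nn.Linear(dim_h, dim_x), nn.Linear(dim_h, dim_y)

            def encode(self, x=None, y=None):
                # Shared hidden code from either view alone or from both together.
                h = 0.0
                if x is not None:
                    h = h + self.enc_x(x)
                if y is not None:
                    h = h + self.enc_y(y)
                return torch.sigmoid(h)

            def loss(self, x, y, lam=0.02):
                hx, hy, hxy = self.encode(x=x), self.encode(y=y), self.encode(x=x, y=y)
                # Reconstruct both views from each single-view code and the joint code.
                recon = sum(((self.dec_x(h) - x) ** 2).sum() + ((self.dec_y(h) - y) ** 2).sum()
                            for h in (hx, hy, hxy))
                # Empirical correlation between the two views' hidden codes.
                cx, cy = hx - hx.mean(0), hy - hy.mean(0)
                corr = (cx * cy).sum() / (cx.pow(2).sum().sqrt() * cy.pow(2).sum().sqrt() + 1e-8)
                return recon - lam * corr  # minimize reconstruction, maximize correlation

    In training, x and y would be paired word vectors (e.g., drawn from a bilingual dictionary); the paper's cluster-level alignments add further terms on top of this basic objective.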
    Methods
    • Previous work (Ammar et al., 2016b; Duong et al., 2017) evaluated multilingual word embeddings on a series of intrinsic and extrinsic evaluation tasks.
    • In order to evaluate the quality of the multilingual embeddings, the authors use the QVEC (Tsvetkov et al., 2015) tasks as the intrinsic evaluation platform; a sketch of how a QVEC score is computed follows this list.
    • For fair comparison with state-of-the-art methods on building multilingual embeddings (Ammar et al., 2016b; Duong et al., 2017), the authors use the same monolingual data and bilingual dictionaries as in their work.
    • The monolingual data for each language is the combination of the Leipzig Corpora Collection and Europarl. The bilingual dictionaries are the same as those used in Ammar et al. (2016b).
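    As a reference point, the QVEC score of Tsvetkov et al. (2015) aligns each embedding dimension to its most correlated linguistic feature dimension and sums the aligned correlations. Below is a minimal NumPy sketch; the function name and toy data are ours.

        import numpy as np

        def qvec_score(emb, ling):
            """QVEC-style score (after Tsvetkov et al., 2015): align each embedding
            dimension to its best-correlated linguistic feature dimension and sum
            the aligned Pearson correlations.

            emb:  (n_words, d) word embedding matrix
            ling: (n_words, p) linguistic feature matrix, rows paired with emb by word
            """
            # Column-standardize both matrices so dot products become correlations.
            std = lambda m: (m - m.mean(axis=0)) / (m.std(axis=0) + 1e-8)
            corr = std(emb).T @ std(ling) / emb.shape[0]  # (d, p) correlation matrix
            return corr.max(axis=1).sum()  # best linguistic dimension per embedding dim

        # Toy usage with random data; real runs use supersense-style feature vectors.
        print(qvec_score(np.random.randn(1000, 50), np.random.randn(1000, 10)))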
    Results
    • Using low-resource language name tagging as a case study for extrinsic evaluation, the approach achieves up to 14.6% absolute F-score gain over the state of the art on cross-lingual direct transfer.
    • Experiments demonstrate that the framework is effective at capturing linguistic properties and significantly outperforms state-of-the-art multilingual embedding learning methods.
    Conclusion
    • The authors construct a common semantic space for multiple languages based on a cluster-consistent correlational neural network.
    • It combines word-level alignment and multi-level cluster alignment, including neighbor-based clusters, character-level compositional word representations, and linguistic-property-based clusters induced from readily available language-universal linguistic knowledge bases; a hedged sketch of one such cluster-alignment term follows this list.
    • The authors will further extend the approach to multi-lingual multimedia common semantic space construction.
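    This summary does not reproduce the exact training objective, so the following is only an illustrative sketch of what a cluster-level alignment term could look like: it pulls together the centroids of aligned clusters after projection into the shared space. The function name and the squared-distance penalty are our assumptions, not the paper's formulation.

        import torch

        def cluster_alignment_loss(proj_a, proj_b, clusters):
            """Illustrative cluster-consistency term (our assumption, not the paper's
            exact loss): for each aligned cluster, pull the centroids of the two
            languages' projected word vectors together in the shared space.

            proj_a, proj_b: projected embedding matrices, one row per word
            clusters: list of (rows_in_a, rows_in_b) index pairs, one per aligned
                      cluster (neighbor-, character-, or linguistic-property-based)
            """
            loss = 0.0
            for rows_a, rows_b in clusters:
                centroid_a = proj_a[rows_a].mean(dim=0)
                centroid_b = proj_b[rows_b].mean(dim=0)
                loss = loss + ((centroid_a - centroid_b) ** 2).sum()
            return loss / max(len(clusters), 1)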
    Tables
    • Table 1: Examples of closed word classes and linguistic-property-based clusters for English. The table shows examples of the word clusters we automatically extracted from CLDR and Wiktionary for English. The second type of word cluster is generated from morphological information, including affixes that indicate various linguistic properties; these properties tend to be consistent across many languages. For example, "-like" is a suffix denoting "similar to" in English, while in Danish "-agtig" performs the same function. Wiktionary and PanLex include the affix alignments between English and any other language. We filtered out the many-to-many affix alignments and obtained hundreds of alignments between each language and English. For each affix, we derive a set of word pairs (basic word, extended word with affix) by first selecting all word pairs where basic word + affix = extended word, then ranking all word pairs by the cosine similarity of their monolingual word embeddings, and finally selecting the top-ranked 20 word pairs to form the cluster for that affix (a sketch of this procedure appears after this list)
    • Table 2: Hyper-parameters
    • Table 3: QVEC and QVEC-CCA scores. W: word alignment. N: neighbor-based clustering and alignment. C: character-based clustering and alignment. L: linguistic-property-based clustering and alignment
    • Table 4: Results using bilingual lexicons of varying sizes (40,000, 10,000, 2,000, 1,000, 500, 250) and three languages. CorrNet W+N+C+L is the proposed approach with all the cluster types
    • Table 5
    • Table 6: Comparison on Monolingual Embedding Quality: name tagging performance (F-score, %) using monolingual embeddings and multilingual embeddings
    • Table 7: Comparison on Cross-lingual Direct Transfer: name tagging performance (F-score, %) when the tagger was trained on 1-2 source languages and tested on a target language
    • Table 8: Comparison on Cross-lingual Mutual Enhancement: name tagging performance (F-score, %) when the training set for the tagger was enhanced with annotated examples in other languages
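    The affix-cluster procedure described under Table 1 is mechanical enough to sketch. The following assumes suffix-style affixes, a NumPy embedding matrix, and a vocab dict; all names are ours.

        import numpy as np

        def affix_cluster(vocab, emb, affix, top_k=20):
            """Build the word-pair cluster for one affix, following the Table 1
            description: keep pairs where base + affix is also in the vocabulary,
            rank pairs by the cosine similarity of their monolingual embeddings,
            and return the 20 top-ranked pairs. vocab maps word -> row index in emb."""
            def cos(u, v):
                return float(u @ v) / float(np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

            pairs = [(w, w + affix) for w in vocab if (w + affix) in vocab]
            pairs.sort(key=lambda p: cos(emb[vocab[p[0]]], emb[vocab[p[1]]]), reverse=True)
            return pairs[:top_k]

        # Example: the English "-like" cluster, given vocab = {"child": 0, "childlike": 1, ...}
        # and emb as the monolingual embedding matrix: affix_cluster(vocab, emb, "like")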
    Funding
    • This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract #FA8650-17-C-9116, and the U.S. DARPA LORELEI Program #HR0011-15-C-0115.
    References
    • Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016a. Many languages, one parser. arXiv preprint arXiv:1602.01595.
    • Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016b. Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.
    • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of ACL.
    • Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. arXiv preprint arXiv:1805.06297.
    • Marco Baroni, Angeliki Lazaridou, and Georgiana Dinu. 2015. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In Proceedings of ACL.
    • Hailong Cao, Tiejun Zhao, Shu Zhang, and Yao Meng. 2016. A distribution-based model to learn bilingual word embeddings. In Proceedings of COLING.
    • Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh M. Khapra, Balaraman Ravindran, Vikas C. Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In Proceedings of NIPS.
    • Sarath Chandar, Mitesh M. Khapra, Hugo Larochelle, and Balaraman Ravindran. 2016. Correlational neural networks. Neural Computation.
    • Leon Cheung, Thamme Gowda, Ulf Hermjakob, Nelson Liu, Jonathan May, Alexandra Mayn, Nima Pourdamghani, Michael Pust, Kevin Knight, Nikolaos Malandrakis, et al. 2017. ELISA system description for LoReHLT 2017.
    • Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.
    • Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, and Trevor Cohn. 2016. Learning crosslingual word embeddings without bilingual corpora. arXiv preprint arXiv:1606.09403.
    • Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, and Trevor Cohn. 2017. Multilingual training of crosslingual word embeddings. In Proceedings of EMNLP.
    • Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of EACL.
    • Christiane Fellbaum. 1998. WordNet. Wiley Online Library.
    • Xiaocheng Feng, Lifu Huang, Bing Qin, Ying Lin, Heng Ji, and Ting Liu. 2017. Multi-level crosslingual attentive neural architecture for low resource name tagging. Tsinghua Science and Technology, 22(6):633-645.
    • Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. BilBOWA: Fast bilingual distributed representations without word alignments. In Proceedings of ICML.
    • Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2015. Cross-lingual dependency parsing based on distributed representations. In Proceedings of ACL.
    • Karl Moritz Hermann and Phil Blunsom. 2014. Multilingual models for compositional distributed semantics. In Proceedings of ACL.
    • Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
    • Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of AAAI.
    • Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
    • Ang Lu, Weiran Wang, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Deep multilingual correlation for improved word embeddings. In Proceedings of NAACL-HLT.
    • Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of NAACL-HLT.
    • Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.
    • Pranava Swaroop Madhyastha and Cristina España-Bonet. 2017. Learning bilingual projections of embeddings for vocabulary expansion in machine translation. In Proceedings of the 2nd Workshop on Representation Learning for NLP.
    • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
    • Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
    • George A. Miller, Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of HLT.
    • Janarthanan Rajendran, Mitesh M. Khapra, Sarath Chandar, and Balaraman Ravindran. 2015. Bridge correlational neural networks for multilingual multimodal representation learning. arXiv preprint arXiv:1510.03519.
    • Sascha Rothe, Sebastian Ebert, and Hinrich Schütze. 2016. Ultradense word embeddings by orthogonal transformation. arXiv preprint arXiv:1602.07572.
    • Holger Schwenk, Ke Tran, Orhan Firat, and Matthijs Douze. 2017. Learning joint multilingual sentence representations with neural machine translation. arXiv preprint arXiv:1704.04154.
    • Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.
    • Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of NAACL-HLT.
    • Chen-Tse Tsai and Dan Roth. 2016. Cross-lingual wikification using multilingual embeddings. In Proceedings of NAACL-HLT.
    • Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Guillaume Lample, and Chris Dyer. 2015. Evaluation of word vector representations by subspace alignment. In Proceedings of EMNLP.
    • Ivan Vulić and Anna Korhonen. 2016. On the role of seed lexicons in learning bilingual word embeddings. In Proceedings of ACL.
    • Ivan Vulić and Marie-Francine Moens. 2015. Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In Proceedings of ACL.
    • Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of NAACL-HLT.
    • Boliang Zhang, Ying Lin, Xiaoman Pan, Di Lu, Jonathan May, Kevin Knight, and Heng Ji. 2018. ELISA-EDL: A cross-lingual entity extraction, linking and localization system. In Proceedings of NAACL-HLT: Demonstrations, pages 41-45.
    • Boliang Zhang, Xiaoman Pan, Ying Lin, Tongtao Zhang, Kevin Blissett, Samia Kazemi, Spencer Whitehead, Lifu Huang, and Heng Ji. 2017a. RPI BLENDER TAC-KBP2017 13 languages EDL system. In TAC.
    • Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017b. Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of ACL.
    • Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017c. Earth mover's distance minimization for unsupervised bilingual lexicon induction. In Proceedings of EMNLP.
    • Yuan Zhang, David Gaddy, Regina Barzilay, and Tommi S. Jaakkola. 2016. Ten pairs to tag: Multilingual POS tagging via coarse mapping between embeddings. In Proceedings of NAACL-HLT.
    • Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of EMNLP.