Multilingual Word Embedding for Zero-Shot Text Classification

Semantic Scholar (2019)

Abstract
For years, political scientists have been developing tools to analyze text data, motivated both by the richness and wide availability of unstructured text and by the limited availability of accurate structured data. However, while many research questions are comparative and cross-national, we lack methods for analyzing multilingual corpora. Political scientists typically analyze texts from multilingual corpora either separately, within the context of each individual language, or by translating all texts into a common language before performing analysis. In this paper, we develop a Zero-shot Bilingual Classifier (0-BlinC), a novel multitask feed-forward neural network that utilizes cross-lingual information to facilitate text classification in multilingual corpora. Using a parallel bilingual corpus and training data in a single source language, 0-BlinC can perform quasi-sentence-level text classification in a target language for which no training labels are available. We demonstrate our method by measuring policy positions of party manifestos in English, Spanish, Bulgarian, Estonian, Italian, German, and French using labeled text in English only. 0-BlinC is shown to outperform alternative methods that include the use of a machine translation service and pre-trained word vectors.

∗ Post-doctoral Associate, Social Science Division, NYU Abu Dhabi (yaoyao.dai@nyu.edu).
† Assistant Professor, Department of Political Science and Public Administration, University of North Carolina at Charlotte (benjamin.radford@gmail.com).