Polyglot Parsing for One Thousand and One Languages (And Then Some)

First Workshop on Typology for Polyglot NLP, Florence, Italy, August 1, 2019

Abstract
Cross-lingual model transfer (Zeman and Resnik, 2008; McDonald et al., 2011) is a commonly used technique for parsing low-resource languages, which relies on the existence of pivot features, such as universal part-of-speech tags or cross-lingual word embeddings. For the technique to succeed, it must also be possible to identify one or more suitable source languages, a task for which language similarity metrics have been exploited (Rosa and Zabokrtsky, 2015). When training parsers on multiple languages, whether for the purpose of model transfer or not, recent studies have also shown that it is beneficial to encode information about language similarity in the form of embeddings, which can be initialized using typological information (Ammar et al., 2016; Smith et al., 2018). In this project, we try to combine these techniques on an unprecedented scale by building a parser for 1266 low-resource languages, using the following resources:

• Treebanks for 27 languages from Universal Dependencies (Nivre et al., 2016).
• Pre-trained word embeddings for a mostly overlapping set of 27 languages from Facebook (Bojanowski et al., 2016), aligned into a multilingual space (Smith et al., 2017).
• A parallel corpus of Bible translations in the high-resource languages and 1266 additional languages (Mayer and Cysouw, 2014).
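The source-language selection step mentioned above can be illustrated with a minimal sketch: represent each language by a typological feature vector and rank candidate source languages by cosine similarity to the target. The language codes and feature values below are illustrative placeholders, not real WALS/URIEL data, and this is not the authors' actual metric.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical binary typological features per language
# (e.g., word order, case marking); values are made up for illustration.
features = {
    "swe": [1, 0, 1, 1, 0],
    "deu": [1, 0, 1, 1, 1],
    "fin": [0, 1, 0, 1, 1],
}

def rank_sources(target_vec, candidates):
    """Return candidate source languages sorted by similarity to the target."""
    return sorted(candidates,
                  key=lambda lang: cosine(target_vec, candidates[lang]),
                  reverse=True)

# Hypothetical low-resource target language's feature vector.
target = [1, 0, 1, 1, 1]
print(rank_sources(target, features))  # → ['deu', 'swe', 'fin']
```

The same similarity scores could also initialize the language embeddings discussed in the abstract, rather than only selecting a single source language.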