Language Identification
Low Resource Social Media Text Mining (2021)
Abstract
We introduce the language identification problem, a vital component of a multilingual text analysis pipeline. We discuss the document-level and word-level formulations of the task, briefly review supervised solutions, and then present low-supervision methods based on polyglot training that are well suited to low-resource settings. We then discuss code mixing, a linguistic phenomenon common among bilingual and multilingual speakers. Finally, we extend our language identification methods to model code mixing and measure the extent of English-Hindi code mixing in various social media data sets.
Keywords
Language identification, Supervised language identification, Unsupervised language identification, Word language identification, Polyglot document embeddings, Code mixing, Multilinguality