Language Identification

LOW RESOURCE SOCIAL MEDIA TEXT MINING (2021)

Abstract
We introduce the language identification problem, a vital component of a multilingual text analysis pipeline. We describe the document-level and word-level formulations of the task, briefly review supervised solutions, and then present low-supervision methods based on polyglot training that are highly applicable in low-resource settings. We then discuss code mixing, a linguistic phenomenon common among bilingual and multilingual speakers. Finally, we extend our language identification methods to model code mixing and measure the extent of English-Hindi code mixing in various social media data sets.
Keywords
Language identification, Supervised language identification, Unsupervised language identification, Word language identification, Polyglot document embeddings, Code mixing, Multilinguality