Language Identification for User Generated Content in Social Media

Information Systems and Technologies to Support LearningSmart Innovation, Systems and Technologies(2018)

引用 2|浏览1
暂无评分
摘要
In this paper we address the problem of Language Identification (LID) of user generated content in Social Media Communication (SMC). The existent LID solutions are very accurate in standard languages and normal texts. However, for non standard ones (i.e. SMC) this is still unreachable. To help resolve this problem, we present a language independent LID solution for non standard use of language, where we combine linguistic tools (morphology analyzers) and statistical models (language models) in a hybrid approach to identify the standard and non standard languages included in these SMC texts. Our solution treats also the Code Switching phenomenon between standard languages and dialect as well as the normalization of SMC special expressions and dialect, and finally the spelling correction of OOV words.
更多
查看译文
关键词
language identification,user generated content,social media
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要