Machine Normalization

ACM Transactions on Asian and Low-Resource Language Information Processing(2020)

引用 1|浏览4
暂无评分
摘要
User-generated text in social media communication (SMC) is mainly characterized by non-standard form. It may contain code switching (CS) text, a widespread phenomenon in SMC, in addition to noisy elements used, especially in written conversations (use of abbreviations, symbols, emoticons) or misspelled words. All of these factors constitute a wall in front of text mining applications. Common text mining tools are dedicated to standard use of standard languages but cannot deal with other forms, especially written text in social media. To overcome these problems, in this work we present our solution for the normalization of non-standard use of standard and non-standard languages (dialects) in SMC text with the use of existent resources and tools. The main processing in our solution consists of CS normalization from multiple to one language by the use of a machine translation--like approach. This processing relies on a linguistic approach of CS, which aims at identifying automatically the translation source and target languages (without human intervention). The remaining processing operations concern the normalization of SMC special expressions and spelling correction of out-of-vocabulary words. To preserve the coded-switched sentence meaning across translation, we adopt a knowledge-based approach for word sense translation disambiguation reinforced with a multi-lingual vertical context. All of these processes are embedded in what we refer to as the machine normalization system. Our solution can be used as a front-end of text mining processing, enabling the analysis of SMC noisy text. The conducted experiments show that our system performs better than considered baselines.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要