Language Identification on Massive Datasets of Short Messages using an Attention Mechanism CNN

2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2020

Cited by 7 | Viewed 12
Abstract
Language Identification (LID) is a challenging task, especially when the input texts are short and noisy, such as microblog posts on social media or chat logs on gaming forums. The task has been tackled either by designing a feature set for a traditional classifier (e.g. Naive Bayes) or by applying a deep neural network classifier (e.g. bi-directional GRU, encoder-decoder). These methods are usually trained and tested on a private corpus, then used as off-the-shelf packages by other researchers on their own datasets; consequently, the various published results are not directly comparable. In this paper, we first create a new massive labeled dataset based on one year of Twitter data. We use this dataset to test several existing LID systems in order to obtain a set of coherent benchmarks, and we make our dataset publicly available so that others can add to this set of benchmarks. Finally, we propose a shallow but efficient neural LID system: an n-gram regional convolutional neural network enhanced with an attention mechanism. Experimental results show that our architecture can predict tens of thousands of samples per second and surpasses all state-of-the-art systems in accuracy and F1 score, outperforming the popular langid system by 5%.
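The abstract describes an n-gram regional CNN with attention pooling but gives no implementation details. The following is a minimal forward-pass sketch of that general architecture, not the authors' code: all dimensions, the byte-level vocabulary, the additive attention form, and every variable name here are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of an n-gram regional convolution with attention
# pooling for LID. Weights are random; shapes are illustrative.
rng = np.random.default_rng(0)

VOCAB = 128          # byte-level character vocabulary (assumption)
EMB = 16             # character embedding size
FILTERS = 32         # convolution filters per n-gram width
NGRAMS = (2, 3, 4)   # regional n-gram window widths
N_LANGS = 10         # number of target languages

emb = rng.normal(0, 0.1, (VOCAB, EMB))
conv_w = {n: rng.normal(0, 0.1, (n * EMB, FILTERS)) for n in NGRAMS}
attn_w = rng.normal(0, 0.1, (FILTERS * len(NGRAMS),))
out_w = rng.normal(0, 0.1, (FILTERS * len(NGRAMS), N_LANGS))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def forward(text: str) -> np.ndarray:
    """Return a probability distribution over N_LANGS languages."""
    ids = [min(ord(c), VOCAB - 1) for c in text]
    x = emb[ids]                                   # (T, EMB)
    feats = []
    for n in NGRAMS:
        # Slide an n-character window over the embeddings and
        # project it: the "n-gram regional" convolution.
        windows = np.stack([x[i:i + n].reshape(-1)
                            for i in range(len(ids) - n + 1)])
        feats.append(np.maximum(windows @ conv_w[n], 0.0))  # ReLU
    t = min(f.shape[0] for f in feats)
    h = np.concatenate([f[:t] for f in feats], axis=1)      # (t, F*|N|)
    # Attention pooling over positions instead of max-pooling.
    scores = softmax(h @ attn_w)                   # (t,)
    pooled = scores @ h                            # (F*|N|,)
    return softmax(pooled @ out_w)                 # (N_LANGS,)

probs = forward("which language is this?")
print(probs.shape)   # (10,), a distribution summing to 1
```

A trained version would learn `emb`, `conv_w`, `attn_w`, and `out_w` by cross-entropy over labeled tweets; the attention weights `scores` indicate which character positions drove the prediction, which is one reason the paper's architecture can stay shallow.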
Keywords
LID, NN, Data mining, corpus, AI