On Character Vs Word Embeddings As Input For English Sentence Classification

Intelligent Systems and Applications, Vol. 1 (2019)

Abstract
It has become common practice to use word embeddings, such as those generated by word2vec or GloVe, as inputs for natural language processing tasks. Such embeddings can aid generalisation by capturing statistical regularities in word usage as well as some semantic information. However, they require the construction of large dictionaries of high-dimensional vectors from very large amounts of text, and they have limited ability to handle out-of-vocabulary words or spelling mistakes. Some recent work has demonstrated that text classifiers using character-level input can achieve performance similar to those using word embeddings. Where character input replaces word-level input, it can yield smaller, less computationally intensive models, which helps when models need to be deployed on embedded devices. Character input can also help to address out-of-vocabulary words and spelling mistakes. It is thus of interest to know whether word embeddings can be replaced with character embeddings without harming performance. In this paper, we investigate the use of character embeddings versus word embeddings when classifying short texts such as sentences and questions. We find that the models using character embeddings perform just as well as those using word embeddings whilst being much smaller and taking less time to train. Additionally, we demonstrate that using character embeddings makes the models more robust to spelling errors.
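To make the setup concrete, the following is a minimal sketch (not the authors' implementation) of a character-level LSTM sentence classifier in PyTorch. The character inventory, embedding dimension, hidden size, class count, and the names `CharLSTMClassifier` and `encode` are all illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of a character-level sentence classifier in PyTorch.
# All hyperparameters and names here are illustrative assumptions.
import torch
import torch.nn as nn

CHARS = "abcdefghijklmnopqrstuvwxyz .,?!'"          # assumed character set
char2idx = {c: i + 1 for i, c in enumerate(CHARS)}  # 0 is reserved for padding


class CharLSTMClassifier(nn.Module):
    def __init__(self, n_chars, emb_dim=16, hidden=64, n_classes=6):
        super().__init__()
        # The character embedding table is tiny compared with a word-embedding
        # dictionary, since the vocabulary is just the character set.
        self.emb = nn.Embedding(n_chars + 1, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):              # x: (batch, seq_len) character ids
        e = self.emb(x)                # (batch, seq_len, emb_dim)
        _, (h, _) = self.lstm(e)       # h: (num_layers, batch, hidden)
        return self.out(h[-1])         # class logits


def encode(text, max_len=80):
    """Map a string to a fixed-length tensor of character ids."""
    ids = [char2idx.get(c, 0) for c in text.lower()[:max_len]]
    ids += [0] * (max_len - len(ids))  # right-pad to a fixed length
    return torch.tensor([ids])


model = CharLSTMClassifier(n_chars=len(CHARS))
# A misspelled word ("captial") still yields a valid character sequence,
# so there is no out-of-vocabulary failure as with a word-level lookup.
logits = model(encode("what is the captial of France ?"))
print(logits.shape)  # torch.Size([1, 6])
```

Because every input is decomposed into characters, the embedding table stays small and misspelled or unseen words degrade gracefully rather than collapsing to a single unknown-word token, which is the robustness property the abstract refers to.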
Keywords
Word embeddings, Character embeddings, Long short-term memory networks, Convolutional networks, Text classification, Natural language processing