Combining Embeddings of Input Data for Text Classification

NEURAL PROCESSING LETTERS (2020)

Abstract
Automatic text classification is an essential part of text analysis, and its accuracy can be improved at several levels, such as preprocessing or network architecture. In this paper, we focus on how combining different methods of text encoding affects classification accuracy. To this end, we implemented a multi-input neural network that can encode the input text with several techniques: BERT, a trainable neural embedding layer, GloVe, skip-thoughts and ParagraphVector. The text can be represented at different tokenisation levels: sentence level, word level, byte-pair-encoding level and character level. Experiments were conducted on seven datasets covering languages from different families (English, German, Swedish and Czech), some of which exhibit agglutination and grammatical cases. Two of the seven datasets originate from real commercial scenarios: (1) classifying ingredients into their corresponding classes, using a corpus provided by Northfork; and (2) classifying texts according to the English proficiency level of their writers, using a corpus provided by ProvenWord. The developed architecture achieves improvements with different combinations of text encoding techniques, depending on the characteristics of each dataset. Once the best combination of embeddings at the different levels had been determined, several multi-input neural network architectures were compared. The results obtained with the best embedding combination and the best architecture outperform state-of-the-art baselines on the datasets used in the experiments.
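As an illustration of the multi-input idea described in the abstract, the following is a minimal Python/Keras sketch of a classifier that combines two of the mentioned encodings: a trainable word-level embedding branch and a precomputed sentence-level vector (e.g., a BERT or averaged-GloVe embedding computed offline). All layer sizes, input names and hyperparameters here are illustrative assumptions, not the paper's actual configuration.

```python
# A minimal sketch (not the authors' exact architecture) of a multi-input
# text classifier: one branch learns word-level embeddings from token ids,
# the other consumes a precomputed sentence embedding; both are concatenated
# before the softmax classifier.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 20000   # assumed vocabulary size
MAX_LEN = 100        # assumed maximum token sequence length
SENT_DIM = 768       # assumed size of the precomputed sentence vector
NUM_CLASSES = 5      # assumed number of target classes

# Branch 1: word-level token ids -> trainable embedding -> BiLSTM summary.
word_ids = layers.Input(shape=(MAX_LEN,), dtype="int32", name="word_ids")
x1 = layers.Embedding(VOCAB_SIZE, 128, mask_zero=True)(word_ids)
x1 = layers.Bidirectional(layers.LSTM(64))(x1)

# Branch 2: precomputed sentence embedding fed in as a dense vector.
sent_vec = layers.Input(shape=(SENT_DIM,), name="sentence_embedding")
x2 = layers.Dense(128, activation="relu")(sent_vec)

# Combine the two encodings and classify.
merged = layers.Concatenate()([x1, x2])
merged = layers.Dropout(0.3)(merged)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(merged)

model = Model(inputs=[word_ids, sent_vec], outputs=outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Further encodings (character-level, byte-pair-encoding-level, etc.) would be added as additional input branches concatenated at the same merge point.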
Keywords
Text classification, Multi-input network, Agglutinative language, Inflected language, Embedding combination