Learning Chinese Word Embeddings from Stroke, Structure and Pinyin of Characters

Proceedings of the 28th ACM International Conference on Information and Knowledge Management (2019)

Abstract
Chinese word embeddings have recently attracted much attention in natural language processing (NLP). Existing research learns Chinese word embeddings from characters, radicals, components, and stroke n-grams. Beyond these features, Chinese characters also carry structure and pinyin information. In this paper, we design the feature substring, a superset of radicals, components, and stroke n-grams augmented with structure and pinyin information, to integrate the stroke, structure, and pinyin features of Chinese characters and capture the semantics of Chinese words. Based on the feature substring, we propose a novel method, ssp2vec, which learns Chinese word embeddings by predicting contextual words from the feature substrings of the target words. The method rests on our observation that exploiting morphological information (stroke and structure) and phonetic information (pinyin) is crucial for capturing the meanings of Chinese words. In addition, the phonetic information (pinyin) helps the model distinguish Chinese words. Experimental results on word analogy, word similarity, text classification, and named entity recognition tasks show that the proposed method obtains better results than state-of-the-art approaches.
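To make the idea of a feature substring more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that feature substrings can be approximated as contiguous n-grams over a per-word sequence of feature tokens. The function name `feature_substrings`, the `min_n`/`max_n` parameters, and the token encoding (an `S:` structure tag, placeholder stroke codes `k1`..`k3`, and a `P:` pinyin tag) are all hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def feature_substrings(
    tokens: Sequence[str], min_n: int = 3, max_n: int = 5
) -> List[Tuple[str, ...]]:
    """Return every contiguous n-gram of `tokens` with min_n <= n <= max_n.

    Assumption: this n-gram window stands in for the paper's feature
    substring; the real construction may differ.
    """
    grams: List[Tuple[str, ...]] = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token sequence.
        for start in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[start:start + n]))
    return grams


# Hypothetical feature sequence for one word: a structure tag, three
# placeholder stroke codes, and a pinyin tag (illustrative values only).
tokens = ["S:left-right", "k1", "k2", "k3", "P:mu4"]
grams = feature_substrings(tokens, min_n=2, max_n=3)
```

In a model of the kind the abstract describes, each such substring would presumably receive its own vector, and those vectors would be used to predict the context words of the target word.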
Keywords
Chinese word embeddings, feature substring, morphological information, phonetic information