Enhancement of the Word2vec Class-Based Language Modeling by Optimizing the Features Vector Using PCA

2018 IEEE International Conference on Electro/Information Technology (EIT), 2018

Abstract
Neural word embeddings, such as word2vec, produce very large feature vectors. In this paper, we investigate the length of the feature vector, aiming to optimize the quality of the word representations and to speed up the algorithm by reducing the impact of noise. We selected Principal Component Analysis (PCA), which has a proven record in dimensionality reduction, to achieve these objectives. For extrinsic evaluation of the feature vectors we use class-based language modeling, with perplexity (PP) as the metric; K-means clustering assigns words to classes, and the execution time of the clustering is also measured. We conclude that, for given test data, if the training data is from the same domain then a large vector size can increase the precision with which word relations are described. In contrast, if the training data is from a different domain and contains many contexts not expected to occur in the test data, then a small vector size gives a better description and helps reduce the effect of noise on clustering decisions. Two training-data domains were used in this analysis: Modern Standard Arabic (MSA) broadcast news and reports, and Iraqi phone conversations, with test data from the same Iraqi domain. Based on this analysis, with same-domain training and test data the execution time is reduced by 61% while the representation quality is maintained; for training data from a different domain, i.e. MSA, a perplexity reduction of 6.7% is achieved with the execution time reduced by 92%. This demonstrates the importance of carefully choosing the feature-vector size for overall performance.
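To make the pipeline concrete, below is a minimal sketch in Python of the steps the abstract describes: train word2vec, reduce the resulting feature vectors with PCA, and cluster the reduced vectors with K-means to obtain the word classes used by a class-based language model (where, in the standard Brown-style factorization, p(w_i | w_{i-1}) ≈ p(c_i | c_{i-1}) · p(w_i | c_i)). The toy corpus, vector sizes, and cluster count are illustrative assumptions, not the paper's settings; the paper trains on MSA broadcast news and Iraqi phone conversations.

```python
# Minimal sketch of the described pipeline, not the authors' code:
# word2vec -> PCA dimensionality reduction -> K-means word classes.
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Toy corpus (illustrative assumption; the paper uses MSA news and
# Iraqi conversational transcripts).
sentences = [
    ["the", "news", "report", "aired", "today"],
    ["the", "phone", "call", "was", "recorded"],
    ["a", "report", "on", "the", "call"],
]

# 1. Train word2vec with a deliberately large feature-vector size.
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, seed=1)
words = list(model.wv.index_to_key)
vectors = model.wv[words]                    # shape: (vocab_size, 100)

# 2. Reduce the vectors with PCA (target size is a tunable assumption).
reduced = PCA(n_components=10).fit_transform(vectors)   # (vocab_size, 10)

# 3. Cluster the reduced vectors into word classes for the class-based LM.
classes = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(reduced)
word_to_class = dict(zip(words, classes))
print(word_to_class)
```

With the classes in hand, a class-based n-gram model and its perplexity can be estimated from class and within-class word counts; comparing perplexity and clustering time across PCA target sizes mirrors the trade-off the abstract reports.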
Keywords
word embedding, word features' vector size, class-based language model, word2vec parameters, PCA, MSA, Iraqi dialect