Distributed Representations of Words and Phrases and their Compositionality.

Advances in Neural Information Processing Systems (NIPS 2013): 3111–3119

Abstract

The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.

Introduction
  • Distributed representations of words in a vector space help learning algorithms to achieve better performance in natural language processing tasks by grouping similar words.
  • Unlike most previously used neural network architectures for learning word vectors, training of the Skip-gram model does not involve dense matrix multiplications.
  • This makes the training extremely efficient: an optimized single-machine implementation can train on more than 100 billion words in one day.
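For reference, the training objective of the Skip-gram model mentioned above is usually written as the average log probability of the context words around each center word (notation: c is the context window size, v_w and v'_w are the input and output vector representations of word w, W is the vocabulary size):

    \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t),
    \qquad
    p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}

Each update touches only a handful of word vectors (dot products rather than dense matrix multiplications), and the expensive normalisation over the whole vocabulary is avoided by the hierarchical softmax or negative sampling discussed below.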
Highlights
  • Distributed representations of words in a vector space help learning algorithms to achieve better performance in natural language processing tasks by grouping similar words
  • We present a simplified variant of Noise Contrastive Estimation (NCE) [4] for training the Skip-gram model that results in faster training and better vector representations for frequent words, compared to the more complex hierarchical softmax used in prior work [8] (the resulting negative-sampling objective is written out after this list)
  • The task consists of analogies such as “Germany” : “Berlin” :: “France” : ?, which are solved by finding a vector x such that vec(x) is closest to vec(“Berlin”) - vec(“Germany”) + vec(“France”) according to the cosine distance
  • The performance of various Skip-gram models on the word analogy test set is reported in Table 1
  • We show how to train distributed representations of words and phrases with the Skip-gram model and demonstrate that these representations exhibit linear structure that makes precise analogical reasoning possible
  • We found that the subsampling of the frequent words results in both faster training and significantly better representations of uncommon words
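The negative-sampling objective referred to in the highlights (the simplified NCE variant) replaces the full softmax: for each observed pair (w_I, w_O) and k noise words, the term being maximised is commonly written as

    \log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)
    + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right],
    \qquad \sigma(x) = \frac{1}{1 + e^{-x}},

where P_n(w) is the noise distribution from which the k negatives are drawn; the paper reports that the unigram distribution raised to the 3/4 power works well in practice.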
Methods
  • Each word w_i in the training set is discarded with probability P(w_i) = 1 - sqrt(t / f(w_i)), where f(w_i) is the frequency of word w_i and t is a chosen threshold, typically around 10^-5 (a minimal sketch of this rule follows this list).
  • The authors chose this subsampling formula because it aggressively subsamples words whose frequency is greater than t while preserving the ranking of the frequencies.
  • Although the formula was chosen heuristically, the authors found it to work well in practice.
  • It accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words, as shown in the results.
  • Example closest entities returned by the HS model with 10^-5 subsampling for the short-phrase queries in Table 4: Italian explorer, Aral Sea, moonwalker, Ionian Islands, Garry Kasparov.
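A minimal sketch of the subsampling rule above, using hypothetical names (corpus is a flat list of tokens); note that the paper's threshold t ≈ 10^-5 is tuned for corpora of billions of words, so a larger t is used in the toy example:

    import random
    from collections import Counter

    def subsample(corpus, t=1e-5, seed=0):
        """Discard each occurrence of word w with probability
        P(w) = 1 - sqrt(t / f(w)), where f(w) is the word's relative
        frequency in the corpus; frequent words are dropped aggressively."""
        rng = random.Random(seed)
        counts = Counter(corpus)
        total = len(corpus)
        kept = []
        for w in corpus:
            f = counts[w] / total
            p_discard = max(0.0, 1.0 - (t / f) ** 0.5)  # 0 for words rarer than t
            if rng.random() >= p_discard:
                kept.append(w)
        return kept

    # Toy usage: "the" dominates the corpus and is heavily discarded.
    corpus = ["the"] * 900 + ["quick", "brown", "fox"] * 30 + ["zygote"] * 10
    print(len(subsample(corpus, t=1e-2)), "of", len(corpus), "tokens kept")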
Results
  • The authors evaluate the Hierarchical Softmax (HS), Noise Contrastive Estimation, Negative Sampling, and subsampling of the training words.
  • The task consists of analogies such as “Germany” : “Berlin” :: “France” : ?, which are solved by finding a vector x such that vec(x) is closest to vec(“Berlin”) - vec(“Germany”) + vec(“France”) according to the cosine distance (a minimal sketch of this lookup follows this list).
  • The authors show that Negative Sampling outperforms the Hierarchical Softmax on the analogical reasoning task, and has even slightly better performance than the Noise Contrastive Estimation.
  • The subsampling of the frequent words improves the training speed several times and makes the word representations significantly more accurate
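A minimal sketch of the analogy lookup described above, assuming a hypothetical embeddings dict that maps each word to a NumPy vector (e.g. loaded from pre-trained Skip-gram vectors); this illustrates the cosine-distance procedure, not the authors' implementation:

    import numpy as np

    def solve_analogy(embeddings, a, b, c, topn=1):
        """Return the word(s) x whose vector is closest, by cosine similarity,
        to vec(b) - vec(a) + vec(c); the three query words are excluded."""
        target = embeddings[b] - embeddings[a] + embeddings[c]
        target = target / np.linalg.norm(target)
        scores = []
        for word, vec in embeddings.items():
            if word in (a, b, c):
                continue
            sim = float(np.dot(target, vec / np.linalg.norm(vec)))
            scores.append((sim, word))
        scores.sort(reverse=True)
        return [w for _, w in scores[:topn]]

    # With good Skip-gram vectors loaded into embeddings,
    # solve_analogy(embeddings, "Germany", "Berlin", "France") should return ["Paris"].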
Conclusion
  • The authors successfully trained models on several orders of magnitude more data than the previously published models, thanks to the computationally efficient model architecture
  • This results in a great improvement in the quality of the learned word and phrase representations, especially for the rare entities.
  • The authors found that the subsampling of the frequent words results in both faster training and significantly better representations of uncommon words
  • Another contribution of the paper is the negative sampling algorithm, an extremely simple training method that learns accurate representations, especially for frequent words (a minimal sketch of one update step follows)
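A minimal sketch of a single negative-sampling update for one (center, context) pair, under assumed names (W_in and W_out are the input and output embedding matrices, neg_sampling_step is hypothetical); the reference word2vec implementation is written in C and adds subsampling, the unigram^(3/4) noise distribution, a learning-rate schedule, and many other optimizations:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def neg_sampling_step(W_in, W_out, center, context, negatives, lr=0.025):
        """One SGD step of Skip-gram with negative sampling.
        center, context: word indices for a positive pair;
        negatives: k word indices drawn from the noise distribution."""
        v = W_in[center]                     # input vector of the center word
        grad_v = np.zeros_like(v)
        for idx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
            u = W_out[idx]
            g = lr * (label - sigmoid(np.dot(v, u)))  # gradient of the log-sigmoid term
            grad_v += g * u
            W_out[idx] += g * v              # update the output (context/noise) vector
        W_in[center] += grad_v               # update the input vector once at the end
        return W_in, W_out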
Tables
  • Table 1: Accuracy of various Skip-gram 300-dimensional models on the analogical reasoning task as defined in [8]. NEG-k stands for Negative Sampling with k negative samples for each positive sample; NCE stands for Noise Contrastive Estimation and HS-Huffman stands for the Hierarchical Softmax with the frequency-based Huffman codes.
  • Table 2: Examples of the analogical reasoning task for phrases (the full test set has 3218 examples). The goal is to compute the fourth phrase using the first three. Our best model achieved an accuracy of 72% on this dataset.
  • Table 3: Accuracies of the Skip-gram models on the phrase analogy dataset. The models were trained on approximately one billion words from the news dataset.
  • Table 4: Examples of the closest entities to the given short phrases, using two different models.
  • Table 5: Vector compositionality using element-wise addition. Four closest tokens to the sum of two vectors are shown, using the best Skip-gram model.
  • Table 6: Examples of the closest tokens given various well known models and the Skip-gram model trained on phrases using over 30 billion training words. An empty cell means that the word was not in the vocabulary.
References
  • Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155, 2003.
  • Ronan Collobert and Jason Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.
  • Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, 513–520, 2011.
  • Michael U Gutmann and Aapo Hyvarinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13:307–361, 2012.
  • Tomas Mikolov, Stefan Kombrink, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. Extensions of recurrent neural network language model. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5528–5531. IEEE, 2011.
  • Tomas Mikolov, Anoop Deoras, Daniel Povey, Lukas Burget and Jan Cernocky. Strategies for Training Large Scale Neural Network Language Models. In Proc. Automatic Speech Recognition and Understanding, 2011.
  • Tomas Mikolov. Statistical Language Models Based on Neural Networks. PhD thesis, PhD Thesis, Brno University of Technology, 2012.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013.
  • Tomas Mikolov, Wen-tau Yih and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.
  • Andriy Mnih and Geoffrey E Hinton. A scalable hierarchical distributed language model. Advances in neural information processing systems, 21:1081–1088, 2009.
  • Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012.
  • Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Proceedings of the international workshop on artificial intelligence and statistics, pages 246–252, 2005.
  • David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
  • Holger Schwenk. Continuous space language models. Computer Speech and Language, vol. 21, 2007.
  • Richard Socher, Cliff C. Lin, Andrew Y. Ng, and Christopher D. Manning. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 26th International Conference on Machine Learning (ICML), volume 2, 2011.
  • Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic Compositionality Through Recursive Matrix-Vector Spaces. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2012.
  • Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics, 2010.
  • Peter D. Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. In Journal of Artificial Intelligence Research, 37:141-188, 2010.
  • Peter D. Turney. Distributional semantics beyond words: Supervised learning of analogy and paraphrase. In Transactions of the Association for Computational Linguistics (TACL), 353–366, 2013.
  • Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, pages 2764–2770. AAAI Press, 2011.