GloVe: Global Vectors for Word Representation

EMNLP 2014, pp. 1532–1543


Abstract

Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods.

Introduction
  • Semantic vector space models of language represent each word with a real-valued vector.
  • The analogy “king is to queen as man is to woman” should be encoded in the vector space by the vector equation king − queen = man − woman; this offset form is made explicit right after this list.
  • This evaluation scheme favors models that produce dimensions of meaning, thereby capturing the multi-clustering idea of distributed representations (Bengio, 2009)
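
Rearranged, that vector equation shows how the analogy is answered by vector offsets. The LaTeX notation below (w_x for the vector of word x) is ours, chosen to match the prose:

    w_{\mathrm{king}} - w_{\mathrm{queen}} \approx w_{\mathrm{man}} - w_{\mathrm{woman}}
    \quad\Longleftrightarrow\quad
    w_{\mathrm{woman}} \approx w_{\mathrm{queen}} - w_{\mathrm{king}} + w_{\mathrm{man}}

The right-hand form is exactly the wb − wa + wc query used in the analogy evaluation described under Methods below.
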
Highlights
  • Semantic vector space models of language represent each word with a real-valued vector
  • Mikolov et al (2013c) introduced a new evaluation scheme based on word analogies that probes the finer structure of the word vector space by examining not the scalar distance between word vectors, but rather their various dimensions of difference
  • We present results on the word analogy task in Table 2
  • Corpus size analysis: in Fig. 3, we show performance on the word analogy task for 300-dimensional vectors trained on different corpora
  • Considerable attention has been focused on the question of whether distributional word representations are best learned from count-based methods or from prediction-based methods
  • In this work we argue that the two classes of methods are not dramatically different at a fundamental level since they both probe the underlying co-occurrence statistics of the corpus, but the efficiency with which the count-based methods capture global statistics can be advantageous
Methods
  • Word analogies.
  • The word analogy task consists of questions like, “a is to b as c is to ?” The dataset contains 19,544 such questions, divided into a semantic subset and a syntactic subset.
  • The semantic questions are typically analogies about people or places, like “Athens is to Greece as Berlin is to ?”.
  • The authors answer the question “a is to b as c is to ?” by finding the word d whose representation wd is closest to wb − wa + wc according to the cosine similarity.
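
A minimal sketch of this retrieval step, assuming vectors is a Python dict mapping each vocabulary word to a 1-D NumPy array (the function name and toy usage are ours, not the paper's code):

    import numpy as np

    def solve_analogy(a, b, c, vectors):
        """Answer "a is to b as c is to ?" by returning the word d whose
        vector is closest (by cosine similarity) to w_b - w_a + w_c."""
        query = vectors[b] - vectors[a] + vectors[c]
        query = query / np.linalg.norm(query)          # unit-normalize the query
        best_word, best_sim = None, -np.inf
        for word, vec in vectors.items():
            if word in (a, b, c):                      # exclude the question words themselves
                continue
            sim = float(np.dot(query, vec / np.linalg.norm(vec)))
            if sim > best_sim:
                best_word, best_sim = word, sim
        return best_word

    # Hypothetical usage, e.g. "Athens is to Greece as Berlin is to ?":
    # solve_analogy("athens", "greece", "berlin", vectors)   # ideally returns "germany"

Excluding the three question words from the candidate set follows the standard practice for this benchmark.
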
Results
  • The authors present results on the word analogy task in Table 2.
  • The authors' results using the word2vec tool are somewhat better than most of the previously published results.
  • This is due to a number of factors, including the choice to use negative sampling, the number of negative samples, and the choice of the corpus.
  • The authors note that increasing the corpus size does not guarantee improved results for other models, as can be seen by the decreased performance of the SVD-L model on the larger corpus.
Conclusion
  • Considerable attention has been focused on the question of whether distributional word representations are best learned from count-based methods or from prediction-based methods.
  • The authors construct a model that utilizes this main benefit of count data while simultaneously capturing the meaningful linear substructures prevalent in recent log-bilinear prediction-based methods like word2vec.
  • The result, GloVe, is a new global log-bilinear regression model for the unsupervised learning of word representations that outperforms other models on word analogy, word similarity, and named entity recognition tasks
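
For concreteness, the GloVe objective is a weighted least-squares fit of word-vector dot products to the logarithm of the co-occurrence counts, J = Σij f(Xij)(wi·w̃j + bi + b̃j − log Xij)², with weighting f(x) = (x/x_max)^α for x < x_max and 1 otherwise; the paper uses x_max = 100, α = 3/4 and trains with AdaGrad over the nonzero entries of X. The NumPy sketch below is our own illustration of that loss, not the released implementation:

    import numpy as np

    def weight(x, x_max=100.0, alpha=0.75):
        """GloVe weighting f(x): down-weights rare pairs, caps frequent ones at 1."""
        return (x / x_max) ** alpha if x < x_max else 1.0

    def glove_loss(W, W_ctx, b, b_ctx, cooc):
        """Weighted least-squares loss over the nonzero co-occurrence entries.

        cooc: dict mapping (i, j) index pairs to counts X_ij.
        W, W_ctx: word and context embedding matrices; b, b_ctx: bias vectors.
        """
        loss = 0.0
        for (i, j), x_ij in cooc.items():
            diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(x_ij)
            loss += weight(x_ij) * diff ** 2
        return loss
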
Tables
  • Table1: Co-occurrence probabilities for target words ice and steam with selected context words from a 6 billion token corpus. Only in the ratio does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam; a toy sketch of these ratios follows after this list
  • Table2: Results on the word analogy task, given as percent accuracy. Underlined scores are best within groups of similarly-sized models; bold scores are best overall. HPCA vectors are publicly available; (i)vLBL results are from (Mnih et al, 2013); skip-gram (SG) and CBOW results are from (Mikolov et al, 2013a,b); we trained SG† and CBOW† using the word2vec tool. See text for details and a description of the SVD models
  • Table3: Spearman rank correlation on word similarity tasks. All vectors are 300-dimensional. The CBOW∗ vectors are from the word2vec website and differ in that they contain phrase vectors
  • Table4: F1 score on NER task with 50d vectors. Discrete is the baseline without word vectors. We use publicly-available vectors for HPCA, HSMN, and CW. See text for details
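
To make the ratio intuition behind Table1 concrete, here is a toy computation of P(k | ice) / P(k | steam). The counts below are invented for illustration and are not the corpus statistics reported in the paper:

    # Toy co-occurrence counts: X[target][context]
    cooc = {
        "ice":   {"solid": 190, "gas": 7,   "water": 3000, "fashion": 17},
        "steam": {"solid": 22,  "gas": 780, "water": 2200, "fashion": 18},
    }

    def context_prob(word, k, cooc):
        """P(k | word) = X_{word,k} / sum_j X_{word,j}."""
        total = sum(cooc[word].values())
        return cooc[word][k] / total

    for k in ["solid", "gas", "water", "fashion"]:
        ratio = context_prob("ice", k, cooc) / context_prob("steam", k, cooc)
        print(f"P({k}|ice) / P({k}|steam) = {ratio:.2f}")
    # Large ratios flag ice-specific contexts (solid), small ratios flag
    # steam-specific ones (gas), and ratios near 1 (water, fashion) carry
    # little discriminative information.
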
Related work
  • Matrix Factorization Methods. Matrix factorization methods for generating low-dimensional word representations have roots stretching as far back as LSA. These methods utilize low-rank approximations to decompose large matrices that capture statistical information about a corpus. The particular type of information captured by such matrices varies by application. In LSA, the matrices are of “term-document” type, i.e., the rows correspond to words or terms, and the columns correspond to different documents in the corpus. In contrast, the Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996), for example, utilizes matrices of “term-term” type, i.e., the rows and columns correspond to words and the entries correspond to the number of times a given word occurs in the context of another given word.
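
As a sketch of what a "term-term" matrix looks like in practice, the toy function below counts, for each target word, how often every other word appears within a fixed symmetric window. This is our own simplification: HAL additionally weights these counts by distance within the window.

    from collections import defaultdict

    def term_term_counts(sentences, window=2):
        """Count X[w][c]: how often context word c appears within `window`
        positions of target word w (an unweighted term-term matrix)."""
        counts = defaultdict(lambda: defaultdict(int))
        for tokens in sentences:
            for i, w in enumerate(tokens):
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[w][tokens[j]] += 1
        return counts

    # Toy corpus of two sentences
    X = term_term_counts([["ice", "is", "solid"], ["steam", "is", "gas"]])
    # e.g. X["ice"]["solid"] == 1 and X["steam"]["gas"] == 1
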
Funding
  • Stanford University gratefully acknowledges the support of the Defense Threat Reduction Agency (DTRA) under an Air Force Research Laboratory (AFRL) contract.
Study subjects and analysis
datasets: 3
The CoNLL-2003 English benchmark dataset for NER is a collection of documents from Reuters newswire articles, annotated with four entity types: person, location, organization, and miscellaneous. We train models on the CoNLL-03 training data and test on three datasets: 1) the CoNLL-03 test data, 2) the ACE Phase 2 (2001-02) and ACE-2003 data, and 3) the MUC7 Formal Run test set. We adopt the BIO2 annotation standard, as well as all the preprocessing steps described in (Wang and Manning, 2013).
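
For readers unfamiliar with BIO2: every entity mention begins with a B- tag, its continuation tokens carry I- tags, and all other tokens are O. The tagged sentence below is our own illustration, not an example taken from the paper or the dataset:

    # BIO2-tagged tokens for an illustrative newswire-style sentence.
    tagged = [
        ("United", "B-ORG"), ("Nations", "I-ORG"),   # multi-token organization
        ("official", "O"),
        ("Ekeus", "B-PER"),                          # single-token person
        ("heads", "O"), ("for", "O"),
        ("Baghdad", "B-LOC"),                        # location
        (".", "O"),
    ]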

negative samples: 10
With word2vec, we train the skip-gram (SG†) and continuous bag-of-words (CBOW†) models on the 6 billion token corpus (Wikipedia 2014 + Gigaword 5) with a vocabulary of the top 400,000 most frequent words and a context window size of 10. We used 10 negative samples, which we show in Section 4.6 to be a good choice for this corpus. For the SVD baselines, we generate a truncated matrix Xtrunc which retains the information of how frequently each word occurs with only the top 10,000 most frequent words
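
A rough sketch of how word vectors can be derived from such a truncated count matrix via SVD (our own simplification; the paper's SVD-S and SVD-L baselines apply sqrt and log(1+x) transforms to the counts, respectively):

    import numpy as np
    from scipy.sparse.linalg import svds

    def svd_vectors(X_trunc, dim=300, transform="log"):
        """Word vectors from a truncated word-by-context count matrix.

        X_trunc: (vocab_size x 10000) array of co-occurrence counts with the
        top 10,000 most frequent context words. `dim` must be smaller than
        both matrix dimensions.
        """
        if transform == "log":        # SVD-L-style transform
            M = np.log1p(X_trunc)
        elif transform == "sqrt":     # SVD-S-style transform
            M = np.sqrt(X_trunc)
        else:                         # raw counts
            M = X_trunc.astype(float)
        # Truncated SVD: keep the top `dim` singular directions.
        U, S, Vt = svds(M, k=dim)
        return U * S                  # one row per word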

negative samples: 5
We conclude that the GloVe vectors are useful in downstream NLP tasks, as was first shown for neural vectors in (Turian et al, 2010). We use the same parameters as above, except that here we found 5 negative samples to work slightly better than 10. For reference, the discrete baseline (no word vectors) reaches F1 scores of 91.0 (Dev), 85.4 (Test), 77.4 (ACE), and 73.4 (MUC7).

Reference
  • Tom M. Apostol. 1976. Introduction to Analytic Number Theory.
  • Marco Baroni, Georgiana Dinu, and German Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL.
  • Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning.
  • Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. JMLR, 3:1137–1155.
  • John A. Bullinaria and Joseph P. Levy. 2007. Extracting semantic representations from word cooccurrence statistics: A computational study. Behavior Research Methods, 39(3):510–526.
  • Dan C. Ciresan, Alessandro Giusti, Luca M. Gambardella, and Jurgen Schmidhuber. 2012. Deep neural networks segment neuronal membranes in electron microscopy images. In NIPS, pages 2852–2860.
  • Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of ICML, pages 160–167.
  • Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (Almost) from Scratch. JMLR, 12:2493–2537.
  • Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41.
  • John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12.
  • Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web, pages 406–414. ACM.
  • Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In ACL.
  • Remi Lebret and Ronan Collobert. 2014. Word embeddings through Hellinger PCA. In EACL.
  • Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. In CoNLL-2014.
  • Kevin Lund and Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instrumentation, and Computers, 28:203–208.
  • Minh-Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In CoNLL-2013.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. In ICLR Workshop Papers.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119.
  • Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In HLT-NAACL.
  • George A. Miller and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.
  • Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In NIPS.
  • Douglas L. T. Rohde, Laura M. Gonnerman, and David C. Plaut. 2006. An improved model of semantic similarity based on lexical co-occurrence. Communications of the ACM, 8:627–633.
  • Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.
  • Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34:1–47.
  • Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing With Compositional Vector Grammars. In ACL.
  • Stefanie Tellex, Boris Katz, Jimmy Lin, Aaron Fernandes, and Gregory Marton. 2003. Quantitative evaluation of passage retrieval algorithms for question answering. In Proceedings of the SIGIR Conference on Research and Development in Information Retrieval.
  • Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL-2003.
  • Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of ACL, pages 384–394.
  • Mengqiu Wang and Christopher D. Manning. 2013. Effect of non-linear deep architecture in sequence labeling. In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP).