
Short Text Similarity with Word Embeddings

ACM International Conference on Information and Knowledge Management, (2015)

Cited by: 380

Abstract

Determining semantic similarity between texts is important in many tasks in information retrieval such as search, query suggestion, automatic summarization and image finding. Many approaches have been suggested, based on lexical matching, handcrafted patterns, syntactic parse trees, external sources of structured semantic knowledge and distributional semantics. […]


Introduction
  • Why short text similarity? How do traditional approaches fail? Word embeddings.
  • Lexical matching: longest common substring, edit distance, lexical overlap.
  • Linguistic analysis: parse trees following grammatical features. Not all texts are necessarily parseable (e.g., tweets), and high-quality parses are usually expensive to compute at run time.
  • Structured semantic knowledge: WordNet, Wikipedia. Not available for all languages or for domain-specific terms.
  • Word embeddings: distributional-similarity-based representations that build a dense vector for each word type, chosen so that it is good at predicting other words appearing in its context. Example: "hotel" = [0.286 0.792 -0.177 -0.107 0.109 -0.542 0.349 0.271], "motel" = [0.280 0.772 -0.171 -0.107 0.109 -0.542 0.349 0.271] (a small sketch of the resulting word-level similarity follows this list).
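
The word-level similarity behind such representations is typically measured with cosine similarity. Below is a minimal sketch, not from the paper, using the illustrative 8-dimensional "hotel"/"motel" vectors quoted above; the embeddings actually used are 300- to 400-dimensional.

    # Cosine similarity between the illustrative example vectors above.
    import numpy as np

    def cosine(u: np.ndarray, v: np.ndarray) -> float:
        """Cosine similarity between two embedding vectors."""
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    hotel = np.array([0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271])
    motel = np.array([0.280, 0.772, -0.171, -0.107, 0.109, -0.542, 0.349, 0.271])

    print(round(cosine(hotel, motel), 4))  # close to 1.0: near-identical vectors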
Highlights
  • Why short text similarity? How do traditional approaches fail? Word embeddings.
  • We cannot go directly from word-level to text-level similarity; text structure should be taken into account.
  • Word embeddings: distributional-similarity-based representations that build a dense vector for each word type, chosen so that it is good at predicting other words appearing in its context (e.g., "hotel" and "motel" receive near-identical vectors, as in the Introduction example).
  • Auxiliary word embeddings trained on INEX (1.2 billion tokens) with either Word2vec or Global Vectors for Word Representation (GloVe); a learning algorithm optimizes the parameter setting for predicting short text similarity.
  • At low levels of lexical overlap, the algorithm shows the benefit of semantic matching over lexical matching.
  • Advantages: word-embedding-based unsupervised learning; a substitute for methods based on external semantic knowledge; crucial applications in search and query suggestion.
Methods
  • From word-level to text-level semantics: a saliency-weighted semantic similarity feature and a learning algorithm.
  • Saliency-weighted semantic similarity is derived from the BM25 retrieval score, $r(q, d) = \sum_{w \in q \cap d} IDF(w) \cdot \frac{tf(w,d) \cdot (k_1+1)}{tf(w,d) + k_1 \cdot (1 - b + b \cdot |d| / avgdl)}$.
  • The saliency-weighted feature replaces term frequency with a semantic match score: $f_{sts}(s_l, s_s) = \sum_{w \in s_l} IDF(w) \cdot \frac{sem(w, s_s) \cdot (k_1+1)}{sem(w, s_s) + k_1 \cdot (1 - b + b \cdot |s_s| / avgsl)}$, where $sem(w, s) = \max_{w' \in s} f_{sem}(w, w')$ returns the semantic match score from the word embeddings. Common words have a smaller $IDF(w)$ than rare words (see the sketch after this list).
  • Auxiliary word embeddings trained on INEX (1.2 billion tokens) with either Word2vec or GloVe; a learning algorithm optimizes the parameter setting for predicting short text similarity.
  • Pre-trained out-of-the-box word embeddings: 300-dimensional Word2vec by Mikolov et al., 400-dimensional Word2vec by Baroni et al., 300-dimensional GloVe trained on an 840-billion-token corpus, and 300-dimensional GloVe trained on a 42-billion-token corpus.
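
The sketch below illustrates the saliency-weighted feature described above by plugging a maximum-cosine semantic match score into the BM25-style weighting. The embedding and IDF lookups and the values k1 = 1.2, b = 0.75, avgsl = 10 are assumptions for illustration, not the paper's exact configuration.

    # Sketch of a saliency-weighted semantic similarity feature (BM25-style
    # weighting with term frequency replaced by a semantic match score).
    # Assumed inputs: embeddings: dict word -> np.ndarray, idf: dict word -> float.
    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def sem(w, sentence, embeddings):
        """Best semantic match of word w against any word of the other sentence."""
        if w not in embeddings:
            return 0.0
        scores = [cosine(embeddings[w], embeddings[x]) for x in sentence if x in embeddings]
        return max(scores, default=0.0)

    def f_sts(s_l, s_s, embeddings, idf, k1=1.2, b=0.75, avgsl=10.0):
        """Saliency-weighted semantic similarity of two tokenized short texts."""
        score = 0.0
        for w in s_l:
            m = sem(w, s_s, embeddings)
            denom = m + k1 * (1.0 - b + b * len(s_s) / avgsl)
            score += idf.get(w, 0.0) * (m * (k1 + 1.0)) / denom
        return score

Because of the IDF weighting, a good semantic match on a rare (salient) word contributes far more to the score than a match on a common word.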
Results
  • Abbreviations: OoB = out-of-the-box vectors, aux = auxiliary vectors, w2v = Word2vec, glv = GloVe, unwghtd = unweighted semantic feature, swsn = saliency-weighted semantic feature.

    The best model uses all features and all word embedding models; overall, the method outperforms previous approaches.
Conclusion
  • Outline: summary, experiment analysis, conclusion.
  • Why short text similarity? Example: "The procedure is generally performed in the second or third trimester." vs. "The technique is used during the second and, occasionally, third trimester of pregnancy."
  • Word-level similarity is not enough for tasks such as query-query similarity or query-image-caption similarity; we cannot go directly from word-level to text-level similarity, because text structure should be taken into account.
  • Analysis with a three-bin threshold on similarity level: performance across levels of lexical overlap. At low levels of lexical overlap, the algorithm shows the benefit of semantic matching over lexical matching (a small binning sketch follows this list).
  • Advantages: word-embedding-based unsupervised learning; a substitute for methods based on external semantic knowledge; crucial applications in search and query suggestion.
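
The binning sketch referenced above assigns a sentence pair to a low/medium/high lexical-overlap bin so that performance can be compared across overlap levels. The Jaccard-style overlap measure and the two thresholds are illustrative assumptions, not the paper's exact settings.

    # Bin sentence pairs by lexical overlap (illustrative measure and thresholds).
    def lexical_overlap(s1, s2):
        a, b = set(s1), set(s2)
        return len(a & b) / len(a | b) if a | b else 0.0

    def overlap_bin(s1, s2, low=0.2, high=0.5):
        """Assign a tokenized sentence pair to a low/medium/high overlap bin."""
        o = lexical_overlap(s1, s2)
        if o < low:
            return "low"
        return "medium" if o < high else "high"

    s1 = "the procedure is generally performed in the second or third trimester".split()
    s2 = "the technique is used during the second and occasionally third trimester of pregnancy".split()
    print(overlap_bin(s1, s2))  # prints "medium" for this pair (overlap about 0.29)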
Study subjects and analysis
Separate the word pairs (4 steps):
  1. For each pair of terms (w1, w2) in S1 and S2, compute the cosine similarities.
  2. Build a fully connected, unweighted, bipartite graph.
  3. Compute a maximum bipartite matching.
  4. Separate the matched word pairs into bins of different similarity levels.
Not all terms are equally important (a sketch of this procedure follows below).
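
A sketch of the four-step procedure above, under the assumption that the maximum bipartite matching is solved as a maximum-weight assignment with SciPy; the bin edges are illustrative, not the paper's.

    # Word-alignment features: pairwise cosine similarities, maximum bipartite
    # matching, then a histogram of matched pairs per similarity bin.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def bin_features(s1, s2, embeddings, bin_edges=(0.0, 0.25, 0.5, 0.75, 1.0)):
        w1 = [w for w in s1 if w in embeddings]
        w2 = [w for w in s2 if w in embeddings]
        if not w1 or not w2:
            return np.zeros(len(bin_edges) - 1)
        # Step 1: cosine similarity for every term pair (w1_i, w2_j).
        sim = np.array([[cosine(embeddings[a], embeddings[c]) for c in w2] for a in w1])
        # Steps 2-3: maximum-weight matching on the complete bipartite graph.
        rows, cols = linear_sum_assignment(sim, maximize=True)
        # Step 4: count matched word pairs per similarity-level bin
        # (negative similarities fall outside these illustrative bins).
        counts, _ = np.histogram(sim[rows, cols], bins=bin_edges)
        return counts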

Sentence pairs: 5,801
  • Dataset: Microsoft Research Paraphrase (MSR) Corpus, 5,801 sentence pairs annotated with binary labels, divided into a training set of 4,076 pairs and a test set of 1,725 pairs.
  • Out-of-vocabulary words are ignored during training and mapped to random vectors at run time (see the sketch below).
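
A minimal sketch of the out-of-vocabulary handling described above (ignore OOV words during training, map them to random vectors at run time); the class name, vector dimension, scale, and caching are illustrative assumptions.

    # OOV handling: drop OOV words in training, map each OOV word to a fixed
    # random vector at run time.
    import numpy as np

    class EmbeddingLookup:
        def __init__(self, embeddings, dim=300, seed=0):
            self.embeddings = embeddings          # dict: word -> np.ndarray
            self.dim = dim
            self.rng = np.random.default_rng(seed)
            self.oov = {}                         # cache of random OOV vectors

        def train_tokens(self, tokens):
            """Training time: simply ignore words without an embedding."""
            return [w for w in tokens if w in self.embeddings]

        def vector(self, word):
            """Run time: return the embedding, or a cached random vector for OOV words."""
            if word in self.embeddings:
                return self.embeddings[word]
            if word not in self.oov:
                self.oov[word] = self.rng.normal(scale=0.1, size=self.dim)
            return self.oov[word]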
