Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding

    WWW, pp. 1271-1279, 2017.

    Cited by: 111
    Keywords:
    knowledge graph embedding, query log, semantic scholar, entity space, ranking model

    Abstract:

    This paper introduces Explicit Semantic Ranking (ESR), a new ranking technique that leverages knowledge graph embedding. Analysis of the query log from our academic search engine, SemanticScholar.org, reveals that a major error source is its inability to understand the meaning of research concepts in queries. To address this challenge, ...

    Introduction
    • The Semantic Scholar (S2) launched in late 2015, with the goal of helping researchers find papers without digging through irrelevant information.
    • The information needs behind such queries sometimes are hard for term-frequency based ranking models to fulfill.
    • The authors' error analysis using user clicks found that word-based ranking models sometimes fail to capture the semantic meaning behind such queries.
    • This constitutes a major error source in S2’s ranking
    Highlights
    • The Semantic Scholar (S2) launched in late 2015, with the goal of helping researchers find papers without digging through irrelevant information
    • In user studies conducted at the Allen Institute, this ranking model provides accuracy at least comparable to other academic search engines
    • This paper introduces Explicit Semantic Ranking (ESR), a new ranking technique to connect query and documents using semantic information from a knowledge graph
    • We developed Explicit Semantic Ranking (ESR), a new technique that utilizes the explicit semantics from a knowledge graph in academic search
    • In Explicit Semantic Ranking, queries and documents are represented in the entity space using their annotations, and the ranking is defined by their semantic relatedness described by their entities’ connections, in an embedding, pooling, and ranking framework
    • This paper presents a new method of using knowledge graphs to improve the ranking of academic search
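The embedding, pooling, and ranking framework can be sketched end to end: cosine similarities between query and document entity embeddings form a translation matrix, max-pooling keeps the best match per query entity, and bin-pooling turns those scores into ranking features. This is a minimal illustration under assumed bin edges and function names, not the authors' implementation:

```python
import math

def cosine(u, v):
    # cosine similarity between two dense entity embeddings
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def esr_features(query_entities, doc_entities, embedding, bins):
    """Translation matrix -> max-pool over document entities -> bin-pool."""
    # first stage: the best-matching document entity for each query entity
    pooled = [
        max(cosine(embedding[q], embedding[d]) for d in doc_entities)
        for q in query_entities
    ]
    # second stage: histogram (bin-pooling) of the max-pooled scores;
    # the bin counts become features for a learning-to-rank model
    counts = [0] * (len(bins) - 1)
    for s in pooled:
        for i in range(len(bins) - 1):
            last = i == len(bins) - 2
            if bins[i] <= s < bins[i + 1] or (last and s == bins[-1]):
                counts[i] += 1
    return counts
```

The histogram features would then be fed to a learned ranking model, completing the embedding, pooling, and ranking pipeline the paper describes.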
    Methods
    • Based on which edge type is used to obtain the entity embedding, there are four versions of ESR: ESR-Author, ESR-Context, ESR-Desc, and ESR-Venue.
    • Embeddings for ESR-Author and ESR-Venue are trained on authors and venues that have more than one publication.
    • Description and context embeddings are trained with entities and terms with the minimum frequency of 5.
    • Corpus entities do not have multiple surface forms, so CMNS reduces to exact match.
    • Freebase entities are linked using surface forms collected from Google’s FACC1 annotation [10]
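Because each corpus keyphrase has a single surface form, CMNS linking reduces to longest exact n-gram matching, as the bullets above note. A rough sketch, with illustrative Freebase-style entity IDs and an assumed commonness table:

```python
def link_entities(query, surface_to_entities):
    """Greedy longest-match linking; with one candidate per surface form
    (as for corpus keyphrases) this is exact n-gram match."""
    tokens = query.lower().split()
    linked, i = [], 0
    while i < len(tokens):
        # try the longest n-gram starting at position i first
        for j in range(len(tokens), i, -1):
            surface = " ".join(tokens[i:j])
            if surface in surface_to_entities:
                # CMNS: keep the candidate with the highest commonness P(e|s)
                cands = surface_to_entities[surface]
                linked.append(max(cands, key=cands.get))
                i = j
                break
        else:
            i += 1
    return linked
```

For Freebase targets, `surface_to_entities` would instead hold the FACC1-derived surface forms, each with multiple candidate entities ranked by commonness.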
    Results
    • Five experiments investigated entity linking and document ranking accuracy, as well as the effects of three system components.
    Conclusion
    • Analysis of Semantic Scholar’s query logs revealed that a large percentage of head queries involve research concepts, and that a major source of error was the inability of even a well-tuned bag-of-words system to rank them accurately
    • To address this challenge, the authors developed Explicit Semantic Ranking (ESR), a new technique that utilizes the explicit semantics from a knowledge graph in academic search.
    • In ESR, queries and documents are represented in the entity space using their annotations, and the ranking is defined by their semantic relatedness described by their entities’ connections, in an embedding, pooling, and ranking framework
    Tables
    • Table1: Distribution of relevance labels in Semantic Scholar’s benchmark dataset. S2 shows the number and percentage of query-document pairs from the 100 testing queries that are labeled to the corresponding relevance level. TREC shows the statistics of the relevance labels from TREC Web Track 2009-2012’s 200 queries
    • Table2: Entity linking evaluation results. Entities are linked by CMNS. Corpus shows the results when using automatically extracted keyphrases as the targets. Freebase shows the results when using Freebase entities as the targets. Precision and Recall from lean evaluation and strict evaluation are displayed
    • Table3: Overall accuracy of ESR compared to Semantic Scholar (S2). ESR-Author, ESR-Context, ESR-Desc and ESR-Venue are ESR with entity embeddings trained from the corresponding edges. Relative performances compared with S2 are in percentages. Win/Tie/Loss are the numbers of queries a method improves, does not change, or hurts, compared with S2. Best results in each metric are marked in bold. Statistically significant improvements (p < 0.05) over S2 are marked by †
    • Table4: Performance of different strategies that make use of the knowledge graph in ranking. Raw directly calculates the entity similarities in the original discrete space. Mean uses mean-pooling when generalizing the entity translation matrix to query-document ranking evidence. Max uses max-pooling. Mean&Bin replaces the max-pooling in ESR’s first stage with mean-pooling. Relative performances (percentages), statistically significant differences (†), and Win/Tie/Loss are compared with the ESR version that uses the same edge type and embedding; for example, RawAuthor versus ESR-Author
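Table 4's Max and Mean strategies differ mainly in how the entity translation matrix is summarized into one score per query entity; a minimal sketch (the function name is an assumption):

```python
def pool_translation_matrix(matrix, mode="max"):
    """Summarize a query-entity x doc-entity similarity matrix into one
    score per query entity, as in Table 4's Max vs Mean strategies."""
    if mode == "max":
        # keep only the best-matching document entity per query entity
        return [max(row) for row in matrix]
    if mode == "mean":
        # average over all document entities per query entity
        return [sum(row) / len(row) for row in matrix]
    raise ValueError(mode)
```

Max-pooling rewards a single strong entity match, while mean-pooling dilutes it across all document entities, which is one way to read the gap between these rows in Table 4.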
    Related work
    • Prior research in academic search is more focused on the analysis of the academic graph than on ad-hoc ranking. Microsoft uses its Microsoft Academic Graph to build academic dialog and recommendation systems [20]. Other research on academic graphs includes the extraction and disambiguation of authors, integration of different publication resources [21], and expert finding [2, 7, 29]. The academic graph can also be used to model the importance of papers [23] and to extract new entities [1].

      Soft match is a widely studied topic in information retrieval, mostly in word-based search systems. Translation models treat ranking as translation between query terms and document terms using a translation matrix [3]. Topic modeling techniques have been used to first map the query and document into a latent space, and then match them in it [24]. Word embedding and deep learning techniques have been studied recently. One possibility is to first build query and document representations heuristically from their words' embeddings, and then match them in the embedding space [22]. The DSSM model directly trains a representation model using deep neural networks, which learns distributed representations for the query and document and matches them using the learned representations [13]. A more recent method, DRMM, models query-document relevance with a neural network built upon the word-level translation matrix [11]. The translation matrix is calculated with pretrained word embeddings. The word-level translation scores are summarized by bin-pooling (histograms) and then used by the ranking neural network.
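The translation-model ranking of Berger and Lafferty [3] mentioned above can be sketched as scoring log P(q|d) through a term translation matrix; the smoothing constant and names here are assumptions:

```python
import math

def translation_score(query_terms, doc_terms, trans_prob, eps=1e-6):
    """Rank by log P(q|d) = sum_q log sum_w P(q_term | w) * P(w | d)."""
    # unigram document language model P(w | d)
    p_doc = {}
    for w in doc_terms:
        p_doc[w] = p_doc.get(w, 0.0) + 1.0 / len(doc_terms)
    score = 0.0
    for q in query_terms:
        # translate every document word into the query term
        p_q = sum(trans_prob.get((q, w), 0.0) * p for w, p in p_doc.items())
        score += math.log(p_q + eps)  # eps avoids log(0) for unmatched terms
    return score
```

Even with an empty translation table entry for a term, the smoothed score stays finite, so documents can still be ordered by how well their vocabulary "translates" into the query.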
    Funding
    • This research was supported by National Science Foundation (NSF) grant IIS-1422676 and a gift from the Allen Institute for Artificial Intelligence
    References
    • [1] A. Arnold and W. W. Cohen. Information extraction as link prediction: Using curated citation networks to improve gene detection. In International Conference on Wireless Algorithms, Systems, and Applications, pages 541–550.
    • [2] K. Balog, L. Azzopardi, and M. de Rijke. Formal models for expert finding in enterprise corpora. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), pages 43–50. ACM, 2006.
    • [3] A. Berger and J. Lafferty. Information retrieval as statistical translation. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1999), pages 222–229. ACM, 1999.
    • [4] C. Caragea, F. A. Bulgarov, A. Godea, and S. Das Gollapalli. Citation-enhanced keyphrase extraction from research papers: A supervised approach. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1435–1446. Association for Computational Linguistics, 2014.
    • [5] D. Carmel, M.-W. Chang, E. Gabrilovich, B.-J. P. Hsu, and K. Wang. ERD'14: Entity recognition and disambiguation challenge. In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2014). ACM, 2014.
    • [6] J. Chen, C. Xiong, and J. Callan. An empirical study of learning to rank for entity search. In Proceedings of the 39th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016). ACM, 2016.
    • [7] N. Craswell, A. P. de Vries, and I. Soboroff. Overview of the TREC 2005 enterprise track. In Proceedings of the 14th Text REtrieval Conference (TREC 2005), volume 5, pages 199–205, 2005.
    • [8] J. Dalton, L. Dietz, and J. Allan. Entity query feature expansion using knowledge base links. In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2014), pages 365–374. ACM, 2014.
    • [9] L. Dietz, A. Kotov, and E. Meij. Utilizing knowledge bases in text-centric information retrieval. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR 2016), pages 5–5. ACM, 2016.
    • [10] E. Gabrilovich, M. Ringgaard, and A. Subramanya. FACC1: Freebase annotation of ClueWeb corpora, version 1 (release date 2013-06-26, format version 1, correction level 0), June 2013.
    • [11] J. Guo, Y. Fan, Q. Ai, and W. B. Croft. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM 2016). ACM, 2016.
    • [12] F. Hasibi, K. Balog, and S. E. Bratsberg. Entity linking in queries: Tasks and evaluation. In Proceedings of the Fifth ACM International Conference on the Theory of Information Retrieval (ICTIR 2015), pages 171–180. ACM, 2015.
    • [13] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM 2013), pages 2333–2338. ACM, 2013.
    • [14] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2002), pages 133–142. ACM, 2002.
    • [15] T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.
    • [16] X. Liu and H. Fang. Latent entity space: A novel retrieval approach for entity-bearing queries. Information Retrieval Journal, 18(6):473–503, 2015.
    • [17] X. Liu, P. Yang, and H. Fang. Entity came to rescue - Leveraging entities to minimize risks in web search. In Proceedings of the 23rd Text REtrieval Conference (TREC 2014). NIST, 2014.
    • [18] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS 2013), pages 3111–3119, 2013.
    • [19] H. Raviv, O. Kurland, and D. Carmel. Document retrieval using entity-based language models. In Proceedings of the 39th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016), pages 65–74. ACM, 2016.
    • [20] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. P. Hsu, and K. Wang. An overview of Microsoft Academic Service (MAS) and applications. In Proceedings of the 24th International Conference on World Wide Web (WWW 2015), pages 243–246. ACM, 2015.
    • [21] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. ArnetMiner: Extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008), pages 990–998. ACM, 2008.
    • [22] I. Vulic and M.-F. Moens. Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In Proceedings of the 38th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2015), pages 363–372. ACM, 2015.
    • [23] A. D. Wade, K. Wang, Y. Sun, and A. Gulli. WSDM Cup 2016: Entity ranking challenge. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM 2016), pages 593–594. ACM, 2016.
    • [24] X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), pages 178–185. ACM, 2006.
    • [25] C. Xiong and J. Callan. EsdRank: Connecting query and documents through external semi-structured data. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management (CIKM 2015), pages 951–960. ACM, 2015.
    • [26] C. Xiong and J. Callan. Query expansion with Freebase. In Proceedings of the Fifth ACM International Conference on the Theory of Information Retrieval (ICTIR 2015), pages 111–120. ACM, 2015.
    • [27] C. Xiong, J. Callan, and T.-Y. Liu. Bag-of-entities representation for ranking. In Proceedings of the Sixth ACM International Conference on the Theory of Information Retrieval (ICTIR 2016), pages 181–184. ACM, 2016.
    • [28] Y. Xu, G. J. Jones, and B. Wang. Query dependent pseudo-relevance feedback based on Wikipedia. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009), pages 59–66. ACM, 2009.
    • [29] J. Zhang, J. Tang, and J. Li. Expert finding in a social network. In International Conference on Database Systems for Advanced Applications, pages 1066–1069.