Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion

    KDD, pp. 601-610, 2014.

    Keywords: information extraction, knowledge bases, machine learning, probabilistic models, statistical databases

    Abstract:

    Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Microsoft’s Satori, and Google’s Knowledge Graph. To increase the scale even further, we need to explore automatic methods for constructing knowledge bases. Previous approaches have primarily focused on text-based extraction, which can be very noisy. Here we introduce Knowledge Vault, a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories. We employ supervised machine learning methods for fusing these distinct information sources. The Knowledge Vault is substantially bigger than any previously published structured knowledge repository, and features a probabilistic inference system that computes calibrated probabilities of fact correctness. We report the results of multiple studies that explore the relative utility of the various information sources and extraction methods.

    Introduction
    • Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Microsoft’s Satori, and Google’s Knowledge Graph.
    • To increase the scale even further, the authors need to explore automatic methods for constructing knowledge bases.
    • The authors introduce Knowledge Vault, a Web-scale probabilistic knowledge base that combines extractions from Web content with prior knowledge derived from existing knowledge repositories.
    • The Knowledge Vault is substantially bigger than any previously published structured knowledge repository, and features a probabilistic inference system that computes calibrated probabilities of fact correctness.
    • The authors report the results of multiple studies that explore the relative utility of the different information sources and extraction methods.
    • “For nothing can be loved or hated unless it is first known.” (the paper’s epigraph, attributed to Leonardo da Vinci)
    Highlights
    • Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Microsoft’s Satori, and Google’s Knowledge Graph
    • We propose a new way of automatically constructing a Web-scale probabilistic knowledge base, which we call the Knowledge Vault, or KV for short
    • In this paper we described how we built a Web-scale probabilistic knowledge base, which we call Knowledge Vault
    • In contrast to previous work, we fuse together multiple extraction sources with prior knowledge derived from existing knowledge bases
    • The facts in Knowledge Vault have associated probabilities, which we show are well-calibrated, so that we can distinguish what we know with high confidence from what we are uncertain about (a small calibration sketch follows this list)
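    The paper itself does not publish its fusion code; purely as an illustration of one standard approach to producing a single calibrated probability from several extractor scores, here is a minimal sketch that fuses hypothetical per-extractor confidences with logistic regression (a Platt-style calibrated classifier). All data, column meanings, and names below are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-extractor confidence scores for candidate triples
# (columns could be, e.g., text, DOM, table, and annotation extractors);
# y marks whether each candidate triple is in fact true.
X_train = np.array([
    [0.9, 0.8, 0.0, 0.7],
    [0.2, 0.0, 0.1, 0.0],
    [0.7, 0.6, 0.5, 0.0],
    [0.1, 0.3, 0.0, 0.2],
])
y_train = np.array([1, 0, 1, 0])

# Logistic regression maps the raw scores to a fused probability in [0, 1].
# Calibration is then assessed on held-out data by binning predictions and
# comparing each bin's mean predicted probability to its empirical accuracy.
fuser = LogisticRegression().fit(X_train, y_train)

candidate = np.array([[0.8, 0.0, 0.6, 0.9]])
print("P(triple is true) =", fuser.predict_proba(candidate)[0, 1])
```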
    Methods
    • 3.1.1 Text documents (TXT)
    • The authors use relatively standard methods for relation extraction from text, but do so at a much larger scale than previous systems.
    • The authors first run a suite of standard NLP tools over each document.
    • These perform named entity recognition, part of speech tagging, dependency parsing, co-reference resolution, and entity linkage.
    • The in-house named entity linkage system the authors use is similar to the methods described in [18].
    • The features that the authors use are similar to those described in [29] (a distant-supervision sketch follows this list).
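    The extraction systems themselves are not published; the following is only a minimal sketch of the distant-supervision pattern of [29] that the bullets above allude to, with a toy knowledge base and hypothetical helper names.

```python
# Distant-supervision sketch: label a sentence with predicate p whenever its
# two linked entities (s, o) appear as a triple (s, p, o) in the existing KB.
KB = {
    ("BarackObama", "born_in", "Honolulu"),
    ("Honolulu", "located_in", "Hawaii"),
}

def label_entity_pair(entity_pair, kb=KB):
    """Return the KB predicates holding between a sentence's entity pair."""
    s, o = entity_pair
    return [p for (s2, p, o2) in kb if (s2, o2) == (s, o)]

# A sentence such as "Obama was born in Honolulu." would first pass through
# NER, POS tagging, dependency parsing, coreference resolution, and entity
# linkage; linkage resolves the mentions to ("BarackObama", "Honolulu").
print(label_entity_pair(("BarackObama", "Honolulu")))  # ['born_in']
```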
    Results
    • Using the methods to be described in Section 3, the authors extract about 1.6B candidate triples, covering 4469 different types of relations and 1100 different types of entities.
    • To ensure that certain common predicates did not dominate the performance measures, the authors took at most 10k instances of each predicate when creating the test set.
    • The authors pooled the samples from each predicate to get a more balanced test set (a small sketch of this capping-and-pooling step follows this list).
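    A small sketch of the test-set construction just described, under the assumption that triples are stored as plain (subject, predicate, object) tuples; the function and variable names are hypothetical.

```python
import random
from collections import defaultdict

def build_balanced_test_set(triples, cap=10_000, seed=0):
    """Cap each predicate at `cap` sampled instances, then pool the samples,
    so that no single common predicate dominates the performance measures."""
    rng = random.Random(seed)
    by_predicate = defaultdict(list)
    for s, p, o in triples:
        by_predicate[p].append((s, p, o))
    test_set = []
    for instances in by_predicate.values():
        k = min(cap, len(instances))
        test_set.extend(rng.sample(instances, k))
    rng.shuffle(test_set)
    return test_set
```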
    Conclusion
    • Although Knowledge Vault is a large repository of useful knowledge, there are still many ways in which it can be improved.
    • One open issue is handling multiple candidate values for the same (subject, predicate) pair: a simple approach is to collect all candidate values together and force the distribution over them to sum to 1, similar to the notion of an X-tuple in probabilistic databases [40] (see the sketch after this list).
    • However, the authors might have one fact stating that Obama was born in Honolulu and another stating that he was born in Hawaii; these are not mutually exclusive, so the naive approach does not work.
    • The authors hope to continue to scale KV, to store more knowledge about the world, and to use this resource to help downstream applications such as question answering and entity-based search.
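    A minimal sketch of the naive normalization mentioned above, assuming fused scores are stored per candidate value (the names are hypothetical):

```python
def normalize_candidates(scores):
    """Force the scores of competing candidate values to sum to 1
    (the X-tuple view of a functional predicate [40])."""
    total = sum(scores.values())
    return {value: s / total for value, s in scores.items()}

# Candidate birthplaces for (BarackObama, born_in) with fused scores:
print(normalize_candidates({"Honolulu": 0.8, "Chicago": 0.3}))
# {'Honolulu': 0.727..., 'Chicago': 0.272...}
# This step is wrong for {"Honolulu": 0.8, "Hawaii": 0.7}: Honolulu is in
# Hawaii, so those two "candidates" are not mutually exclusive.
```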
    Tables
    • Table 1: Comparison of knowledge bases. KV, DeepDive, NELL, and PROSPERA rely solely on extraction; Freebase and KG rely on human curation and structured sources; YAGO2 uses both strategies. “Confident facts” are facts whose probability of being true is at or above 0.9.
    • Table 2: Performance of different extraction systems.
    • Table 3: Some of the paths learned by PRA for predicting where someone went to college. Rules are sorted by decreasing precision. Column headers: F1 is the harmonic mean of precision and recall, P is the precision, R is the recall, and W is the weight given to this feature by logistic regression.
    • Table 4: Nearest neighbors for some predicates in the 60d embedding space learned by the neural network. Numbers represent squared Euclidean distance. Edu-start and edu-end represent the start and end dates of someone attending a school or college; similarly, job-start and job-end represent the start and end dates of someone holding a particular job (a nearest-neighbor sketch follows this list).
    • Table 5: AUC scores for the fused prior, extractor, and prior+extractor using different labels on the 10k test set.
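    The nearest-neighbor lists in Table 4 can be reproduced mechanically once predicate embeddings are in hand; the sketch below uses random stand-in 60-dimensional vectors, since the learned embeddings themselves are not published.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the learned 60-dimensional predicate embeddings.
embeddings = {p: rng.normal(size=60)
              for p in ["edu-start", "edu-end", "job-start", "job-end"]}

def nearest_neighbors(query, k=3):
    """Rank predicates by squared Euclidean distance to the query embedding."""
    q = embeddings[query]
    dists = {p: float(np.sum((v - q) ** 2))
             for p, v in embeddings.items() if p != query}
    return sorted(dists.items(), key=lambda kv: kv[1])[:k]

print(nearest_neighbors("edu-start"))
```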
    Related work
    • There is a growing body of work on automatic knowledge base construction [44, 1]. This literature can be clustered into four main groups: (1) approaches such as YAGO [39], YAGO2 [19], DBpedia [3], and Freebase [4], which are built on Wikipedia infoboxes and other structured data sources; (2) approaches such as Reverb [12], OLLIE [26], and PRISMATIC [13], which use open information (schema-less) extraction techniques applied to the entire web; (3) approaches such as NELL/ReadTheWeb [8], PROSPERA [30], and DeepDive/Elementary [32], which extract information from the entire web, but use a fixed ontology/schema; and (4) approaches such as Probase [47], which construct taxonomies (is-a hierarchies), as opposed to general KBs with multiple types of predicates.

      The Knowledge Vault is most similar to methods of the third kind, which extract facts, in the form of disambiguated triples, from the entire web. The main difference from this prior work is that we fuse together facts extracted from text with prior knowledge derived from the Freebase graph.

      There is also a large body of work on link prediction in graphs. This can be thought of as creating a joint probability model over a large set of binary random variables, where G(s, p, o) = 1 if and only if there is a link of type p from s to o. The literature can be clustered into three main kinds of methods: (1) methods that directly model the correlation between the variables, using discrete Markov random fields (e.g., [23]) or continuous relaxations thereof (e.g., [34]); (2) methods that use latent variables to model the correlations indirectly, using either discrete factors (e.g., [48]) or continuous factors (e.g., [31, 11, 20, 37]); and (3) methods that approximate the correlation using algorithmic approaches, such as random walks [24].
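      As a concrete toy instance of the latent-factor family, the sketch below scores G(s, p, o) with a bilinear model: one vector per entity, one matrix per predicate, and a sigmoid to map the score to a probability. Dimensions, names, and random parameters are illustrative, not the models used in the cited work.

```python
import numpy as np

d = 8  # latent dimension (illustrative)
rng = np.random.default_rng(1)
entity_vec = {e: rng.normal(size=d) for e in ["BarackObama", "Honolulu"]}
pred_mat = {"born_in": rng.normal(size=(d, d))}

def link_probability(s, p, o):
    """P(G(s, p, o) = 1) under a toy bilinear latent-factor model."""
    score = entity_vec[s] @ pred_mat[p] @ entity_vec[o]
    return 1.0 / (1.0 + np.exp(-score))

# In practice the vectors and matrices are trained so that observed triples
# score above corrupted ones; random parameters just exercise the machinery.
print(link_probability("BarackObama", "born_in", "Honolulu"))
```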
    References
    • [1] AKBC-WEKEX: The Knowledge Extraction Workshop at NAACL-HLT, 2012.
    • [2] G. Angeli and C. Manning. Philosophers are mortal: Inferring the truth of unseen facts. In CoNLL, 2013.
    • [3] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735, 2007.
    • [4] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD, pages 1247–1250. ACM, 2008.
    • [5] A. Bordes, X. Glorot, J. Weston, and Y. Bengio. Joint learning of words and meaning representations for open-text semantic parsing. In AI/Statistics, 2012.
    • [6] M. Cafarella, A. Halevy, Z. D. Wang, E. Wu, and Y. Zhang. WebTables: Exploring the power of tables on the web. VLDB, 1(1):538–549, 2008.
    • [7] M. J. Cafarella, A. Y. Halevy, and J. Madhavan. Structured data on the web. Commun. ACM, 54(2):72–79, 2011.
    • [8] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
    • [9] O. Deshpande, D. Lamba, M. Tourn, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, and A. Doan. Building, maintaining and using knowledge bases: A report from the trenches. In SIGMOD, 2013.
    • [10] X. L. Dong, L. Berti-Equille, and D. Srivastava. Integrating conflicting data: The role of source dependence. In VLDB, 2009.
    • [11] L. Drumond, S. Rendle, and L. Schmidt-Thieme. Predicting RDF triples in incomplete knowledge bases with tensor factorization. In 10th ACM Intl. Symp. on Applied Computing, 2012.
    • [12] A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In EMNLP, 2011.
    • [13] J. Fan, D. Ferrucci, D. Gondek, and A. Kalyanpur. PRISMATIC: Inducing knowledge from a large scale lexicalized relation resource. In First Intl. Workshop on Formalisms and Methodology for Learning by Reading, pages 122–127. ACL, 2010.
    • [14] T. Franz, A. Schultz, S. Sizov, and S. Staab. TripleRank: Ranking semantic web data by tensor decomposition. In ISWC, 2009.
    • [15] L. A. Galarraga, C. Teflioudi, K. Hose, and F. Suchanek. AMIE: Association rule mining under incomplete evidence in ontological knowledge bases. In WWW, pages 413–422, 2013.
    • [16] R. Grishman. Information extraction: Capabilities and challenges. Technical report, NYU Dept. of Computer Science, 2012.
    • [17] R. Gupta, A. Halevy, X. Wang, S. Whang, and F. Wu. Biperpedia: An ontology for search applications. In VLDB, 2014.
    • [18] B. Hachey, W. Radford, J. Nothman, M. Honnibal, and J. Curran. Evaluating entity linking with Wikipedia. Artificial Intelligence, 194:130–150, 2013.
    • [19] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 2012.
    • [20] R. Jenatton, N. Le Roux, A. Bordes, and G. Obozinski. A latent factor model for highly multi-relational data. In NIPS, 2012.
    • [21] H. Ji, T. Cassidy, Q. Li, and S. Tamang. Tackling representation, annotation and classification challenges for temporal knowledge base population. Knowledge and Information Systems, pages 1–36, August 2013.
    • [22] H. Ji and R. Grishman. Knowledge base population: Successful approaches and challenges. In ACL, 2011.
    • [23] S. Jiang, D. Lowd, and D. Dou. Learning to refine an automatically extracted knowledge base using Markov logic. In Intl. Conf. on Data Mining, 2012.
    • [24] N. Lao, T. Mitchell, and W. Cohen. Random walk inference and learning in a large scale knowledge base. In EMNLP, 2011.
    • [25] X. Li and R. Grishman. Confidence estimation for knowledge base population. In Recent Advances in NLP, 2013.
    • [26] Mausam, M. Schmitz, R. Bart, S. Soderland, and O. Etzioni. Open language learning for information extraction. In EMNLP, 2012.
    • [27] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In ICLR, 2013.
    • [28] B. Min, R. Grishman, L. Wan, C. Wang, and D. Gondek. Distant supervision for relation extraction with an incomplete knowledge base. In NAACL, 2013.
    • [29] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In Proc. Conf. Recent Advances in NLP, 2009.
    • [30] N. Nakashole, M. Theobald, and G. Weikum. Scalable knowledge harvesting with high precision and high recall. In WSDM, pages 227–236, 2011.
    • [31] M. Nickel, V. Tresp, and H.-P. Kriegel. Factorizing YAGO: Scalable machine learning for linked data. In WWW, 2012.
    • [32] F. Niu, C. Zhang, and C. Ré. Elementary: Large-scale knowledge-base construction via machine learning and statistical inference. Intl. J. on Semantic Web and Information Systems, 2012.
    • [33] J. Platt. Probabilities for SV machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, 2000.
    • [34] J. Pujara, H. Miao, L. Getoor, and W. Cohen. Knowledge graph identification. In ISWC, 2013.
    • [35] L. Reyzin and R. Schapire. How boosting the margin can also boost classifier complexity. In ICML, 2006.
    • [36] A. Ritter, L. Zettlemoyer, Mausam, and O. Etzioni. Modeling missing data in distant supervision for information extraction. Trans. Assoc. Comp. Linguistics, 1, 2013.
    • [37] R. Socher, D. Chen, C. Manning, and A. Ng. Reasoning with neural tensor networks for knowledge base completion. In NIPS, 2013.
    • [38] R. Speer and C. Havasi. Representing general relational knowledge in ConceptNet 5. In LREC, 2012.
    • [39] F. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In WWW, 2007.
    • [40] D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic Databases. Morgan & Claypool, 2011.
    • [41] B. Suh, G. Convertino, E. H. Chi, and P. Pirolli. The singularity is not near: Slowing growth of Wikipedia. In WikiSym ’09, pages 8:1–8:10, 2009.
    • [42] P. Venetis, A. Halevy, J. Madhavan, M. Pasca, W. Shen, F. Wu, G. Miao, and C. Wu. Recovering semantics of tables on the web. In Proc. of the VLDB Endowment, 2012.
    • [43] D. Z. Wang, E. Michelakis, M. Garofalakis, and J. Hellerstein. BayesStore: Managing large, uncertain data repositories with probabilistic graphical models. In VLDB, 2008.
    • [44] G. Weikum and M. Theobald. From information to knowledge: Harvesting entities and relationships from web sources. In Proc. PODS, pages 65–76. ACM, 2010.
    • [45] M. Wick, S. Singh, A. Kobren, and A. McCallum. Assessing confidence of knowledge base content with an experimental study in entity resolution. In AKBC Workshop, 2013.
    • [46] M. Wick, S. Singh, H. Pandya, and A. McCallum. A joint model for discovering and linking entities. In AKBC Workshop, 2013.
    • [47] W. Wu, H. Li, H. Wang, and K. Q. Zhu. Probase: A probabilistic taxonomy for text understanding. In SIGMOD, pages 481–492. ACM, 2012.
    • [48] Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In UAI, 2006.