Yago: a core of semantic knowledge

    WWW, pp. 697-706, 2007.

    Cited by: 2977
    Keywords:
    million fact, is-a hierarchy, million entity, extensible ontology, resulting knowledge base

    Abstract:

    We present YAGO, a light-weight and extensible ontology with high coverage and quality. YAGO builds on entities and relations and currently contains more than 1 million entities and 5 million facts. This includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as HASWONPRIZE). The facts have been automatically...
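The entity-and-relation model sketched in the abstract can be illustrated as a minimal triple store. This is a sketch only: the entity and relation names below are illustrative examples, and the helper function is not part of YAGO's actual interface.

```python
# Minimal sketch of a YAGO-style fact base: each fact is an
# (entity, relation, entity) triple. The identifiers here are
# illustrative, not YAGO's actual ones.
facts = {
    ("AlbertEinstein", "hasWonPrize", "NobelPrize"),
    ("AlbertEinstein", "type", "physicist"),
    ("physicist", "subClassOf", "scientist"),
}

def objects_of(subject, relation, kb):
    """Return every object linked to `subject` via `relation`."""
    return {o for (s, r, o) in kb if s == subject and r == relation}

print(objects_of("AlbertEinstein", "hasWonPrize", facts))
```

Both taxonomic facts (type, subClassOf) and non-taxonomic facts (hasWonPrize) fit the same triple shape, which is what makes the model easy to extend with newly extracted facts.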


    Introduction
    • Many applications in modern information technology utilize ontological background knowledge.
    • The existing applications typically use only a single source of background knowledge
    • They could boost their performance if a huge ontology with knowledge from several sources were available.
    • It would have to be extensible, reusable, and application-independent.
    • If such an ontology were available, it could boost the performance of existing applications and open up the path towards new applications in the Semantic Web era
    Highlights
    • 1.1 Motivation

      Many applications in modern information technology utilize ontological background knowledge
    • Machine translation (e.g. [5]) and word sense disambiguation (e.g. [3]) exploit lexical knowledge, query expansion uses taxonomies (e.g. [15, 11, 27]), document classification based on supervised or semi-supervised learning can be combined with ontologies (e.g. [14]), and [13] demonstrates the utility of background knowledge for question answering and information retrieval
    • Since common sense often does not suffice to judge the correctness of YAGO facts, we presented the judges with a snippet of the corresponding Wikipedia page for each fact
    • It would be pointless to evaluate the portion of YAGO that stems from WordNet, because we can assume human accuracy here
    • We presented YAGO, a light-weight and extensible ontology of high quality and coverage
    • We demonstrated how YAGO can be extended by facts extracted from Web documents through state-of-the-art extraction techniques
    Results
    • Evaluation and Experiments

      5.1 Manual evaluation

      5.1.1 Accuracy

      The authors were interested in the accuracy of YAGO.
    • To evaluate the accuracy of an ontology, its facts have to be compared to some ground truth.
    • The authors' evaluation compared YAGO against the ground truth of Wikipedia.
    • It would be pointless to evaluate the non-heuristic relations in YAGO, such as describes, means, or context.
    • This is why the authors evaluated only those facts that constitute potentially weak points in the ontology.
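Accuracy over a manually judged sample is a proportion estimate, so it is naturally reported with a confidence interval. The sketch below uses the standard Wilson score interval; treating this as the paper's exact procedure is an assumption, and the sample numbers are made up.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion, e.g. the fraction of
    sampled facts that human judges marked as correct.
    z=1.96 corresponds to a 95% confidence level."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(
        p * (1 - p) / total + z * z / (4 * total * total)
    )
    return center - margin, center + margin

# Hypothetical sample: 95 of 100 judged facts were correct.
low, high = wilson_interval(95, 100)
print(low, high)
```

The Wilson interval behaves better than the naive normal approximation when the observed accuracy is close to 1, which is exactly the regime a high-quality ontology sits in.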
    Conclusion
    • The authors presented YAGO, a light-weight and extensible ontology of high quality and coverage.
    • YAGO contains 900,000 entities and 5 million facts – more than any other publicly available formal ontology.
    • The authors demonstrated how YAGO can be extended by facts extracted from Web documents through state-of-the-art extraction techniques.
    • The authors observed that the more facts YAGO contains, the easier it is to extend it by further facts.
    • This positive feedback loop could facilitate the growth of the knowledge base.
    • YAGO will be made available in different export formats, including plain text, XML, RDFS, and SQL database formats.
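The RDFS export mentioned above amounts to serializing each fact as an RDF triple. A hedged sketch in an N-Triples-like syntax follows; the namespace URI is a placeholder, not YAGO's real one.

```python
# Placeholder namespace for illustration only; YAGO's real URIs differ.
BASE = "http://example.org/yago/"

def to_ntriples(facts):
    """Serialize (subject, relation, object) triples, one per line,
    in an N-Triples-like format."""
    return "\n".join(
        f"<{BASE}{s}> <{BASE}{r}> <{BASE}{o}> ." for (s, r, o) in facts
    )

print(to_ntriples([("Max_Planck", "bornIn", "Kiel")]))
```

Keeping the export this close to the internal triple model is what makes multiple target formats (plain text, XML, RDFS, SQL) cheap to support: each is a different serialization of the same triples.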
    Tables
    • Table1: Accuracy of YAGO
    • Table2: Coverage of YAGO (facts)
    • Table3: Coverage of YAGO (entities)
    • Table4: Coverage of other ontologies
    • Table5: Sample facts of YAGO
    • Table6: Sample queries on YAGO
    • Table7: LEILA headquarteredIn facts (candidates abandoned because of an unknown city, an ambiguous city, or an ambiguous company)
    Related work
    • Knowledge representation is an old field in AI and has provided numerous models, from frames and KL-ONE to recent variants of description logics, RDFS, and OWL (see [21] and [23]). Numerous approaches have been proposed to create general-purpose ontologies on top of these representations. One class of approaches focuses on extracting knowledge structures automatically from text corpora. These approaches use information extraction technologies that include pattern matching, natural-language parsing, and statistical learning [25, 9, 4, 1, 22, 19, 8]. These techniques have also been used to extend WordNet with Wikipedia individuals [20]. Another project along these lines is KnowItAll [9], which aims at extracting and compiling instances of unary and binary predicates on a very large scale, e.g., as many soccer players as possible, or almost all company/CEO pairs from the business world. Although these approaches have recently improved the quality of their results considerably, the quality is still significantly below that of a man-made knowledge base. Typical results contain many false positives (e.g., IsA(Aachen Cathedral, City), to give one example from KnowItAll). Furthermore, obtaining a recall above 90 percent in a closed domain typically entails a drastic loss of precision in return. Thus, information-extraction approaches are of little use for applications that need near-perfect ontologies (e.g., for automated reasoning). Furthermore, they typically do not have an explicit (logic-based) knowledge representation model.
    Reference
    • [1] E. Agichtein and L. Gravano. Snowball: Extracting relations from large plain-text collections. In ICDL, 2000.
    • [2] F. Baader and T. Nipkow. Term Rewriting and All That. Cambridge University Press, New York, NY, USA, 1998.
    • [3] R. C. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity disambiguation. In EACL, 2006.
    • [4] M. J. Cafarella, D. Downey, S. Soderland, and O. Etzioni. KnowItNow: Fast, scalable information extraction from the web. In EMNLP, 2005.
    • [5] N. Chatterjee, S. Goyal, and A. Naithani. Resolving pattern ambiguity for English to Hindi machine translation using WordNet. In Workshop on Modern Approaches in Translation Technologies, 2005.
    • [6] S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In ICDE, 2005.
    • [7] W. W. Cohen and S. Sarawagi. Exploiting dictionaries in named entity extraction: Combining semi-Markov extraction processes and data integration methods. In KDD, 2004.
    • [8] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A framework and graphical development environment for robust NLP tools and applications. In ACL, 2002.
    • [9] O. Etzioni, M. J. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In WWW, 2004.
    • [10] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, 1998.
    • [11] J. Graupmann, R. Schenkel, and G. Weikum. The SphereSearch engine for unified ranked retrieval of heterogeneous XML and web documents. In VLDB, 2005.
    • [12] I. Horrocks, O. Kutz, and U. Sattler. The even more irresistible SROIQ. In KR, 2006.
    • [13] W. Hunt, L. Lita, and E. Nyberg. Gazetteers, WordNet, encyclopedias, and the web: Analyzing question answering resources. Technical Report CMU-LTI-04-188, Language Technologies Institute, Carnegie Mellon, 2004.
    • [14] G. Ifrim and G. Weikum. Transductive learning for text classification using explicit knowledge models. In PKDD, 2006.
    • [15] S. Liu, F. Liu, C. Yu, and W. Meng. An effective approach to document retrieval via utilizing WordNet and recognizing phrases. In SIGIR, 2004.
    • [16] C. Matuszek, J. Cabral, M. Witbrock, and J. DeOliveira. An introduction to the syntax and content of Cyc. In AAAI Spring Symposium, 2006.
    • [17] I. Niles and A. Pease. Towards a standard upper ontology. In FOIS, 2001.
    • [18] N. F. Noy, A. Doan, and A. Y. Halevy. Semantic integration. AI Magazine, 26(1):7–10, 2005.
    • [19] P. Pantel and M. Pennacchiotti. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In ACL, 2006.
    • [20] M. Ruiz-Casado, E. Alfonseca, and P. Castells. Automatic extraction of semantic relationships for WordNet by means of pattern learning from Wikipedia. In NLDB, pages 67–79, 2006.
    • [21] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2002.
    • [22] R. Snow, D. Jurafsky, and A. Y. Ng. Semantic taxonomy induction from heterogeneous evidence. In ACL, 2006.
    • [23] S. Staab and R. Studer. Handbook on Ontologies. Springer, 2004.
    • [24] F. Suchanek, G. Kasneci, M. Ramanath, and G. Weikum. NAGA: Uncoiling the web. Research Report MPI-I-2006-5-007, Max-Planck-Institut für Informatik, Germany, 2006.
    • [25] F. M. Suchanek, G. Ifrim, and G. Weikum. Combining linguistic and statistical analysis to extract relations from web documents. In KDD, 2006.
    • [26] F. M. Suchanek, G. Ifrim, and G. Weikum. LEILA: Learning to Extract Information by Linguistic Analysis. In Workshop on Ontology Population at ACL/COLING, 2006.
    • [27] M. Theobald, R. Schenkel, and G. Weikum. TopX and XXL at INEX 2005. In INEX, 2005.
    • [28] W3C. SPARQL, 2005. Retrieved from http://www.w3.org/TR/rdf-sparql-query/.