Autonomously semantifying wikipedia

International Conference on Information and Knowledge Management, pp.41-50, (2007)


Abstract

Berners-Lee's compelling vision of a Semantic Web is hindered by a chicken-and-egg problem, which can be best solved by a bootstrapping method: creating enough structured data to motivate the development of applications. This paper argues that autonomously "Semantifying Wikipedia" is the best way to solve the problem. We choose Wikipedia as a...

Introduction
  • While compelling in the long term, Berners-Lee’s vision of the Semantic Web [5] is developing slowly.
  • The ideal vision is a system which autonomously extracts information from the Web. Because of the wide range of information categories, supervised machine learning will require too much human effort to scale.
  • Instead, such a system should use unsupervised or self-supervised techniques.
  • Several systems of this form have been proposed, e.g. MULDER [18], AskMSR [7], and KNOWITALL [14], showing some signs of early success.
  • Many of the things published on the Web are incorrect (e.g. “Elvis killed John Kennedy”), and the increasing linguistic sophistication of link spam poses a growing challenge to these methods.
Highlights
  • While compelling in the long term, Berners-Lee’s vision of the Semantic Web [5] is developing slowly.
  • Experiments in Section 4 show that this small adjustment greatly improves the performance of the learned conditional random fields (CRF) extractor.
  • Our objective is the automatic extraction of structured data from natural-language text on Wikipedia and eventually the whole Web; our investigation has uncovered some lessons that directly benefit Wikipedia and similar collaborative knowledge repositories.
  • Meaning lies in the graph structure of concepts defined in terms of each other, and KYLIN helps complete that graph.
  • This paper described KYLIN, a prototype system which autonomously extracts structured data from Wikipedia and regularizes its internal link structure.
  • We propose bootstrapping the Semantic Web by mining Wikipedia and we identify some unique challenges.
Methods
  • Recall that when producing training data for extractor-learning, the preprocessor uses a strict pairing model
  • Since this may cause numerous sentences to be incorrectly labelled as negative examples, KYLIN uses the sentence classifier to relabel some of the training data.
  • KYLIN trains a different CRF extractor for each attribute, rather than training a single master extractor that clips all attributes.
  • The authors chose this architecture largely for simplicity — by keeping each attribute’s extractor independent, the authors ensure that the complexity does not multiply
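
The two ideas in this list, relabeling suspect negatives with a sentence classifier and training one independent extractor per attribute, can be sketched as follows. This is a minimal illustration, not KYLIN's implementation: `KeywordExtractor` is a toy stand-in for a CRF, `toy_sentence_classifier` and `EXAMPLES` are hypothetical, and the choice to drop (rather than flip) a disputed negative is an assumption.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# (sentence, attribute name) for positives; (sentence, None) for negatives.
Example = Tuple[str, Optional[str]]


def relabel_negatives(
    examples: List[Example], classify: Callable[[str], Set[str]]
) -> List[Example]:
    """Exclude negatives that the sentence classifier assigns to an attribute.

    Assumption: a disputed negative is dropped from training, not flipped.
    """
    kept: List[Example] = []
    for sentence, label in examples:
        if label is None and classify(sentence):
            continue  # probably a false negative produced by strict pairing
        kept.append((sentence, label))
    return kept


class KeywordExtractor:
    """Toy stand-in for a CRF: keeps words seen only in positive sentences."""

    def __init__(self) -> None:
        self.keywords: Set[str] = set()

    def fit(self, positives: List[str], negatives: List[str]) -> None:
        pos_words = {w for s in positives for w in s.lower().split()}
        neg_words = {w for s in negatives for w in s.lower().split()}
        self.keywords = pos_words - neg_words

    def matches(self, sentence: str) -> bool:
        return any(w in self.keywords for w in sentence.lower().split())


def train_per_attribute(
    examples: List[Example], attributes: List[str]
) -> Dict[str, KeywordExtractor]:
    """Train one independent extractor per attribute (no shared model)."""
    negatives = [s for s, label in examples if label is None]
    extractors: Dict[str, KeywordExtractor] = {}
    for attribute in attributes:
        positives = [s for s, label in examples if label == attribute]
        extractor = KeywordExtractor()
        extractor.fit(positives, negatives)
        extractors[attribute] = extractor
    return extractors


def toy_sentence_classifier(sentence: str) -> Set[str]:
    """Hypothetical classifier: guesses attributes from a trigger word."""
    lowered = sentence.lower()
    found: Set[str] = set()
    if "seat" in lowered:
        found.add("county_seat")
    if "population" in lowered:
        found.add("population")
    return found


EXAMPLES: List[Example] = [
    ("The county seat is Clearfield", "county_seat"),
    ("Its population was 83,382", "population"),
    ("Clearfield serves as the seat of government", None),  # mislabeled
    ("The river floods in spring", None),
]

if __name__ == "__main__":
    cleaned = relabel_negatives(EXAMPLES, toy_sentence_classifier)
    extractors = train_per_attribute(cleaned, ["county_seat", "population"])
    print(extractors["county_seat"].matches("the seat of government"))
```

In this toy data the mislabeled sentence would otherwise teach the `county_seat` extractor that "seat" is a negative cue; removing it before training is what lets the extractor fire on the query sentence.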
Results
  • In the “U.S. County” class, fewer than 50% of the articles have an infobox.
  • As shown in Section 4, the baseline document classifier achieves very high precision (98.5%) and reasonable recall (68.8%).
  • There is little difference between the heuristic ordering KYLIN uses and the optimal one, and both perform more than 10% better than the worst ordering.
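
To put the document classifier's precision and recall on a single scale, one can combine them into an F1 score. The F1 value below is derived here from the two reported figures; it is not a number stated in the summary above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported figures for the baseline document classifier.
classifier_f1 = f1_score(0.985, 0.688)

if __name__ == "__main__":
    print(f"F1 = {classifier_f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller value, the result sits much closer to the recall than to the precision.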
Conclusion
  • The authors' objective is the automatic extraction of structured data from natural-language text on Wikipedia and eventually the whole Web; the investigation has uncovered some lessons that directly benefit Wikipedia and similar collaborative knowledge repositories.
  • KYLIN does even better.
  • By automatically identifying missing internal links for proper nouns, more semantic tags are added.
  • Because these links resolve noun phrases to unique identifiers, they are useful for many purposes such as information retrieval, structural analysis, and further semantic processing.
  • Meaning lies in the graph structure of concepts defined in terms of each other, and KYLIN helps complete that graph.
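
Resolving a noun phrase to an article identifier by trying heuristics in a fixed order can be sketched as below. The two heuristics (`exact_title`, `case_insensitive_title`), the `TITLES` set, and the first-match-wins policy are illustrative assumptions; the summary does not say which heuristics KYLIN actually uses.

```python
from typing import Callable, List, Optional, Set

# A heuristic maps a noun phrase to an article title, or None if it gives up.
Heuristic = Callable[[str, Set[str]], Optional[str]]


def exact_title(phrase: str, titles: Set[str]) -> Optional[str]:
    """Hypothetical heuristic: the phrase is itself an article title."""
    return phrase if phrase in titles else None


def case_insensitive_title(phrase: str, titles: Set[str]) -> Optional[str]:
    """Hypothetical heuristic: match titles while ignoring letter case."""
    by_lowercase = {title.lower(): title for title in titles}
    return by_lowercase.get(phrase.lower())


def resolve(
    phrase: str, titles: Set[str], heuristics: List[Heuristic]
) -> Optional[str]:
    """Apply heuristics in order; the first one that answers wins."""
    for heuristic in heuristics:
        target = heuristic(phrase, titles)
        if target is not None:
            return target
    return None


TITLES: Set[str] = {"Clearfield County, Pennsylvania", "Pennsylvania"}
ORDERED_HEURISTICS: List[Heuristic] = [exact_title, case_insensitive_title]

if __name__ == "__main__":
    print(resolve("pennsylvania", TITLES, ORDERED_HEURISTICS))
```

Since the first answering heuristic wins, placing stricter heuristics earlier favors precision, which is one reason the order in which heuristics are tried can affect link-generation performance.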
Tables
  • Table 1: Feature sets used by the CRF extractor (a domain-independent set which is fast to compute; the current implementation uses the sentence’s tokens and their part-of-speech (POS) tags as features)
  • Table 2: Estimated precision of the document classifier
  • Table 3: Estimated recall of the document classifier
  • Table 4: Relative performance of people and KYLIN on infobox attribute extraction
  • Table 5: Performance of various link-generation heuristics on existing links
  • Table 6: Performance of various link-generation heuristics on new links
  • Table 7: Effect of different heuristic orders on link generation performance
Related work
  • We group related work into several categories: bootstrapping the semantic web, unsupervised information extraction, extraction from Wikipedia, and related Wikipedia-based systems.

    Bootstrapping the Semantic Web: REVERE [17] aims to cross the chasm between structured and unstructured data by providing a platform to facilitate the authoring, querying and sharing of data. It relies on human effort to gain semantic data, while KYLIN is fully autonomous. DeepMiner [30] bootstraps domain ontologies for Semantic Web services from source web sites. It extracts concepts and instances from semi-structured data over source interface and data pages, while KYLIN handles both semi-structured and unstructured data in Wikipedia. The SemTag and Seeker [10] systems perform automated semantic tagging of large corpora. They use the TAP knowledge base [27] as the standard ontology, and use it to match instances on the Web. In contrast, KYLIN doesn’t assume any particular ontology, and tries to extract all desired semantic data within Wikipedia.
Funding
  • This work was supported by NSF grant IIS-0307906, ONR grant N00014-06-1-0147, SRI CALO grant 03-000225 and the WRF / TJ Cable Professorship.
References
  • http://opennlp.sourceforge.net/.
  • S. F. Adafre and M. de Rijke. Discovering missing links in wikipedia. In Proceedings of the 3rd International Workshop on Link Discovery at KDD05, Chicago, USA, August 2005.
  • S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In ESWC, 2007.
  • M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open information extraction from the Web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.
  • T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, May 2001.
  • L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
  • E. Brill, S. Dumais, and M. Banko. An analysis of the AskMSR question-answering system. In Proceedings of EMNLP, 2002.
  • C. L. A. Clarke, G. V. Cormack, and T. R. Lynam. Exploiting redundancy in question answering. In Proceedings of the 24th Annual International ACM SIGIR Conference, 2001.
  • R. de Salvo Braz, R. Girju, V. Punyakanok, D. Roth, and M. Sammons. An inference model for semantic entailment in natural language. In National Conference on Artificial Intelligence (AAAI), pages 1678–1679, 2005.
  • S. Dill, N. Eiron, D. Gibson, D. Gruhl, R. Guha, A. Jhingran, T. Kanungo, S. Rajagopalan, A. Tomkins, J. Tomlin, and J. Y. Zien. Semtag and Seeker: bootstrapping the Semantic Web via automated semantic annotation. In Proceedings of 12th International World Wide Web Conference, pages 178–186, 2003.
  • A. Doan and A. Halevy. Semantic integration research in the database community: A brief survey. AI Magazine, Special Issue on Semantic Integration, 2005.
  • D. Downey, O. Etzioni, and S. Soderland. A probabilistic model of redundancy in information extraction. In Procs. of IJCAI 2005, 2005.
  • S. Dumais, M. Banko, E. Brill, J. Lin, and A. Ng. Web question answering: Is more always better? In Proceedings of the 25th Annual International ACM SIGIR Conference, 2002.
  • O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates. Unsupervised named-entity extraction from the Web: An experimental study. Artificial Intelligence, 165(1):91–134, 2005.
  • E. Gabrilovich and S. Markovitch. Overcoming the brittleness bottleneck using wikipedia: Enhancing text categorization with encyclopedic knowledge. In Proceedings of the 21st National Conference on Artificial Intelligence, pages 1301–1306, 2006.
  • E. Gabrilovich and S. Markovitch. Computing semantic relatedness using wikipedia-based explicit semantic analysis. In Proceedings of The 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, January 2007.
  • A. Y. Halevy, O. Etzioni, A. Doan, Z. G. Ives, J. Madhavan, L. McDowell, and I. Tatarinov. Crossing the structure chasm. In Proceedings of CIDR, 2003.
  • C. T. Kwok, O. Etzioni, and D. Weld. Scaling question answering to the Web. ACM Transactions on Information Systems (TOIS), 19(3):242–262, 2001.
  • J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML), 2001.
  • B. MacCartney and C. D. Manning. Natural logic for textual inference. In Workshop on Textual Entailment and Paraphrasing, ACL 2007, 2007.
  • A. K. McCallum. Mallet: A machine learning for language toolkit. In http://mallet.cs.umass.edu, 2002.
  • R. Meir and G. Ratsch. An introduction to boosting and leveraging. Journal of Artificial Intelligence Research, Advanced Lectures on Machine Learning:118–183, 2003.
  • D. P. Nguyen, Y. Matsuo, and M. Ishizuka. Exploiting syntactic and semantic information for relation extraction from wikipedia. In IJCAI07-TextLinkWS, 2007.
  • K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In Proceedings of the IJCAI-99 Workshop on Machine Learning for Information Filtering, 1999.
  • D. Opitz and R. Maclin. Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, pages 169–198, 1999.
  • S. P. Ponzetto and M. Strube. Deriving a large scale taxonomy from wikipedia. In Proceedings of the 22nd National Conference on Artificial Intelligence, pages 1440–1445, 2007.
  • E. Riloff and J. Shepherd. A corpus-based approach for building semantic lexicons. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 117–124, Providence, RI, 1997.
  • F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: A core of semantic knowledge - unifying WordNet and Wikipedia. In Proceedings of the 16th International Conference on World Wide Web, 2007.
  • M. Volkel, M. Krotzsch, D. Vrandecic, H. Haller, and R. Studer. Semantic wikipedia. In Proceedings of the 15th International Conference on World Wide Web, 2006.
  • W. Wu, A. Doan, C. Yu, and W. Meng. Bootstrapping domain ontology for Semantic Web services from source web sites. In Proceedings of the VLDB-05 Workshop on Technologies for E-Services, 2005.