Large-Scale Named Entity Disambiguation Based on Wikipedia Data

EMNLP-CoNLL, pp.708-716, (2007)

Abstract

This paper presents a large-scale system for the recognition and semantic disambiguation of named entities based on information extracted from a large encyclopedic collection and Web search results. It describes in detail the disambiguation paradigm employed and the information extraction process from Wikipedia. Through a process of maxim...

Introduction
  • The ability to identify named entities has been established as an important task in several areas, including topic detection and tracking, machine translation, and information retrieval.
  • Note that an entity (such as the current president of the U.S.) can be referred to by multiple surface forms (e.g., “George Bush” and “Bush”) and a surface form (e.g., “Bush”) can refer to multiple entities.
  • When it was introduced, in the 6th Message Understanding Conference (Grishman and Sundheim, 1996), the named entity recognition task comprised three entity identification and labeling subtasks: ENAMEX, TIMEX and NUMEX.
  • Since 1995, other similar named entity recognition tasks have been defined, among which
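The many-to-many relation between surface forms and entities noted above can be pictured as a pair of lookup tables. The following is a minimal sketch for illustration only, not the paper's implementation: the function names and the entity labels in the sample pairs are assumptions introduced here.

```python
from collections import defaultdict


def build_indexes(mentions):
    """Build two lookup tables from (surface_form, entity) pairs.

    Returns a map from surface form to the set of entities it can denote,
    and a map from entity to the set of surface forms used for it.
    """
    entities_by_form = defaultdict(set)
    forms_by_entity = defaultdict(set)
    for surface_form, entity in mentions:
        entities_by_form[surface_form].add(entity)
        forms_by_entity[entity].add(surface_form)
    return entities_by_form, forms_by_entity


def is_ambiguous(surface_form, entities_by_form):
    """A surface form is ambiguous when it maps to more than one entity."""
    return len(entities_by_form.get(surface_form, ())) > 1


# Illustrative pairs only; the entity labels are made up for this sketch.
pairs = [
    ("George Bush", "George W. Bush"),
    ("Bush", "George W. Bush"),
    ("Bush", "Bush (band)"),
]
entities_by_form, forms_by_entity = build_indexes(pairs)
```

With these sample pairs, “Bush” maps to two entities and is therefore ambiguous, while “George Bush” maps to one.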
Highlights
  • We evaluated the system in two ways: on a set of Wikipedia articles, by comparing the system output with the references created by human contributors, and on a set of news stories, by doing a post hoc evaluation of the system output
  • We computed a disambiguation baseline in the following manner: for each surface form, if there was an entity page or redirect page whose title exactly matches the surface form, we chose the corresponding entity as the baseline disambiguation; otherwise, we chose the entity most frequently mentioned in Wikipedia using that surface form
  • In an attempt to discard most of the non-named entities, we only kept for evaluation the surface forms that started with an uppercase letter
  • We presented a large-scale named entity disambiguation system that employs a huge amount of information automatically extracted from Wikipedia over a space of more than 1.4 million entities
  • The application on a large scale of such an entity extraction and disambiguation system could result in a move from the current space of words to a space of concepts, which enables several paradigm shifts and opens new research directions, which we are currently investigating, from entity-based indexing and searching of document collections to personalized views of the Web through entity-based user bookmarks
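The baseline rule and the uppercase filter described in the highlights can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' code: `titles`, `link_counts`, and both function names are hypothetical, and the counts in the sample data are invented.

```python
from collections import Counter


def baseline_disambiguate(surface_form, titles, link_counts):
    """Baseline: exact title or redirect match first, else most frequent entity.

    titles: dict mapping entity-page and redirect-page titles to an entity
            (hypothetical structure for this sketch).
    link_counts: dict mapping a surface form to a Counter of how often each
                 entity is mentioned with that surface form.
    Returns None when no disambiguation can be hypothesized.
    """
    if surface_form in titles:
        return titles[surface_form]
    counts = link_counts.get(surface_form)
    if counts:
        return counts.most_common(1)[0][0]
    return None


def keep_for_evaluation(surface_form):
    """Keep only surface forms that start with an uppercase letter."""
    return surface_form[:1].isupper()


# Invented sample data, for illustration only.
titles = {"Texas": "Texas"}
link_counts = {"Bush": Counter({"George W. Bush": 5, "Bush (band)": 2})}
```

Returning `None` for unseen surface forms mirrors the case, reported in the results, where neither system can hypothesize a disambiguation.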
Results
  • We used as development data for building the described system the Wikipedia collection as of April 2, 2006 and a set of 100 news stories on a diverse range of topics.
  • 130 of the surface forms were not used in other Wikipedia articles and both the baseline and the proposed system could not hypothesize a disambiguation for them.
  • When restricting the test set only to the 1,668 ambiguous surface forms, the difference in accuracy between the two systems is significant at p = 0.01.
  • An error analysis showed that the Wikipedia set used as gold standard contained relatively many surface forms with erroneous or out-of-date links, many of them being correctly disambiguated by the proposed system.
  • For example, the test page “The Gods” links to Paul Newton, the painter, and Uriah Heep, which is a disambiguation page, probably because the original pages changed over time, while the proposed system correctly hypothesizes links to Paul Newton (musician) and Uriah Heep (band).
Conclusion
  • We presented a large-scale named entity disambiguation system that employs a huge amount of information automatically extracted from Wikipedia over a space of more than 1.4 million entities.
  • In tests on both real news data and Wikipedia text, the system obtained accuracies exceeding 91% and 88%, respectively.
  • The application on a large scale of such an entity extraction and disambiguation system could result in a move from the current space of words to a space of concepts, which enables several paradigm shifts and opens new research directions, which we are currently investigating, from entity-based indexing and searching of document collections to personalized views of the Web through entity-based user bookmarks.
References
  • Bagga, A. and B. Baldwin. 1998. Entity-based cross-document coreferencing using the vector space model. In Proceedings of COLING-ACL, 79-85.
  • Bunescu, R. and M. Paşca. 2006. Using Encyclopedic Knowledge for Named Entity Disambiguation. In Proceedings of EACL, 9-16.
  • Cederberg, S. and D. Widdows. 2003. Using LSA and noun coordination information to improve the precision and recall of hyponymy extraction. In Proceedings of CoNLL, 111-118.
  • Doddington, G., A. Mitchell, M. Przybocki, L. Ramshaw, S. Strassel, and R. Weischedel. 2004. ACE program – task definitions and performance measures. In Proceedings of LREC, 837-840.
  • Edmonds, P. and S. Cotton. 2001. Senseval-2 overview. In Proceedings of SENSEVAL-2, 1-6.
  • Gabrilovich, E. and S. Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of IJCAI, 1606-1611.
  • Gale, W., K. Church, and D. Yarowsky. 1992. One sense per discourse. In Proceedings of the 4th DARPA SNL Workshop, 233-237.
  • Grishman, R. and B. Sundheim. 1996. Message Understanding Conference - 6: A brief history. In Proceedings of COLING, 466-471.
  • Hearst, M. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of COLING, 539-545.
  • Hirschman, L. and N. Chinchor. 1997. MUC-7 Coreference Task Definition. In Proceedings of MUC-7.
  • Kanada, Y. 1999. A method of geographical name extraction from Japanese text. In Proceedings of CIKM, 46-54.
  • Kilgarriff, A. and J. Rosenzweig. 2000. Framework and results for English Senseval. Computers and Humanities, Special Issue on SENSEVAL, 15-48.
  • Lapata, M. and F. Keller. 2004. The Web as a Baseline: Evaluating the Performance of Unsupervised Web-based Models for a Range of NLP Tasks. In Proceedings of HLT, 121-128.
  • Mann, G. S. and D. Yarowsky. 2003. Unsupervised Personal Name Disambiguation. In Proceedings of CoNLL, 33-40.
  • Mihalcea, R., T. Chklovski, and A. Kilgarriff. 2004. The Senseval-3 English lexical sample task. In Proceedings of SENSEVAL-3, 25-28.
  • Overell, S. and S. Rüger. 2006. Identifying and grounding descriptions of places. In SIGIR Workshop on Geographic Information Retrieval.
  • Raghavan, H., J. Allan, and A. McCallum. 2004. An exploration of entity models, collective classification and relation description. In KDD Workshop on Link Analysis and Group Detection.
  • Ravin, Y. and Z. Kazi. 1999. Is Hillary Rodham Clinton the President? In ACL Workshop on Coreference and its Applications.
  • Remy, M. 2002. Wikipedia: The free encyclopedia. In Online Information Review, 26(6): 434.
  • Roark, B. and E. Charniak. 1998. Noun-phrase cooccurrence statistics for semi-automatic semantic lexicon construction. In Proceedings of COLING-ACL, 1110-1116.
  • Salton, G. 1989. Automatic Text Processing. Addison-Wesley.
  • Smith, D. A. and G. Crane. 2002. Disambiguating geographic names in a historic digital library. In Proceedings of ECDL, 127-136.
  • Strube, M. and S. P. Ponzeto. 2006. WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of AAAI, 1419-1424.
  • Tjong Kim Sang, E. F. and F. De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of CoNLL, 142-147.
  • Wacholder, N., Y. Ravin, and M. Choi. 1997. Disambiguation of proper names in text. In Proceedings of ANLP, 202-208.
  • Woodruff, A. G. and C. Plaunt. 1994. GIPSY: Automatic geographic indexing of documents. Journal of the American Society for Information Science and Technology, 45(9):645-655.