Learning to link with wikipedia

CIKM, pp. 509-518 (2008)

Cited by: 1325 | Views: 194

Abstract

This paper describes how to automatically cross-reference documents with Wikipedia: the largest knowledge base ever known. It explains how machine learning can be used to identify significant terms within unstructured text, and enrich it with links to the appropriate Wikipedia articles. The resulting link detector and disambiguator perfor...

Introduction
  • Wikipedia has seen a meteoric rise in scale and popularity over the last few years
  • It is the largest, most visited encyclopedia in existence.
  • It is densely structured; its articles are peppered with hundreds of millions of links.
  • These connections explain the topics being discussed, and provide an environment where serendipitous encounters with information are commonplace.
  • Wikipedia is a classic “small world,” so richly hyperlinked that it takes, on average, just 4.5 clicks to get from one article to any other (Dolan, 2008)
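The "clicks between articles" figure is an average shortest-path length over the link graph. As a minimal sketch of how such a number can be computed, the Python below runs a breadth-first search from every article of a tiny hypothetical graph. The names `clicks_from`, `average_clicks` and `toy_graph` are illustrative assumptions, not taken from Dolan's analysis, and a real Wikipedia-scale computation would need sampling or a more scalable method.

```python
from collections import deque


def clicks_from(graph, start):
    """Breadth-first search: fewest clicks from `start` to each reachable article."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, ()):
            if target not in dist:
                dist[target] = dist[page] + 1
                queue.append(target)
    return dist


def average_clicks(graph):
    """Mean shortest-path length over all ordered pairs of distinct, connected articles."""
    total = 0
    pairs = 0
    for start in graph:
        for target, d in clicks_from(graph, start).items():
            if target != start:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical three-article graph: each article links to exactly one other.
toy_graph = {
    "A": ["B"],
    "B": ["C"],
    "C": ["A"],
}

if __name__ == "__main__":
    print(average_clicks(toy_graph))
```

Breadth-first search is used because every link costs one click, so the first time an article is reached is also the cheapest way to reach it.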
Highlights
  • Wikipedia has seen a meteoric rise in scale and popularity over the last few years
  • We have developed a machine-learning approach to disambiguation that uses the links found within Wikipedia articles for training
  • Comparison of articles is facilitated by the Wikipedia Link-based Measure we developed in previous work (Milne and Witten, 2008), which measures the semantic similarity of two Wikipedia pages by comparing their incoming and outgoing links
  • We have described an algorithm that disambiguates terms to their appropriate Wikipedia articles, and determines those that are most likely to be of interest to the reader
  • The resulting disambiguation classifier was 1% worse (f-measure) when disambiguating links, but behaves more consistently when incorporated into the wikifier
  • We have developed a tool that can accurately cross-reference documents with the largest knowledge base in existence
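The idea of comparing two pages through their links can be illustrated with a short sketch. The summary above only says that incoming and outgoing links are compared; the formula below is a normalized-distance form over incoming links that is often associated with this measure, and it should be read as an assumption rather than as the authors' exact definition. The function name `link_relatedness` and its parameters are hypothetical.

```python
import math


def link_relatedness(links_a, links_b, total_articles):
    """Relatedness in [0, 1] from the overlap of two articles' incoming links.

    Assumed form: 1 - (log(max) - log(shared)) / (log(total) - log(min)),
    where max/min are the sizes of the two link sets and `shared` is the
    size of their intersection.
    """
    a, b = set(links_a), set(links_b)
    shared = a & b
    if not shared:
        return 0.0  # no common links: treat the pages as unrelated
    larger = max(len(a), len(b))
    smaller = min(len(a), len(b))
    if total_articles <= smaller:
        raise ValueError("total_articles must exceed the size of both link sets")
    distance = (math.log(larger) - math.log(len(shared))) / (
        math.log(total_articles) - math.log(smaller)
    )
    # Clamp so that very weak overlaps do not produce negative scores.
    return max(0.0, 1.0 - distance)


if __name__ == "__main__":
    # Hypothetical article IDs that link to each of the two pages.
    print(link_relatedness({1, 2, 3, 4}, {3, 4, 5, 6}, total_articles=100))
```

In a disambiguation setting, a score like this could be computed between each candidate sense of an ambiguous term and the unambiguous articles found in the surrounding text.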
Methods
  • Participants and Tasks

    To gather willing participants to inspect the wikified news stories, the authors turned to Mechanical Turk (Barr and Cabrera 2006), a crowdsourcing service hosted by Amazon.
  • From the perspective of the people who develop these applications—who are known as requestors—the process is a function call where a question is asked and the answer is returned
  • What makes this system unique is the thousands-strong crowd of human contributors—or workers—who wait at the receiving end of the calls.
  • Mechanical Turk provided the means to conduct a labor-intensive experiment under significant time constraints, without the authors having to gather participants themselves
  • This raises some concerns about whether the anonymous workers could be trusted to invest the required effort and give well considered responses.
  • These concerns are discussed alongside the descriptions of the two different types of tasks that the authors had the workers perform
Results
  • As is to be expected for subjective tasks, there was some disagreement between the evaluators.
  • The authors resolved this issue by combining the responses in the analysis stage into a single option: that the link was irrelevant and/or unhelpful.
  • Following this combination, the authors found that 57% of the links received a unanimous decision from all three evaluators.
  • Because there is only one possible response that indicates a valid link, these were judged to be incorrect—for an unknown reason
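The merging step and the unanimity figure can be made concrete with a small sketch. The label strings (`"valid"`, `"irrelevant"`, `"unhelpful"`) and the sample judgments below are hypothetical; only the idea of collapsing the negative options into one before counting unanimous decisions among three evaluators comes from the text.

```python
def merge_negative(label):
    """Collapse the two negative options into a single combined option."""
    if label in ("irrelevant", "unhelpful"):
        return "irrelevant_or_unhelpful"
    return label


def unanimous_fraction(judgments_per_link):
    """Fraction of links on which every evaluator agrees after merging."""
    if not judgments_per_link:
        return 0.0
    unanimous = sum(
        1
        for labels in judgments_per_link
        if len({merge_negative(label) for label in labels}) == 1
    )
    return unanimous / len(judgments_per_link)


# Hypothetical responses: one tuple of three evaluator labels per link.
sample_judgments = [
    ("valid", "valid", "valid"),
    ("irrelevant", "unhelpful", "unhelpful"),  # unanimous only after merging
    ("valid", "irrelevant", "valid"),
    ("unhelpful", "unhelpful", "valid"),
]

if __name__ == "__main__":
    print(unanimous_fraction(sample_judgments))
```

The second sample link shows why merging matters: its evaluators disagree on the raw labels but agree that the link is not a good one.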
Conclusion
  • The authors are by no means the first to recognize Wikipedia’s potential for describing and organizing information.
  • It is fast becoming the resource of choice for such tasks, and has been applied to text categorization (Gabrilovich and Markovitch 2007), indexing (Medelyan et al 2008), clustering (Banerjee et al 2007), searching (Milne et al 2007), and a host of other problems
  • This popularity is entirely understandable: Wikipedia offers scale and multilingualism that dwarfs other knowledge bases, and an ability to evolve quickly and cover even the most turbulent of domains (Lih 2004).
  • Instead these methods are only evaluated extrinsically, by how well they support the overall task
Tables
  • Table 1: Performance of classifiers for disambiguation over development data
  • Table 2: Performance of disambiguation algorithms over final test data

    [...] has the worst performance. There are dependencies between the features that lead this scheme astray. Interestingly, Quinlan’s (1993) C4.5 algorithm outperforms the more sophisticated Support Vector Machine, and so it is used in the remainder of the paper. Feature selection makes no difference, and bagging improves the classifier by only 0.3%.
  • Table 3: Performance of classifiers for link detection
  • Table 4: Performance of link detection algorithms

    [...] stripped of all markup and handed to the link detector, which produced its own list of link-worthy topics for each article. This evaluation is only concerned with identifying the correct topics that should be linked to, and not the exact locations from which these links should be made. This is consistent with Mihalcea and Csomai’s work, which compared vocabularies of anchors, but not their locations.
  • Table 5: Accuracy of the automatically detected links
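Because the link-detection evaluation compares sets of topics rather than link positions, precision, recall and f-measure can be computed directly on sets. The sketch below assumes the standard balanced f-measure (the harmonic mean of precision and recall); the function name and the example topics are hypothetical and are not taken from the paper's data.

```python
def precision_recall_f(predicted_topics, gold_topics):
    """Set-based precision, recall and balanced f-measure for one article."""
    predicted = set(predicted_topics)
    gold = set(gold_topics)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical topics: what a detector proposed vs. what the article links to.
    detected = {"Machine learning", "Wikipedia", "Hyperlink", "Amazon"}
    linked = {"Machine learning", "Wikipedia", "Crowdsourcing"}
    print(precision_recall_f(detected, linked))
```

Using sets means a topic counts once no matter how many times it is mentioned, which matches an evaluation that ignores where in the text each link is placed.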
Related work
  • Automatically augmenting text with links to web pages has been controversial in the past. When developing Windows XP, Microsoft released plans for the Smart-Tag service which was to automatically add links to web-pages within Windows Explorer. This was aborted when many expressed concern that pages were being “surreptitiously” modified for commercial purposes (Mossberg, 2001). Google’s AutoLink feature has received similar criticism and has not been widely accepted. Consequently automatic linking is most successful when restricted to safe domains such as cinema (Drenner et al 2006).

    Using Wikipedia as a destination for links sidesteps most of the concerns about automatic link generation, since the resource strives to be impartial and does not generate profits. To our knowledge, the only existing attempt to use Wikipedia in this way is the Wikify system developed by Mihalcea and Csomai (2007). This system works in two separate stages. The first, detection, involves identifying the terms and phrases from which links should be made.
Funding
  • Finally, we must of course acknowledge the tireless efforts of the Web 2.0 community, without whom resources like Wikipedia and Mechanical Turk would not exist. This research was conducted with funding from the New Zealand Tertiary Education Commission and the New Zealand Digital Library Group
Reference
  • Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R. and Ives, Z. (2007) DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference, Busan, Korea.
  • Banerjee, S., Ramanathan, K. and Gupta, A. (2007) Clustering short texts using Wikipedia. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, Amsterdam, pp. 787-788.
  • Barr, J. and Cabrera, L.F. (2006) AI gets a brain. In ACM Queue 4(4), pp. 24-29.
  • David, C., Giroux, L., Bertrand-Gastaldy, S. and Lanteigne, D. (1995) Indexing as problem solving: A cognitive approach to consistency. In Proceedings of the ASIS Annual Meeting, Medford, NJ, pp. 49-55.
  • Dolan, S. (2008) Six Degrees of Wikipedia. Retrieved June 2008 from www.netsoc.tcd.ie/~mu/wiki/
  • Drenner, S., Harper, M., Frankowski, D., Riedl, J. and Terveen, L. (2006) Insert movie reference here: a system to bridge conversation and item-oriented web sites. In Proceedings of the SIGCHI conference on Human Factors in computing systems, New York, NY, pp. 951-954.
  • Gabrilovich, E. and Markovitch, S. (2007) Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, Boston, MA.
  • Howe, J. (2006) The Rise of Crowdsourcing. In Wired Magazine 14(6).
  • Lih, A. (2004) Wikipedia as Participatory Journalism: Reliable Sources? Metrics for evaluating collaborative media as a news resource. In Proceedings of the 5th International Symposium on Online Journalism, Austin, Texas.
  • Maron, M.E. (1977) On indexing, retrieval and the meaning of about. In Journal of the American Society for Information Science 28(1), pp. 38-43.
  • Medelyan, O., Witten, I.H. and Milne, D. (2008) Topic Indexing with Wikipedia. In Proceedings of the AAAI 2008 Workshop on Wikipedia and Artificial Intelligence (WIKIAI 2008), Chicago, IL.
  • Mihalcea, R. and Csomai, A. (2007) Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the 16th ACM Conference on Information and Knowledge management (CIKM’07), Lisbon, Portugal, pp. 233-242.
  • Milne, D., Witten, I.H. and Nichols, D.M. (2007) A Knowledge-Based Search Engine Powered by Wikipedia. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM'2007), Lisbon, Portugal.
  • Milne, D. and Witten, I.H. (2008) An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of the AAAI 2008 Workshop on Wikipedia and Artificial Intelligence (WIKIAI 2008), Chicago, IL.
  • Mossberg, W. (2001) New Windows XP Feature Can ReEdit Others' Sites. The Wall Street Journal, June 2001.
  • Ponzetto, S.P. and Strube, M. (2007) Deriving a Large Scale Taxonomy from Wikipedia. In Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI’07), Vancouver, British Columbia, pp. 1440-1445.
  • Quinlan, J.R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann.
  • Suchanek, F.M., Kasneci, G. and Weikum, G. (2007) Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web (WWW’07), Alberta, Canada, pp. 697-706.
  • Völkel, M., Krötzsch, M., Vrandecic, D., Haller, H. and Studer, R. (2006) Semantic Wikipedia. In Proceedings of the 15th international conference on World Wide Web (WWW’06), Edinburgh, Scotland, pp. 585-594.