Learning to link with Wikipedia
CIKM, pp. 509-518 (2008)
This paper describes how to automatically cross-reference documents with Wikipedia: the largest knowledge base ever known. It explains how machine learning can be used to identify significant terms within unstructured text, and enrich it with links to the appropriate Wikipedia articles. The resulting link detector and disambiguator perfor...
- Wikipedia has seen a meteoric rise in scale and popularity over the last few years
- It is the largest, most visited encyclopedia in existence.
- It is densely structured; its articles are peppered with hundreds of millions of links.
- These connections explain the topics being discussed, and provide an environment where serendipitous encounters with information are commonplace.
- Wikipedia is a classic “small world,” so richly hyperlinked that it takes, on average, just 4.5 clicks to get from one article to any other (Dolan, 2008)
- We have developed a machine-learning approach to disambiguation that uses the links found within Wikipedia articles for training
- Comparison of articles is facilitated by the Wikipedia Link-based Measure we developed in previous work (Milne and Witten, 2008), which measures the semantic similarity of two Wikipedia pages by comparing their incoming and outgoing links
- We have described an algorithm that disambiguates terms to their appropriate Wikipedia articles, and determines those that are most likely to be of interest to the reader
- The resulting disambiguation classifier was 1% worse (f-measure) when disambiguating links, but behaves more consistently when incorporated into the wikifier
- We have developed a tool that can accurately cross-reference documents with the largest knowledge base in existence
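The idea of comparing two articles through their incoming and outgoing links can be illustrated with a small sketch. This summary does not state the formula of the Wikipedia Link-based Measure, so the code below substitutes a plain Jaccard overlap as a stand-in; it is not the published measure. The function names and the toy link sets are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two sets: |a & b| / |a | b| (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_similarity(in_a, out_a, in_b, out_b) -> float:
    """Average the overlap of incoming links and the overlap of outgoing links.

    Assumption: a simple Jaccard stand-in, not the Wikipedia Link-based Measure.
    """
    incoming = jaccard(set(in_a), set(in_b))
    outgoing = jaccard(set(out_a), set(out_b))
    return 0.5 * (incoming + outgoing)


# Hypothetical toy data: titles of articles linking to / linked from each page.
in_dog = {"Pet", "Mammal", "Wolf"}
out_dog = {"Canidae", "Domestication"}
in_cat = {"Pet", "Mammal", "Lion"}
out_cat = {"Felidae", "Domestication"}

score = link_similarity(in_dog, out_dog, in_cat, out_cat)
print(f"similarity(Dog, Cat) = {score:.3f}")
```

Pages that share many linking and linked articles score close to 1, while pages with no links in common score 0.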
- Participants and Tasks
To gather willing participants to inspect the wikified news stories the authors turned to Mechanical Turk (Barr and Cabrera 2006), a crowdsourcing service hosted by Amazon.
- From the perspective of the people who develop these applications—who are known as requestors—the process is a function call where a question is asked and the answer is returned
- What makes this system unique is the thousands-strong crowd of human contributors—or workers—who wait at the receiving end of the calls.
- Mechanical Turk provided the means to conduct a labor-intensive experiment under significant time constraints, without having to gather participants ourselves
- This raises some concerns about whether the anonymous workers could be trusted to invest the required effort and give well considered responses.
- These concerns are discussed alongside the descriptions of the two different types of tasks that the authors had the workers perform.
- As is to be expected for subjective tasks, there was some disagreement between the evaluators.
- 19.8 incorrect unhelpful
- The authors resolved this issue by combining the responses in the analysis stage into a single option: that the link was irrelevant and/or unhelpful.
- Following this combination, the authors found that 57% of the links received a unanimous decision from all three evaluators.
- Because there is only one possible response that indicates a valid link, these were judged to be incorrect—for an unknown reason
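The merging step described above, where the rejecting responses are collapsed into a single option before agreement is counted, can be sketched as follows. The label strings, function names, and toy judgements are hypothetical; the only things taken from the text are the merge itself and the use of three evaluators per link.

```python
VALID = "correct"
REJECTED = "irrelevant and/or unhelpful"


def merge_label(label: str) -> str:
    """Collapse every non-valid response into one combined rejection label."""
    return VALID if label == VALID else REJECTED


def unanimous_fraction(judgements: dict) -> float:
    """Fraction of links on which all evaluators agree after merging labels."""
    if not judgements:
        return 0.0
    unanimous = sum(
        1
        for labels in judgements.values()
        if len({merge_label(label) for label in labels}) == 1
    )
    return unanimous / len(judgements)


# Hypothetical toy data: three evaluator responses per detected link.
toy_judgements = {
    "link_a": ["correct", "correct", "correct"],
    "link_b": ["incorrect", "unhelpful", "incorrect"],
    "link_c": ["correct", "unhelpful", "correct"],
    "link_d": ["unhelpful", "unhelpful", "correct"],
}

print(unanimous_fraction(toy_judgements))
```

In this toy data, `link_b` only counts as unanimous because "incorrect" and "unhelpful" are merged; without the merge the evaluators would appear to disagree on it.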
- The authors are by no means the first to recognize Wikipedia’s potential for describing and organizing information.
- It is fast becoming the resource of choice for such tasks, and has been applied to text categorization (Gabrilovich and Markovitch 2007), indexing (Medelyan et al 2008), clustering (Banerjee et al 2007), searching (Milne et al 2007), and a host of other problems.
- This popularity is entirely understandable: Wikipedia offers scale and multilingualism that dwarf other knowledge bases, and an ability to evolve quickly and cover even the most turbulent of domains (Lih 2004).
- Instead these methods are only evaluated extrinsically, by how well they support the overall task
- Table 1: Performance of classifiers for disambiguation over development data
- Table 2: Performance of disambiguation algorithms over final test data. One of the classifiers compared has the worst performance, because dependencies between the features lead this scheme astray. Interestingly, Quinlan's (1993) C4.5 algorithm outperforms the more sophisticated Support Vector Machine, and so it is used in the remainder of the paper. Feature selection makes no difference, and bagging improves the classifier by only 0.3%.
- Table 3: Performance of classifiers for link detection
- Table 4: Performance of link detection algorithms. The test articles were stripped of all markup and handed to the link detector, which produced its own list of link-worthy topics for each article. This evaluation is only concerned with identifying the correct topics that should be linked to, and not the exact locations from which these links should be made. This is consistent with Mihalcea and Csomai's work, which compared vocabularies of anchors, but not their locations.
- Table 5: Accuracy of the automatically detected links
- Automatically augmenting text with links to web pages has been controversial in the past. When developing Windows XP, Microsoft released plans for the Smart-Tag service which was to automatically add links to web-pages within Windows Explorer. This was aborted when many expressed concern that pages were being “surreptitiously” modified for commercial purposes (Mossberg, 2001). Google’s AutoLink feature has received similar criticism and has not been widely accepted. Consequently automatic linking is most successful when restricted to safe domains such as cinema (Drenner et al 2006).
Using Wikipedia as a destination for links sidesteps most of the concerns about automatic link generation, since the resource strives to be impartial and does not generate profits. To our knowledge, the only existing attempt to use Wikipedia in this way is the Wikify system developed by Mihalcea and Csomai (2007). This system works in two separate stages. The first, detection, involves identifying the terms and phrases from which links should be made.
- Finally, we must of course acknowledge the tireless efforts of the Web 2.0 community, without whom resources like Wikipedia and Mechanical Turk would not exist. This research was conducted with funding from the New Zealand Tertiary Education Commission and the New Zealand Digital Library Group
- Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R. and Ives, Z. (2007) DBpedia: A Nucleus for a Web of Open Data. In Proceedings of the 6th International Semantic Web Conference, Busan, Korea.
- Banerjee, S., Ramanathan, K. and Gupta, A. (2007) Clustering short texts using Wikipedia. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, Amsterdam, pp. 787-788.
- Barr, J. and Cabrera, L.F. (2006) AI gets a brain. In ACM Queue 4(4), pp. 24-29.
- David, C., L. Giroux, S. Bertrand-Gastaldy, and D. Lanteigne (1995) Indexing as problem solving: A cognitive approach to consistency. In Proceedings of the ASIS Annual Meeting, Medford, NJ, pp. 49-55.
- Dolan, S. (2008) Six Degrees of Wikipedia. Retrieved June 2008 from www.netsoc.tcd.ie/~mu/wiki/
- Drenner, S., Harper, M., Frankowski, D., Riedl, J. and Terveen, L. (2006) Insert movie reference here: a system to bridge conversation and item-oriented web sites. In Proceedings of the SIGCHI conference on Human Factors in computing systems, New York, NY, pp. 951-954
- Gabrilovich, E. and Markovitch, S. (2007) Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, Boston, MA.
- Howe, J. (2006) The Rise of Crowdsourcing. In Wired Magazine 14(6).
- Lih, A. (2004) Wikipedia as Participatory Journalism: Reliable Sources? Metrics for evaluating collaborative media as a news resource. In Proceedings of the 5th International Symposium on Online Journalism, Austin, Texas.
- Maron, M.E. (1977) On indexing, retrieval and the meaning of about. In Journal of the American Society for Information Science 28(1), pp. 38-43
- Medelyan, O., Witten, I.H. and Milne, D. (2008) Topic Indexing with Wikipedia. In Proceedings of the AAAI 2008 Workshop on Wikipedia and Artificial Intelligence (WIKIAI 2008), Chicago, IL.
- Mihalcea, R. and Csomai, A. (2007) Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the 16th ACM Conference on Information and Knowledge management (CIKM’07), Lisbon, Portugal, pp. 233-242.
- Milne, D., Witten, I.H. and Nichols, D.M. (2007) A Knowledge-Based Search Engine Powered by Wikipedia. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM'2007), Lisbon, Portugal.
- Milne, D., and Witten, I.H. (2008) An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of the AAAI 2008 Workshop on Wikipedia and Artificial Intelligence (WIKIAI 2008), Chicago, IL.
- Mossberg, W. (2001) New Windows XP Feature Can Re-Edit Others' Sites. The Wall Street Journal, June 2001.
- Ponzetto, S.P. and Strube, M. (2007) Deriving a Large Scale Taxonomy from Wikipedia. In Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI’07), Vancouver, British Columbia, pp. 1440-1445.
- Quinlan, J.R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann.
- Suchanek, F.M., Kasneci, G. and Weikum, G. (2007) Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web (WWW’07), Alberta, Canada, pp. 697-706.
- Völkel, M., Krötzsch, M., Vrandecic, D., Haller, H. and Studer, R. (2006) Semantic Wikipedia. In Proceedings of the 15th international conference on World Wide Web (WWW’06), Edinburgh, Scotland, pp. 585-594.