The tower of Babel meets web 2.0: user-generated content and its applications in a multilingual context

CHI, pp. 291-300, 2010.

Keywords:
world knowledge, knowledge diversity, knowledge representation, significant influence, diversity present

Abstract:

This study explores language's fragmenting effect on user-generated content by examining the diversity of knowledge representations across 25 different Wikipedia language editions. This diversity is measured at two levels: the concepts that are included in each edition and the ways in which these concepts are described. We demonstrate tha...

Introduction
  • A founding principle of Wikipedia was to encourage consensus around a single neutral point of view [18].
  • Consensus building around a single neutral point of view has been fractured as a result of the Wikimedia Foundation setting up over 250 separate language editions as of this writing.
  • It is the goal of this research to illustrate the splintering effect of this “Web 2.0 Tower of Babel” and to explicate the positive and negative implications for HCI and AI-based applications that interact with or use Wikipedia data.
  • A few papers [22, 27] attempted to add more interlanguage links between Wikipedias, a topic the authors cover in detail below.
Highlights
  • Our empirical results suggest that the common encyclopedic core is a minuscule number of concepts and that sub-conceptual knowledge diversity is much greater than one might initially think—drawing a stark contrast with the global consensus hypothesis. In the latter half of this paper, we show how this knowledge diversity can affect core technologies such as information retrieval systems that rely upon Wikipedia-based semantic relatedness measures
  • A Wikipedia-based semantic relatedness measure generates very different values depending on the culture whose world knowledge is used
  • Throughout this paper we aimed to examine the veracity of the global consensus hypothesis, quantify the degree of knowledge diversity present across various Wikipedia language editions, and demonstrate the influence this knowledge diversity may have on technology
  • It is possible that information arbitrage would be of little utility for a portion of the Wikipedia articles that only exist in a single language
  • In this paper, we have provided four key contributions: (1) we have shown that knowledge diversity across Wikipedias is large and defined its extent, (2) we have demonstrated that this diversity has a significant effect on technologies, (3) the first census of the effect of language on UGC repositories was executed, and (4) we have discussed design implications of these findings while introducing the ideas of culturally-aware applications and hyperlingual applications
Methods
  • The authors' experiment on sub-concept diversity borrows from [1] the idea of using outlinks, or links in one article pointing to another article, as a “highly focused entity-based representation of [natural language].” In other words, outlinks provide a decent structured, canonical, language-independent summary of raw text.
  • Operating under this assumption, the authors compared the outlinks of each of the “global concepts” to determine the degree to which the articles covered the same content.
  • Due to the unavailability of standardized tools for certain languages in this study, the authors could only make fair comparisons between ESA implementations based on the following ten languages: Spanish, Hungarian, Norwegian, Portuguese, Romanian, English, German, French, Italian, and Danish.
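The outlink comparison described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the title-to-concept mapping, and the sample outlinks are all hypothetical, and the directional ratio used here is an assumption, since this summary does not state the exact overlap statistic the paper computes.

```python
from typing import Dict, Set


def to_concepts(outlinks: Set[str], title_to_concept: Dict[str, str]) -> Set[str]:
    """Map language-specific outlink titles to language-independent concept IDs.

    Titles with no known mapping (e.g. no interlanguage link) are dropped.
    """
    return {title_to_concept[t] for t in outlinks if t in title_to_concept}


def outlink_overlap(a: Set[str], b: Set[str]) -> float:
    """Fraction of the concepts linked from article `a` that article `b` also links to."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Hypothetical mapping from article titles to shared concept IDs.
es_map = {"Argentina": "c_argentina", "México": "c_mexico",
          "Sigmund Freud": "c_freud", "Cognición": "c_cognition"}
de_map = {"Sigmund Freud": "c_freud", "Kognition": "c_cognition"}

# Hypothetical outlinks of the "Psychology" article in two editions.
es_concepts = to_concepts({"Argentina", "México", "Sigmund Freud", "Cognición"}, es_map)
de_concepts = to_concepts({"Sigmund Freud", "Kognition"}, de_map)

es_in_de = outlink_overlap(es_concepts, de_concepts)  # share of Spanish outlinks also in German
de_in_es = outlink_overlap(de_concepts, es_concepts)  # share of German outlinks also in Spanish
```

The measure is deliberately asymmetric: an article that links to extra regional content (as in the Spanish example) lowers the overlap in one direction without affecting the other.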
Results
  • The authors' results (Figure 2, Table 3) demonstrate that a surprisingly small amount of concept overlap exists between languages of Wikipedia, refuting the global consensus assumption at the concept level.
  • In the case of the concept that is called “Psychology” in English, for example, the Spanish article (“Psicología”) contains many outlinks to Latin American countries not contained in the German article (“Psychologie”).
  • These links come from a section in the “Psicología” page about Latin America’s contribution to psychology.
  • When the concept pair being compared was C1 = “1945” and C2 = “1947”, the ESA systems for all languages returned relatively high semantic relatedness values.
  • These words occur frequently together in Wikipedia articles, regardless of the language.
  • Many pairs, such as “DVD” / “Djibouti”, are not related in any language.
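One way to quantify how far language-specific ESA systems agree is to correlate the relatedness values they assign to the same concept pairs. The sketch below is illustrative only: the relatedness scores are invented, and the use of Pearson's coefficient is an assumption, because this summary does not say which correlation coefficient the paper reports.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Hypothetical SR scores for the same ordered list of concept pairs,
# e.g. ("1945", "1947"), ("DVD", "Djibouti"), and one culturally variable pair.
sr_by_language: Dict[str, Sequence[float]] = {
    "en": [0.90, 0.02, 0.40],
    "de": [0.85, 0.01, 0.10],
    "es": [0.88, 0.03, 0.70],
}

# Correlation for every unordered pair of languages.
correlations: Dict[Tuple[str, str], float] = {
    (a, b): pearson(sr_by_language[a], sr_by_language[b])
    for a, b in combinations(sorted(sr_by_language), 2)
}
```

Pairs that every edition scores alike (high for “1945” / “1947”, low for “DVD” / “Djibouti”) push the correlation up; pairs whose relatedness depends on the edition pull it down.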
Conclusion
  • Throughout this paper the authors aimed to examine the veracity of the global consensus hypothesis, quantify the degree of knowledge diversity present across various Wikipedia language editions, and demonstrate the influence this knowledge diversity may have on technology.
  • For researchers in HCI, AI and NLP, the rejection of the global consensus hypothesis has important implications for technologies that operate on Wikipedia directly.
  • It is possible that information arbitrage would be of little utility for a portion of the Wikipedia articles that only exist in a single language.
  • In this paper, the authors have provided four key contributions: (1) they have shown that knowledge diversity across Wikipedias is large and defined its extent, (2) they have demonstrated that this diversity has a significant effect on technologies, (3) they have carried out the first census of the effect of language on UGC repositories, and (4) they have discussed design implications of these findings while introducing the ideas of culturally-aware applications and hyperlingual applications.
  • Moving forward, the authors hope this work will inform and inspire a new generation of multilingual Wikipedia applications.
Tables
  • Table 1: A brief overview of the size of some of the 25 language editions in our study. Other languages included are: Czech, Danish, Finnish, Hungarian, Indonesian, Korean, Polish, Portuguese, Romanian, Slovak, Swedish, Turkish, and Ukrainian. The median number of articles was 225,370.
  • Table 2: Results from our evaluation of CONCEPTUALIGN and the interlanguage links that power it.
  • Table 3: Pairwise conceptual coverage overlap. Each cell represents the ratio of concepts in the column’s language edition covered by the row’s language edition.
  • Table 4: Example “global” concepts (n = 25). Others include “Britney Spears”, “Periodic Table”, and “Milk”.
  • Table 5: Correlation coefficients between SR values generated by ESA systems based on different languages. This is a subset of the 10x10 matrix generated by the study.
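The cell definition given for Table 3 can be made concrete with a short sketch. The concept sets below are invented for illustration, and `coverage_matrix` is a hypothetical helper name rather than anything taken from the paper; the only thing drawn from the caption is the ratio itself, that is, the share of the column edition's concepts that the row edition also covers.

```python
from typing import Dict, Set


def coverage_matrix(concepts: Dict[str, Set[str]]) -> Dict[str, Dict[str, float]]:
    """Return matrix[row][col] = |concepts[row] & concepts[col]| / |concepts[col]|."""
    matrix: Dict[str, Dict[str, float]] = {}
    for row, row_set in concepts.items():
        matrix[row] = {}
        for col, col_set in concepts.items():
            # An empty column edition has nothing to cover; report 0.0 by convention.
            matrix[row][col] = len(row_set & col_set) / len(col_set) if col_set else 0.0
    return matrix


# Hypothetical language-independent concept IDs per edition.
editions = {
    "en": {"c1", "c2", "c3", "c4"},
    "de": {"c2", "c3", "c4", "c5", "c6"},
}

table = coverage_matrix(editions)
en_covers_de = table["en"]["de"]  # share of German concepts present in English
de_covers_en = table["de"]["en"]  # share of English concepts present in German
```

Because the denominator is the column edition's size, the matrix is not symmetric: a large edition can cover most of a small one while the small one covers little of the large one.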
Funding
  • This work was supported in part by National Science Foundation grant #0705901 and the Robert and Kaye Hiatt fund
Reference
  • Adafre, S.F. and de Rijke, M. (2006). Finding Similar Sentences across Multiple Languages in Wikipedia. EACL 2006 Workshop on New Text, Wikis and Blogs and Other Dynamic Text Sources, 62-69.
  • Adar, E., Skinner, M. and Weld, D.S. (2009). Information Arbitrage Across Multi-lingual Wikipedia. WSDM '09, 94-103.
  • Bergstrom, T. and Karahalios, K. (2009). Conversation clusters: grouping conversation topics through human-computer dialog. CHI '09, 2349-2352.
  • Bolikowski, Ł. (2009). Scale-free topology of the interlanguage links in Wikipedia. http://arxiv.org/abs/0904.0564.
  • Budanitsky, A. and Hirst, G. (2006). Evaluating WordNet-based Measures of Lexical Semantic Relatedness. Computational Linguistics, 32 (1). 13-47.
  • Burke, M. and Kraut, R. (2008). Mopping Up: Modeling Wikipedia Promotion Decisions. CSCW '08, 27-36.
  • Callahan, E. and Herring, S.C. (2009). Cultural Bias in Wikipedia Content on Famous Persons. AoIR 10.0.
  • Cimiano, P., Schultz, A., Sizov, S., Sorg, P. and Staab, S. (2009). Explicit Versus Latent Concept Models for Cross-Language Information Retrieval. IJCAI '09, 1513-1518.
  • Erdmann, M., Nakayama, K., Hara, T. and Nishio, S. (2008). A Bilingual Dictionary Extracted from the Wikipedia Link Structure. DASFAA '08, 686-689.
  • Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G. and Ruppin, E. (2002). Placing Search in Context: The Concept Revisited. ACM Transactions on Information Systems, 20 (1). 116-131.
  • Gabrilovich, E. and Markovitch, S. (2007). Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. IJCAI '07, 1606-1611.
  • Gabrilovich, E. and Markovitch, S. (2009). Wikipedia-based Semantic Interpretation for Natural Language Processing. Journal of Artificial Intelligence Research (JAIR), 34. 443-498.
  • Hassan, S. and Mihalcea, R. (2009). Cross-lingual Semantic Relatedness Using Encyclopedic Knowledge. EMNLP '09, 1192-1201.
  • Hecht, B. and Gergle, D. (2009). Measuring Self-Focus Bias in Community-Maintained Knowledge Repositories. Communities & Technologies 2009, 11-21.
  • Hecht, B. and Raubal, M. (2008). GeoSR: Geographically explore semantic relations in world knowledge. AGILE '08: International Conference on Geographic Information Science, 95-114.
  • Kittur, A., Chi, E., Pendleton, B.A., Suh, B. and Mytkowicz, T. (2007). Power of the Few vs. Wisdom of the Crowd: Wikipedia and the Rise of the Bourgeoisie. CHI '07, 1-9.
  • Kittur, A. and Kraut, R. (2008). Harnessing the Wisdom of Crowds in Wikipedia: Quality Through Coordination. CSCW '08, 37-46.
  • Lih, A. (2009). The Wikipedia Revolution: How a Bunch of Nobodies Created the World's Greatest Encyclopedia. Hyperion.
  • Miller, G.A. and Charles, W.G. (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6 (1). 1-28.
  • Milne, D. and Witten, I.H. (2008). Learning to Link with Wikipedia. CIKM '08, 1046-1055.
  • Muller, M.J. (2007). Comparing tagging vocabularies among four enterprise tag-based services. GROUP '07, 341-350.
  • Oh, J.-H., Kawahara, D., Uchimoto, K., Kazama, J.i. and Torisawa, K. (2008). Enriching Multilingual Language Resources by Discovering Missing Cross-Language Links in Wikipedia. WI-IAT 2008, 322-328.
  • Ortega, F., Gonzalez-Barahona, J.M. and Robles, G. (2008). On The Inequality of Contributions to Wikipedia. HICSS '08, 304-311.
  • Pedersen, T., Pakhomov, S.V.S., Patwardhan, S. and Chute, C.G. (2007). Measures of semantic similarity and relatedness in the biomedical domain. Journal of Biomedical Informatics, 2007 (40). 288-299.
  • Potthast, M., Stein, B. and Anderka, M. (2008). A Wikipedia-Based Multilingual Retrieval Model. ECIR '08, 522-530.
  • Priedhorsky, R., Chen, J., Lam, S.T., Panciera, K., Terveen, L.G. and Riedl, J. (2007). Creating, Destroying, and Restoring Value in Wikipedia. GROUP '07.
  • Sorg, P. and Cimiano, P. (2008). Enriching the Crosslingual Link Structure of Wikipedia - A Classification-based Approach. WIKI-AI '08.
  • Weld, D.S., Wu, F., Adar, E., Amershi, S., Fogarty, J., Hoffman, R., Patel, K. and Skinner, M. (2008). Intelligence in Wikipedia. AAAI '08.
  • Yamashita, N., Inaba, R., Kuzuoka, H. and Ishida, T. (2009). Difficulties in establishing common ground in multiparty groups using machine translation. CHI '09, 679-688.
  • Zesch, T., Müller, C. and Gurevych, I. (2008). Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary. LREC '08, 1646-1652.
Best Paper
Best Paper of CHI, 2010