SemTag and Seeker: bootstrapping the semantic web via automated semantic annotation

    WWW, pp. 178-186, 2003.

    Keywords:
    automated semantic annotation, semantic tag, seeker platform, million annotation, million web page

    Abstract:

    This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages, and generate approximately 434 million automatically disambiguated semantic tags, publish...

    Introduction
    • The WWW has had a tremendous impact on society and business in just a few years by making information instantly and ubiquitously available
    • During this transition from physical to electronic means for information transport, the content and encoding of information has remained natural language.
    • Today, this is perhaps the most significant obstacle to streamlining business processes via the web.
    • Of the two classes of meta-data that the semantic web requires, the second is the large-scale availability of annotations within documents encoding canonical references to mentioned entities
    Highlights
    • Where will the data come from? For the semantic web vision to come to fruition, two classes of meta-data must become extensive and pervasive
    • Because there are several locations in TAP that may be appropriate for a particular entry, the tool checks to see if Taxonomy Based Disambiguation suggested that the spot belongs elsewhere—if so, the tool asks whether the algorithm’s output is a valid answer
    • The purpose of this paper is to describe an approach to large-scale automated centralized semantic tagging delivered to consumers through a label bureau
    • We focus immediately on the most common category of annotators, in which the entity type is the page, and the annotator performs some local processing on each web page, and writes back results to the store in the form of an annotation
    • It is our goal to provide a tagging of the web as a label bureau
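The per-page annotator pattern described above (local processing on each page, results written back to the store as annotations) can be sketched as follows. This is a minimal illustration, not Seeker's actual API: the `Store`, `Annotation`, `run_page_annotator`, and `spot_labels` names are all hypothetical, and the in-memory store and naive substring matching are assumptions made only to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Annotation:
    """One result written back to the store (hypothetical shape)."""
    page_url: str
    annotator: str
    payload: dict


@dataclass
class Store:
    """Toy in-memory stand-in for a page/annotation store (an assumption)."""
    pages: Dict[str, str]
    annotations: List[Annotation] = field(default_factory=list)

    def write(self, annotation: Annotation) -> None:
        self.annotations.append(annotation)


def run_page_annotator(store: Store, name: str,
                       process: Callable[[str], List[dict]]) -> None:
    """Apply a local, page-at-a-time function and write back its results."""
    for url, text in store.pages.items():
        for payload in process(text):
            store.write(Annotation(url, name, payload))


def spot_labels(labels: List[str]) -> Callable[[str], List[dict]]:
    """Build a page processor that records each occurrence ("spot") of a label."""
    def process(text: str) -> List[dict]:
        lowered = text.lower()
        spots = []
        for label in labels:
            needle = label.lower()
            start = lowered.find(needle)
            while start != -1:
                spots.append({"label": label, "offset": start})
                start = lowered.find(needle, start + 1)
        return spots
    return process


store = Store(pages={"http://example.org/a": "Jaguar cars and jaguar cats"})
run_page_annotator(store, "spotter", spot_labels(["Jaguar"]))
```

Because each page is processed independently, an annotator of this kind can be run over a large corpus in parallel; the sketch runs sequentially only for brevity.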
    Methods
    • The authors first dumped context surrounding each spot
    • The authors processed those contexts as follows: Lexicon generation: The authors built a collection of 1.4 million unique words occurring in a random subset of windows containing approximately 90 million total words.
    • The authors created a final lexicon of 200,000 words from the 1.4 million unique words by taking the 200,100 most frequent words and then removing the 100 most frequent of those.
    • The authors experimented with several standard similarity measures; the results are given in Section 4.2
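The lexicon step above (keep the most frequent words, then discard the very top of the ranking) and a context comparison can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, raw term counts stand in for whichever vector weighting schemes the paper evaluates, and cosine similarity is used only as one example of a standard similarity measure, not necessarily the one the authors selected.

```python
import math
from collections import Counter
from typing import Iterable, List, Set


def build_lexicon(windows: Iterable[List[str]],
                  keep: int = 200_000, drop_top: int = 100) -> Set[str]:
    """Take the (keep + drop_top) most frequent words, then drop the top drop_top."""
    counts = Counter(word for window in windows for word in window)
    ranked = [word for word, _ in counts.most_common(keep + drop_top)]
    return set(ranked[drop_top:])


def context_vector(window: List[str], lexicon: Set[str]) -> Counter:
    """Sparse term-count vector over lexicon words (weighting is an assumption)."""
    return Counter(word for word in window if word in lexicon)


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Tiny example: with keep=2 and drop_top=1, the single most frequent word is removed.
windows = [["the", "the", "the", "cat", "cat", "sat"]]
lexicon = build_lexicon(windows, keep=2, drop_top=1)
same = cosine(context_vector(["cat", "sat"], lexicon),
              context_vector(["cat", "sat"], lexicon))
disjoint = cosine(context_vector(["cat"], lexicon),
                  context_vector(["sat"], lexicon))
```

Dropping the most frequent words acts as a simple stop-word filter, so the context vectors are dominated by words that help distinguish one sense of a label from another.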
    Results
    • The authors implemented the SemTag algorithm described above, and applied it to a set of 264 million pages producing 270G of dump data corresponding to 550 million labels in context.
    Conclusion
    • The authors believe that automated tagging is essential to bootstrap the Semantic Web. As the results of the experiments with SemTag show, it is possible to achieve interestingly high levels of accuracy even with relatively simple approaches to disambiguation.
    • In the future, the authors expect that there will be many different approaches and algorithms to automated tagging.
    • The goal is to provide a tagging of the web as a label bureau.
    • The authors would like to provide Seeker as a public service for the research community to try various experimental approaches for automated tagging.
    Tables
    • Table 1: Accuracy (probability of correctness) for each algorithm under each vector weighting scheme over test set
    • Table 2: Nodes of TAP with percentage of spots occurring in corresponding subtree
    Best Paper
    Best Paper of WWW, 2003