
Named entity recognition in tweets: an experimental study

EMNLP 2011, pp. 1524–1534


Abstract

People tweet more than 100 Million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition.

Introduction
  • The authors find that classifying named entities in tweets is particularly challenging: tweets are terse, and a plethora of distinctive named entity types are present.

    Status Messages posted on Social Media websites such as Facebook and Twitter present a new and challenging style of text for language technology due to their noisy and informal nature.
  • To address these issues the authors propose a distantly supervised approach which applies LabeledLDA (Ramage et al., 2009) to leverage large amounts of unlabeled data in addition to large dictionaries of entities gathered from Freebase, and combines information about an entity’s context across its mentions.
  • By utilizing in-domain, out-of-domain, and unlabeled data the authors are able to substantially boost performance, for example obtaining a 52% increase in F1 score on segmenting named entities.
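The distant-supervision idea described above can be sketched in a few lines. Everything below is an illustrative toy, not the paper's actual resources: the dictionary stands in for Freebase entity lists, and the per-type context weights stand in for what LabeledLDA would learn. The two key moves are that the dictionary constrains which types an entity may take, and that context is pooled across all of an entity's mentions before scoring.

```python
from collections import Counter

# Stand-in for Freebase dictionaries: which types each entity string may take.
TYPE_DICT = {
    "seattle": {"GEO-LOC"},
    "kindle": {"PRODUCT", "COMPANY"},  # an ambiguous entry
}

# Stand-in for learned per-type topics: context-word weights.
TYPE_CONTEXT = {
    "GEO-LOC": Counter({"in": 2, "flying": 1, "visit": 1}),
    "PRODUCT": Counter({"reading": 2, "bought": 1, "on": 1}),
    "COMPANY": Counter({"stock": 2, "ceo": 1}),
}

def classify(entity, mentions):
    """Pool context words across all mentions of an entity, score only
    the types the dictionary allows, and return the best-scoring type."""
    pooled = Counter(w for ctx in mentions for w in ctx)
    allowed = TYPE_DICT.get(entity, set(TYPE_CONTEXT))  # unseen entity: any type
    scores = {t: sum(TYPE_CONTEXT[t][w] * c for w, c in pooled.items())
              for t in allowed}
    return max(scores, key=scores.get)
```

Pooling matters because a single 140-character mention rarely carries enough context: `classify("kindle", [["reading", "on"], ["bought"]])` resolves the PRODUCT/COMPANY ambiguity only by combining evidence from both mentions.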
Highlights
  • We find that classifying named entities in tweets is particularly challenging: tweets are terse, and they contain a plethora of distinctive named entity types

    Status Messages posted on Social Media websites such as Facebook and Twitter present a new and challenging style of text for language technology due to their noisy and informal nature
  • In contrast to previous work, we have demonstrated the utility of features based on Twitter-specific POS taggers and Shallow Parsers in segmenting Named Entities
  • In addition we take a distantly supervised approach to Named Entity Classification which exploits large dictionaries of entities gathered from Freebase, requires no manually annotated data, and as a result is able to handle a larger number of types than previous work
  • Although we found manually annotated data to be very beneficial for named entity segmentation, we were motivated to explore approaches that don’t rely on manual labels for classification due to Twitter’s wide range of named entity types
  • A plethora of distinctive named entity types are present, necessitating large amounts of training data. To address both these issues we have presented and evaluated a distantly supervised approach based on LabeledLDA, which obtains a 25% increase in F1 score over the co-training approach to Named Entity Classification suggested by Collins and Singer (1999) when applied to Twitter
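The segmentation features mentioned above can be illustrated with a minimal per-token feature extractor. The function and feature names are hypothetical, not the paper's actual feature set; the sketch only shows how Twitter-specific POS tags and capitalization cues would enter the representation fed to a sequence labeler such as a CRF.

```python
def token_features(tokens, pos_tags, i):
    """Hypothetical feature sketch for IOB entity segmentation: combine
    the token itself, its POS tag (from a tweet-trained tagger), a
    capitalization cue, and the previous tag for local context."""
    tok = tokens[i]
    return {
        "word.lower": tok.lower(),
        "pos": pos_tags[i],
        "is_capitalized": tok[:1].isupper(),
        "prev.pos": pos_tags[i - 1] if i > 0 else "BOS",  # beginning of sentence
    }
```

Because capitalization is unreliable in tweets, a real system would weight `is_capitalized` against a tweet-level reliability signal rather than trusting it directly.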
Results
  • The entity types evaluated include FACILITY, TV-SHOW, MOVIE, SPORTSTEAM, BAND, and OTHER; results are compared against a supervised baseline which applies a MaxEnt classifier using 4-fold cross validation over the 1,450 entities which were annotated for testing.
  • Note that these type annotations are only used for evaluation purposes, and not used during training.
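The F1 figures quoted throughout (e.g. the 52% segmentation improvement and 25% classification gain) are entity-level scores. A minimal sketch of the standard definition, not code from the paper, where a predicted span counts as correct only on an exact match with a gold span:

```python
def f1(pred_spans, gold_spans):
    """Entity-level F1 over sets of (start, end) spans:
    harmonic mean of precision and recall on exact matches."""
    tp = len(pred_spans & gold_spans)  # exact-match true positives
    p = tp / len(pred_spans) if pred_spans else 0.0
    r = tp / len(gold_spans) if gold_spans else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

Exact-match scoring is deliberately strict: a system that finds "Pearl" where the gold span is "Pearl Jam" gets no credit, which is why segmentation quality dominates end-to-end results.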
Conclusion
  • Recent work (Han and Baldwin, 2011) has proposed lexical normalization of tweets, which may be useful as a preprocessing step for upstream tasks like POS tagging and NER.
  • In addition the authors take a distantly supervised approach to Named Entity Classification which exploits large dictionaries of entities gathered from Freebase, requires no manually annotated data, and as a result is able to handle a larger number of types than previous work.
  • Although the authors found manually annotated data to be very beneficial for named entity segmentation, they were motivated to explore approaches that don’t rely on manual labels for classification due to Twitter’s wide range of named entity types.
  • Unlike previous work on NER in informal text, the approach allows the sharing of information across an entity’s mentions which is quite beneficial due to Twitter’s terse nature
Tables
  • Table1: Examples of noisy text in tweets
  • Table2: POS tagging performance on tweets. By training on in-domain labeled data, in addition to annotated IRC chat data, we obtain a 41% reduction in error over the Stanford POS tagger
  • Table3: Most common errors made by the Stanford POS Tagger on tweets. For each case we list the fraction of times the gold tag is misclassified as the predicted for both our system and the Stanford POS tagger. All verbs are collapsed into VB for compactness
  • Table4: Token-level accuracy at shallow parsing tweets (identifying noun phrases, verb phrases, and prepositional phrases). We compare against the OpenNLP chunker as a baseline. Accurate shallow parsing of tweets could benefit several applications such as Information Extraction and Named Entity Recognition. (The paper also builds a capitalization classifier, T-CAP, which predicts whether or not a tweet is informatively capitalized.)
  • Table5: Performance at predicting reliable capitalization
  • Table6: Performance at segmenting entities varying the features used. “None” removes POS, Chunk, and capitalization features. Overall we obtain a 52% improvement. In general, news-trained Named Entity Recognizers seem to rely heavily on capitalization, which we know to be unreliable in tweets
  • Table7: Example type lists produced by LabeledLDA. No entities which are shown were found in Freebase; these are typically either too new to have been added, or are misspelled/abbreviated (for example rhobh=”Real Housewives of Beverly Hills”). In a few cases there are segmentation errors
  • Table8: Named Entity Classification performance on the 10 types. Assumes segmentation is given as in (Collins and Singer, 1999), and (Elsner et al., 2009)
  • Table9: F1 classification scores for the 3 MUC types PERSON, LOCATION, ORGANIZATION. Results are shown using LabeledLDA (LL), Freebase Baseline (FB), DL-Cotrain (CT) and Supervised Baseline (SP). N is the number of entities in the test set
  • Table10: F1 scores for classification broken down by type for LabeledLDA (LL), Freebase Baseline (FB), DLCotrain (CT) and Supervised Baseline (SP). N is the number of entities in the test set
  • Table11: Comparing LabeledLDA and DL-Cotrain grouping unlabeled data by entities vs. mentions
  • Table12: Performance at predicting both segmentation and classification. Systems labeled with PLO are evaluated on the 3 MUC types PERSON, LOCATION, ORGANIZATION
Related work
    There has been relatively little previous work on building NLP tools for Twitter or similar text styles. Locke and Martin (2009) train a classifier to recognize named entities based on annotated Twitter data, handling the types PERSON, LOCATION, and ORGANIZATION. Developed in parallel to our work, Liu et al. (2011) investigate NER on the same 3 types, in addition to PRODUCT, and present a semi-supervised approach using k-nearest neighbor. Also developed in parallel, Gimpel et al. (2011) build a POS tagger for tweets using 20 coarse-grained tags. Benson et al. (2011) present a system which extracts artists and venues associated with musical performances. Recent work (Han and Baldwin, 2011) has proposed lexical normalization of tweets.

    As with DL-Cotrain-entity, using topic models (e.g. LabeledLDA) for classifying named entities has a similar effect, in that information about an entity’s distribution of possible types is shared across its mentions.
Funding
  • The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition
  • Proposes a distantly supervised approach which applies LabeledLDA to leverage large amounts of unlabeled data in addition to large dictionaries of entities gathered from Freebase, and combines information about an entity’s context across its mentions
  • Evaluates the performance of off-the-shelf news trained NLP tools when applied to Twitter
  • Introduces a novel approach to distant supervision using Topic Models
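The rebuilt pipeline summarized above (POS tagging, through chunking, to NER) can be sketched as a composition of stages. The stage implementations below are illustrative stubs, not the paper's actual T-POS, T-CHUNK, or T-NER models; the point is only the data flow, where each stage's output feeds the next.

```python
def pos_tag(tokens):
    # Stub tagger: a real system is trained on annotated tweets;
    # here we just call capitalized tokens proper nouns (NNP).
    return ["NNP" if t[:1].isupper() else "NN" for t in tokens]

def chunk(tags):
    # Stub chunker: group maximal runs of NNP tokens into spans.
    spans, start = [], None
    for i, tag in enumerate(tags + ["NN"]):  # sentinel flushes a trailing run
        if tag == "NNP" and start is None:
            start = i
        elif tag != "NNP" and start is not None:
            spans.append((start, i))
            start = None
    return spans

def ner(tokens, chunks):
    # Stub recognizer: treat each chunk as a candidate named entity.
    return [" ".join(tokens[a:b]) for a, b in chunks]

def pipeline(tokens):
    tags = pos_tag(tokens)
    return ner(tokens, chunk(tags))
```

In the paper each stage is a learned model adapted to tweets, since errors at tagging and chunking compound downstream; the stubs only mirror the staged structure.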
Reference
  • Edward Benson, Aria Haghighi, and Regina Barzilay. 2011. Event discovery in social media feeds. In The 49th Annual Meeting of the Association for Computational Linguistics, Portland, Oregon, USA.
  • David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. J. Mach. Learn. Res.
  • Avrim Blum and Tom M. Mitchell. 1998. Combining labeled and unlabeled data with co-training. In COLT, pages 92–100.
  • Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Comput. Linguist.
  • Andrew Carlson, Justin Betteridge, Richard C. Wang, Estevam R. Hruschka, Jr., and Tom M. Mitchell. 2010. Coupled semi-supervised learning for information extraction. In Proceedings of the third ACM international conference on Web search and data mining, WSDM ’10.
  • Eugene Charniak, Curtis Hendrickson, Neil Jacobson, and Mike Perkowitz. 1993. Equations for part-of-speech tagging. In AAAI, pages 784–789.
  • Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In Empirical Methods in Natural Language Processing.
  • Doug Downey, Matthew Broadhead, and Oren Etzioni. 2007. Locating complex named entities in web text. In Proceedings of the 20th international joint conference on Artificial intelligence.
  • Doug Downey, Oren Etzioni, and Stephen Soderland. 2010. Analysis of a probabilistic model of redundancy in unsupervised information extraction. Artif. Intell., 174(11):726–748.
  • Micha Elsner, Eugene Charniak, and Mark Johnson. 2009. Structured generative models for unsupervised named-entity clustering. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL ’09.
  • Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell.
  • Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05.
  • Radu Florian. 2002. Named entity recognition as a house of cards: classifier stacking. In Proceedings of the 6th conference on Natural language learning - Volume 20, COLING-02.
  • Eric N. Forsyth and Craig H. Martell. 2007. Lexical and discourse analysis of online chat dialog.
  • Tim Finin, Will Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, and Mark Dredze. 2010. Annotating named entities in Twitter data with crowdsourcing. In Proceedings of the NAACL Workshop on Creating Speech and Text Language Data With Amazon’s Mechanical Turk.
  • Xiaohua Liu, Shaodian Zhang, Furu Wei, and Ming Zhou. 2011. Recognizing named entities in tweets. In ACL.
  • Brian Locke and James Martin. 2009. Named entity recognition: Adapting to microblogging. Senior Thesis, University of Colorado.
  • Mitchell P. Marcus, Beatrice Santorini, and Mary A. Marcinkiewicz. 1994. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics.
  • Charles Sutton. 2004. Collective segmentation and labeling of distant entities in information extraction.
  • Partha Pratim Talukdar and Fernando Pereira. 2010. Experiments in graph-based semi-supervised learning methods for class-instance acquisition. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1473–1481.
  • Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.
  • Tara McIntosh. 2010. Unsupervised discovery of negative categories in lexicon bootstrapping. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10.
  • Einat Minkov, Richard C. Wang, and William W. Cohen. 2005. Extracting personal names from email: applying named entity recognition to informal text. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT ’05, pages 443–450, Morristown, NJ, USA. Association for Computational Linguistics.
  • Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL-IJCNLP 2009.
  • Christoph Muller and Michael Strube. 2006. Multi-level annotation of linguistic data with MMAX2. In Sabine Braun, Kurt Kohn, and Joybrato Mukherjee, editors, Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods, pages 197–214. Peter Lang, Frankfurt a.M., Germany.
  • Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. 2009. Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, EMNLP ’09, pages 248–256, Morristown, NJ, USA. Association for Computational Linguistics.
  • Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the conll-2000 shared task: chunking. In Proceedings of the 2nd workshop on Learning language in logic and the 4th conference on Computational natural language learning - Volume 7, ConLL ’00.
  • Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03.
  • Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10.
  • Limin Yao, David Mimno, and Andrew McCallum. 2009. Efficient methods for topic model inference on streaming document collections. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining.
  • David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd annual meeting on Association for Computational Linguistics, ACL ’95.
  • Christina Sauper, Aria Haghighi, and Regina Barzilay. 2010. Incorporating content structure into text analysis applications. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 377–387, Morristown, NJ, USA. Association for Computational Linguistics.
  • Fei Sha and Fernando Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology.
  • Sameer Singh, Dustin Hillard, and Chris Leggetter. 2010. Minimally-supervised extraction of entities from text advertisements. In Human Language Technologies: Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT).