Named entity recognition in tweets: an experimental study
EMNLP, pp. 1524–1534 (2011)
People tweet more than 100 million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition.
Status Messages posted on Social Media websites such as Facebook and Twitter present a new and challenging style of text for language technology due to their noisy and informal nature.
- © 2011 Association for Computational Linguistics
- To address these issues the authors propose a distantly supervised approach which applies LabeledLDA (Ramage et al., 2009) to leverage large amounts of unlabeled data in addition to large dictionaries of entities gathered from Freebase, and combines information about an entity’s context across its mentions.
- By utilizing in-domain, out-of-domain, and unlabeled data the authors are able to substantially boost performance, for example obtaining a 52% increase in F1 score on segmenting named entities.
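The distantly supervised classification described above can be sketched in miniature. The code below is a much-simplified stand-in for LabeledLDA: a toy dictionary plays the role of the Freebase entity lists, hand-fixed type-word weights stand in for the word distributions LabeledLDA would learn, and an entity's context is pooled across its mentions before scoring. All entity strings, words, and counts are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy Freebase-style dictionary: entity string -> candidate types.
# (In the paper these lists are mined from Freebase; everything here
# is invented for illustration.)
FREEBASE = {
    "seattle": {"GEO-LOC", "SPORTSTEAM"},
    "yankees": {"SPORTSTEAM"},
}

# Hand-fixed type-indicative word weights. LabeledLDA *learns* these
# distributions from unlabeled tweets; we hard-code them to keep the
# sketch short.
TYPE_WORDS = {
    "GEO-LOC": Counter({"in": 2, "weather": 3, "visit": 2}),
    "SPORTSTEAM": Counter({"game": 3, "won": 3, "beat": 2}),
}

def classify(mentions):
    """Pool context words across all mentions of each entity, then score
    only the candidate types the dictionary allows for that entity."""
    pooled = defaultdict(Counter)
    for entity, context in mentions:
        pooled[entity].update(context)
    labels = {}
    for entity, ctx in pooled.items():
        candidates = FREEBASE.get(entity, set(TYPE_WORDS))  # unseen -> any type
        labels[entity] = max(
            candidates,
            key=lambda t: sum(TYPE_WORDS[t][w] * n for w, n in ctx.items()),
        )
    return labels

mentions = [
    ("seattle", ["game", "won"]),       # one mention looks sports-like...
    ("seattle", ["weather", "in"]),     # ...but evidence pooled across
    ("seattle", ["visit", "weather"]),  # mentions favors GEO-LOC
]
print(classify(mentions))  # → {'seattle': 'GEO-LOC'}
```

Pooling before scoring is the point: a single terse mention is often ambiguous, but an entity's aggregate context usually is not.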
- We find that classifying named entities in tweets is challenging: a plethora of distinctive named entity types are present, and Twitter’s terse nature often leaves individual mentions short on context
- In contrast to previous work, we have demonstrated the utility of features based on Twitter-specific POS taggers and Shallow Parsers in segmenting Named Entities
- In addition, we take a distantly supervised approach to Named Entity Classification which exploits large dictionaries of entities gathered from Freebase, requires no manually annotated data, and as a result is able to handle a larger number of types than previous work
- While we found manually annotated data to be very beneficial for named entity segmentation, we were motivated to explore approaches that don’t rely on manual labels for classification due to Twitter’s wide range of named entity types
- A plethora of distinctive named entity types are present, necessitating large amounts of training data. To address both these issues we have presented and evaluated a distantly supervised approach based on LabeledLDA, which obtains a 25% increase in F1 score over the co-training approach to Named Entity Classification suggested by Collins and Singer (1999) when applied to Twitter
- The entity types evaluated include FACILITY, TV-SHOW, MOVIE, SPORTSTEAM, BAND, and OTHER. Note that these type annotations are only used for evaluation purposes, and not used during training.
- We compare against a supervised baseline which applies a MaxEnt classifier using 4-fold cross validation over the 1,450 entities which were annotated for testing.
- Recent work (Han and Baldwin, 2011) has proposed lexical normalization of tweets, which may be useful as a preprocessing step for upstream tasks like POS tagging and NER.
- Unlike previous work on NER in informal text, the approach allows the sharing of information across an entity’s mentions, which is quite beneficial due to Twitter’s terse nature
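The supervised comparison point above is a MaxEnt classifier evaluated with 4-fold cross validation over the annotated entities. A toy version of that protocol is sketched below, with a majority-class classifier standing in for the MaxEnt model (whose features are omitted); all entity strings are made up.

```python
from collections import Counter

# Toy annotated (entity, gold type) pairs; stand-ins for the 1,450
# annotated test entities mentioned above. All strings are invented.
data = [("texas", "GEO-LOC"), ("idaho", "GEO-LOC"), ("iowa", "GEO-LOC"),
        ("boston", "GEO-LOC"), ("reno", "GEO-LOC"), ("fresno", "GEO-LOC"),
        ("kinect", "PRODUCT"), ("ipad", "PRODUCT")]

def four_fold_accuracy(examples):
    """4-fold cross validation; a majority-class baseline stands in for
    the MaxEnt classifier (its features are omitted here)."""
    correct = 0
    for k in range(4):
        test = examples[k::4]                                  # held-out fold
        train = [e for i, e in enumerate(examples) if i % 4 != k]
        majority = Counter(t for _, t in train).most_common(1)[0][0]
        correct += sum(1 for _, gold in test if gold == majority)
    return correct / len(examples)

print(four_fold_accuracy(data))  # → 0.75
```

Each entity is scored exactly once as held-out data, so every annotation contributes to both training and evaluation across the four folds.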
- Table 1: Examples of noisy text in tweets
- Table 2: POS tagging performance on tweets. By training on in-domain labeled data, in addition to annotated IRC chat data, we obtain a 41% reduction in error over the Stanford POS tagger
- Table 3: Most common errors made by the Stanford POS Tagger on tweets. For each case we list the fraction of times the gold tag is misclassified as the predicted tag for both our system and the Stanford POS tagger. All verbs are collapsed into VB for compactness
- Table 4: Token-level accuracy at shallow parsing tweets. We compare against the OpenNLP chunker as a baseline
- Because capitalization in tweets is not always informative, we build a capitalization classifier, T-CAP, which predicts whether or not a tweet is informatively capitalized
- Accurate shallow parsing of tweets, which identifies units such as verb phrases and prepositional phrases in text, could benefit several applications such as Information Extraction and Named Entity Recognition
- Table 5: Performance at predicting reliable capitalization
- Table 6: Performance at segmenting entities, varying the features used. “None” removes POS, Chunk, and capitalization features. Overall we obtain a 52% improvement
- In general, news-trained Named Entity Recognizers seem to rely heavily on capitalization, which we know to be unreliable in tweets
- Table 7: Example type lists produced by LabeledLDA. None of the entities shown were found in Freebase; these are typically either too new to have been added, or are misspelled/abbreviated (for example rhobh = “Real Housewives of Beverly Hills”). In a few cases there are segmentation errors
- Table 8: Named Entity Classification performance on the 10 types. Assumes segmentation is given, as in (Collins and Singer, 1999) and (Elsner et al., 2009)
- Table 9: F1 classification scores for the 3 MUC types PERSON, LOCATION, ORGANIZATION. Results are shown using LabeledLDA (LL), Freebase Baseline (FB), DL-Cotrain (CT) and Supervised Baseline (SP). N is the number of entities in the test set
- Table 10: F1 scores for classification broken down by type for LabeledLDA (LL), Freebase Baseline (FB), DL-Cotrain (CT) and Supervised Baseline (SP). N is the number of entities in the test set
- Table 11: Comparing LabeledLDA and DL-Cotrain grouping unlabeled data by entities vs. mentions
- Table 12: Performance at predicting both segmentation and classification. Systems labeled with PLO are evaluated on the 3 MUC types PERSON, LOCATION, ORGANIZATION
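A capitalization classifier such as T-CAP (Tables 5–6) plausibly consumes simple surface statistics of a tweet. The features below are our guesses for illustration, not the paper's actual feature set.

```python
def cap_features(tweet):
    """Illustrative surface features for a T-CAP-style classifier.
    These are assumed features, not the ones used in the paper."""
    words = tweet.split()
    cap = [w for w in words if w[:1].isupper()]
    return {
        # Fraction of words starting with an uppercase letter.
        "frac_capitalized": len(cap) / max(len(words), 1),
        # No capitalization at all is itself informative.
        "all_lower": not cap,
        # Shouty ALL-CAPS words (length > 1 to skip "I").
        "all_caps_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "first_word_capitalized": bool(words) and words[0][:1].isupper(),
    }

print(cap_features("OMG yes Justin Bieber is in Seattle"))
```

A tweet whose only capitalized tokens are sentence-initial or ALL-CAPS interjections is likely uninformatively capitalized, which is what such a classifier would learn to detect.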
There has been relatively little previous work on building NLP tools for Twitter or similar text styles. Locke and Martin (2009) train a classifier to recognize named entities based on annotated Twitter data, handling the types PERSON, LOCATION, and ORGANIZATION. Developed in parallel to our work, Liu et al. (2011) investigate NER on the same 3 types, in addition to PRODUCTs, and present a semi-supervised approach using k-nearest neighbor. Also developed in parallel, Gimpel et al. (2011) build a POS tagger for tweets using 20 coarse-grained tags. Benson et al. (2011) present a system which extracts artists and venues associated with musical performances. Using topic models (e.g. LabeledLDA) for classifying named entities has a similar effect, in that information about an entity’s distribution of possible types is shared across its mentions. Recent work (Han and Baldwin, 2011) has proposed lexical normalization of tweets, which may be useful as a preprocessing step for upstream tasks like POS tagging and NER.
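The pipeline the paper re-builds (POS tagging, then chunking, then NER) can be caricatured end-to-end. Every rule and lexicon entry below is invented for illustration; the actual components are statistical models trained on annotated tweets, not dictionary lookups.

```python
# A caricature of the re-built pipeline: tokenize -> POS tag -> segment
# named entities. All lexicon entries are invented.

LEXICON = {"yess": "UH", "2morrow": "RB", "at": "IN", "the": "DT"}

def tokenize(tweet):
    return tweet.split()

def pos_tag(tokens):
    # Dictionary lookup with a crude capitalization back-off.
    return [(t, LEXICON.get(t, "NNP" if t[:1].isupper() else "NN"))
            for t in tokens]

def segment_entities(tagged):
    """Group maximal runs of NNP tokens into candidate named entities."""
    entities, run = [], []
    for tok, tag in tagged + [("", "")]:  # sentinel flushes the last run
        if tag == "NNP":
            run.append(tok)
        elif run:
            entities.append(" ".join(run))
            run = []
    return entities

print(segment_entities(pos_tag(tokenize("yess at the Ritz Carlton 2morrow"))))
# → ['Ritz Carlton']
```

The toy tagger also hints at why the paper retrains each stage: tokens like "yess" and "2morrow" defeat tools trained on newswire, so errors made early in the pipeline compound downstream.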
- The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition
- Proposes a distantly supervised approach which applies LabeledLDA to leverage large amounts of unlabeled data in addition to large dictionaries of entities gathered from Freebase, and combines information about an entity’s context across its mentions
- Evaluates the performance of off-the-shelf, news-trained NLP tools when applied to Twitter
- Introduces a novel approach to distant supervision using Topic Models
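The joint segmentation-and-classification results (Table 12) are scored with exact-match F1 over typed entity spans. A minimal sketch of that standard metric follows; this is our illustration, not the authors' evaluation code.

```python
def entity_f1(gold, pred):
    """Exact-match F1 over (start, end, type) spans; a sketch of the
    standard metric, not the authors' evaluation code."""
    gold, pred = set(gold), set(pred)
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)  # span boundaries and type must both match
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

gold = {(0, 2, "PERSON"), (5, 6, "GEO-LOC")}
pred = {(0, 2, "PERSON"), (5, 6, "ORG"), (8, 9, "PRODUCT")}
print(entity_f1(gold, pred))  # ≈ 0.4
```

Note how unforgiving the metric is: the second prediction has the right span but the wrong type, so it counts as both a false positive and a false negative.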
- Edward Benson, Aria Haghighi, and Regina Barzilay. 2011. Event discovery in social media feeds. In The 49th Annual Meeting of the Association for Computational Linguistics, Portland, Oregon, USA.
- David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res.
- Avrim Blum and Tom M. Mitchell. 1998. Combining labeled and unlabeled data with co-training. In COLT, pages 92–100.
- Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05.
- Radu Florian. 2002. Named entity recognition as a house of cards: classifier stacking. In Proceedings of the 6th conference on Natural language learning - Volume 20, COLING-02.
- Eric N. Forsyth and Craig H. Martell. 2007. Lexical and discourse analysis of online chat dialog.
- Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Comput. Linguist.
- Andrew Carlson, Justin Betteridge, Richard C. Wang, Estevam R. Hruschka, Jr., and Tom M. Mitchell. 2010. Coupled semi-supervised learning for information extraction. In Proceedings of the third ACM international conference on Web search and data mining, WSDM ’10.
- Eugene Charniak, Curtis Hendrickson, Neil Jacobson, and Mike Perkowitz. 1993. Equations for part-of-speech tagging. In AAAI, pages 784–789.
- Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In Empirical Methods in Natural Language Processing.
- Doug Downey, Matthew Broadhead, and Oren Etzioni. 2007. Locating complex named entities in web text. In Proceedings of the 20th international joint conference on Artifical intelligence.
- Doug Downey, Oren Etzioni, and Stephen Soderland. 2010. Analysis of a probabilistic model of redundancy in unsupervised information extraction. Artif. Intell., 174(11):726–748.
- Micha Elsner, Eugene Charniak, and Mark Johnson. 2009. Structured generative models for unsupervised named-entity clustering. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL ’09.
- Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell.
- Keller, Justin Martineau, and Mark Dredze. 2010. Annotating named entities in Twitter data with crowdsourcing. In Proceedings of the NAACL Workshop on Creating Speech and Text Language Data With Amazon’s Mechanical Turk. Association for Computational Linguistics.
- Xiaohua Liu, Shaodian Zhang, Furu Wei, and Ming Zhou. 2011. Recognizing named entities in tweets. In ACL.
- Brian Locke and James Martin. 2009. Named entity recognition: Adapting to microblogging. Senior Thesis, University of Colorado.
- Mitchell P. Marcus, Beatrice Santorini, and Mary A. Marcinkiewicz. 1994. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics.
- Charles Sutton. 2004. Collective segmentation and labeling of distant entities in information extraction.
- Partha Pratim Talukdar and Fernando Pereira. 2010. Experiments in graph-based semi-supervised learning methods for class-instance acquisition. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1473–1481.
- Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.
- Tara McIntosh. 2010. Unsupervised discovery of negative categories in lexicon bootstrapping. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10.
- Einat Minkov, Richard C. Wang, and William W. Cohen. 2005. Extracting personal names from email: applying named entity recognition to informal text. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT ’05, pages 443–450, Morristown, NJ, USA. Association for Computational Linguistics.
- Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL-IJCNLP 2009.
- Christoph Muller and Michael Strube. 2006. Multi-level annotation of linguistic data with MMAX2. In Sabine Braun, Kurt Kohn, and Joybrato Mukherjee, editors, Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods, pages 197–214. Peter Lang, Frankfurt a.M., Germany.
- Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. 2009. Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, EMNLP ’09, pages 248–256, Morristown, NJ, USA. Association for Computational Linguistics.
- Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: chunking. In Proceedings of the 2nd workshop on Learning language in logic and the 4th conference on Computational natural language learning - Volume 7, ConLL ’00.
- Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03.
- Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10.
- Limin Yao, David Mimno, and Andrew McCallum. 2009. Efficient methods for topic model inference on streaming document collections. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining.
- David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd annual meeting on Association for Computational Linguistics, ACL ’95.
- Christina Sauper, Aria Haghighi, and Regina Barzilay. 2010. Incorporating content structure into text analysis applications. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 377–387, Morristown, NJ, USA. Association for Computational Linguistics.
- Fei Sha and Fernando Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03.
- Sameer Singh, Dustin Hillard, and Chris Leggetter. 2010. Minimally-supervised extraction of entities from text advertisements. In Human Language Technologies: Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT).