Unsupervised Domain Adaptation of Contextualized Embeddings for Sequence Labeling
EMNLP-IJCNLP 2019, pp. 4237–4247.
Contextualized word embeddings such as ELMo and BERT provide a foundation for strong performance across a wide range of natural language processing tasks by pretraining on large corpora of unlabeled text. However, the applicability of this approach is unknown when the target domain varies substantially from the pretraining corpus.
- Contextualized word embeddings are becoming a ubiquitous component of natural language processing (Dai and Le, 2015; Devlin et al., 2019; Howard and Ruder, 2018; Radford et al., 2018; Peters et al., 2018).
- All three corpora consist exclusively of text written since the late 20th century, and Wikipedia and news text are subject to restrictive stylistic constraints (Bryant et al., 2005). It is crucial to determine whether these pretrained models are transferable to texts from other periods or other stylistic traditions, such as historical documents, technical research papers, and social media.
- We show that a BERT-based part-of-speech tagger outperforms the state-of-the-art unsupervised domain adaptation method (Yang and Eisenstein, 2016), without taking any explicit steps to adapt to the target domain of Early Modern English.
- We evaluate on the task of part-of-speech tagging in the Penn Parsed Corpus of Early Modern English (PPCEME)
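Since BERT operates over WordPiece subtokens while POS tags are assigned per word, evaluating a BERT tagger on PPCEME requires mapping word-level tags onto subtokens. Below is a minimal sketch of the common first-subtoken labeling convention (an assumption for illustration; the summary above does not spell out the paper's alignment scheme), with a toy tokenizer standing in for BERT's:

```python
# Label only the first subtoken of each word; mask the continuation
# pieces from the loss, as with PyTorch's CrossEntropyLoss ignore_index.
IGNORE = -100

def align_labels(words, labels, tokenize):
    """tokenize(word) -> list of subtokens; returns (subtokens, aligned labels)."""
    subtokens, aligned = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        subtokens.extend(pieces)
        # first piece keeps the word's tag; remaining pieces are ignored
        aligned.extend([label] + [IGNORE] * (len(pieces) - 1))
    return subtokens, aligned

# Hypothetical toy tokenizer that splits archaic spellings into pieces:
toy = {"vnto": ["vn", "##to"], "speake": ["speak", "##e"]}
tokenize = lambda w: toy.get(w, [w])

toks, labs = align_labels(["he", "did", "speake", "vnto", "me"],
                          ["PRP", "VBD", "VB", "IN", "PRP"], tokenize)
# toks -> ['he', 'did', 'speak', '##e', 'vn', '##to', 'me']
# labs -> ['PRP', 'VBD', 'VB', -100, 'IN', -100, 'PRP']
```

At prediction time the same convention applies in reverse: only the tag predicted at each word's first subtoken is read off, so word-level accuracy is directly comparable across tokenizers.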
- Because we focus on unsupervised domain adaptation, it is not possible to produce tags in the historical English (PPCHE) tagset, which is not encountered at training time
- AdaptaBERT yields marginal improvements when domain-adaptive fine-tuning is performed on the Workshop on Noisy User-generated Text (WNUT) training set; expanding the target domain data with an additional million unlabeled tweets yields a 2.3% improvement over the BERT baseline.
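Domain-adaptive fine-tuning here means continuing BERT's masked-language-model objective on unlabeled target-domain text such as the extra tweets. The sketch below shows the standard BERT corruption scheme (select 15% of tokens; replace 80% of those with [MASK], 10% with a random token, and leave 10% unchanged); these hyperparameters follow the original BERT recipe and are assumed rather than quoted from this paper:

```python
import random

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted tokens, per-position reconstruction targets)."""
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)               # model must reconstruct this token
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")    # 80%: mask
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)         # 10%: keep original
        else:
            corrupted.append(tok)
            targets.append(None)              # no loss on unselected positions
    return corrupted, targets

rng = random.Random(0)
tokens = "the king did speake vnto the people".split()
corrupted, targets = mask_tokens(tokens, vocab=tokens, rng=rng)
```

No labels are needed: the unlabeled target-domain text itself supplies the reconstruction targets, which is what makes this adaptation step unsupervised.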
- This paper demonstrates the applicability of contextualized word embeddings to two difficult unsupervised domain adaptation tasks
- Fine-tuning to the task and domain each yield significant improvements in performance over the Frozen BERT baseline (Table 2, line 1).
- It is unsurprising that test set adaptation yields significant improvements, since it can yield useful representations of the names of the relevant entities, which might not appear in a random sample of tweets.
- This is a plausible approach for researchers who are interested in finding the key entities participating in such events in a pre-selected corpus of text.
- A potentially interesting side note is that while supervised fine-tuning in the target domain results in catastrophic forgetting of the source domain, unsupervised target domain tuning does not.
- This suggests the intriguing possibility of training a single contextualized embedding model that works well across a wide range of domains, genres, and writing styles.
- The authors plan to explore more thoroughly how to combine domain-adaptive and task-specific fine-tuning within the framework of continual learning (Yogatama et al., 2019), with the goal of balancing these apparently conflicting objectives.
- Table 1: Overview of domain tuning and task tuning.
- Table 2: Tagging accuracy on PPCEME, using the coarse-grained tagset. The unsupervised systems never see labeled data in the target domain of Early Modern English. † In line 4, "in-vocab" and "out-of-vocab" refer to the PPCEME training set vocabulary; for lines 1–3, they refer to the PTB training set.
- Table 3: Tagging accuracy on PPCEME, using the full PTB tagset, to compare with Yang and Eisenstein (2016).
- Table 4: Named entity segmentation performance on the WNUT test set and CoNLL test set A. Limsopatham and Collier (2016) had the winning system at the 2016 WNUT shared task. Their results are reprinted from their paper, which did not report performance on the CoNLL dataset.
- Adaptation in neural sequence labeling. Most prior work on adapting neural networks for NLP has focused on supervised domain adaptation, in which labeled data is available in the target domain (Mou et al., 2016). RNN-based models for sequence labeling can be adapted across domains by manipulating the input or output layers individually (e.g., Yang et al., 2016) or simultaneously (Lin and Lu, 2018). Unlike these approaches, we tackle unsupervised domain adaptation, which assumes only unlabeled instances in the target domain. In this setting, prior work has focused on domain-adversarial objectives, which construct an auxiliary loss based on the ability of an adversary to learn to distinguish the domains from a shared encoding of the input (Ganin et al., 2016; Purushotham et al., 2017). However, adversarial methods require balancing between at least two and as many as six different objectives (Kim et al., 2017), which can lead to instability (Arjovsky et al., 2017) unless the objectives are carefully weighted (Alam et al., 2018).
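The domain-adversarial objectives mentioned above are typically implemented with a gradient reversal layer (Ganin et al., 2016): the forward pass is the identity, but gradients flowing back to the shared encoder are negated and scaled by a coefficient lambda, so the encoder is pushed toward domain-invariant features. A minimal framework-free sketch of the layer's two passes:

```python
class GradReversal:
    """Gradient reversal layer: identity forward, negated gradient backward."""

    def __init__(self, lam=1.0):
        self.lam = lam  # scaling coefficient for the reversed gradient

    def forward(self, x):
        # identity in the forward pass: the domain classifier sees the
        # encoder's features unchanged
        return x

    def backward(self, grad):
        # negate and scale in the backward pass, so the encoder is updated
        # to *confuse* the domain classifier rather than help it
        return [-self.lam * g for g in grad]

grl = GradReversal(lam=0.5)
out = grl.forward([1.0, 2.0])    # -> [1.0, 2.0]
grad = grl.backward([1.0, 2.0])  # -> [-0.5, -1.0]
```

In an autograd framework this is a one-line custom op, but the manual version makes clear why tuning lambda against the task losses is exactly the balancing problem the passage above describes.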
- The research was supported by the National Science Foundation under award RI-1452443
- Firoj Alam, Shafiq Joty, and Muhammad Imran. 2018. Domain adaptation with adversarial training and graph embeddings. In Proceedings of the Association for Computational Linguistics (ACL), pages 1077–1087.
- Martin Arjovsky, Soumith Chintala, and Leon Bottou. 2017. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning (ICML), pages 214–223.
- Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang. 2013. How noisy social media text, how diffrnt social media sources. In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP), pages 356–364.
- Alistair Baron and Paul Rayson. 2008. Vard2: A tool for dealing with spelling variation in historical corpora. In Postgraduate conference in corpus linguistics.
- Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. SciBERT: Pretrained contextualized embeddings for scientific text. arXiv preprint arXiv:1903.10676.
- Susan L Bryant, Andrea Forte, and Amy Bruckman. 2005. Becoming Wikipedian: transformation of participation in a collaborative online encyclopedia. In Proceedings of the 2005 international ACM SIGGROUP conference on Supporting group work, pages 1–10. ACM.
- Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
- Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Neural Information Processing Systems (NIPS), pages 3079–3087.
- Xiang Dai, Sarvnaz Karimi, Ben Hachey, and Cecile Paris. 2019. Using similarity measures to select pretraining data for NER. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
- Hal Daume III and Daniel Marcu. 2006. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126.
- Stefania Degaetano-Ortlieb. 2018. Stylistic variation over 200 years of court proceedings according to gender and social class. In Proceedings of the Second Workshop on Stylistic Variation, pages 1–10.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
- Jacob Eisenstein. 2013. What to do about bad language on the internet. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 359–369.
- Meng Fang and Trevor Cohn. 2017. Model transfer for tagging low-resource languages using a bilingual dictionary. In Proceedings of the Association for Computational Linguistics (ACL).
- Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Francois Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(59):1–35.
- Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644.
- Bo Han, Paul Cook, and Timothy Baldwin. 2012. Automatically constructing a normalisation dictionary for microblogs. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pages 421–432.
- Martin Hilpert and Stefan Th Gries. 2016. Quantitative approaches to diachronic corpus linguistics. The Cambridge handbook of English historical linguistics, pages 36–53.
- Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the Association for Computational Linguistics (ACL), pages 328–339.
- Young-Bum Kim, Karl Stratos, and Dongchan Kim. 2017. Adversarial adaptation of synthetic or stale data. In Proceedings of the Association for Computational Linguistics (ACL), pages 1297–1307.
- Anthony Kroch, Beatrice Santorini, and Ariel Diertani. 2004. Penn-Helsinki Parsed Corpus of Early Modern English.
- Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746.
- Nut Limsopatham and Nigel Collier. 2016. Bidirectional LSTM for named entity recognition in twitter messages. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), pages 145–152, Osaka, Japan. The COLING 2016 Organizing Committee.
- Bill Yuchen Lin and Wei Lu. 2018. Neural adaptation layers for cross-domain named entity recognition. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
- Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
- Michael McCloskey and Neal J Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier.
- Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K Gray, Joseph P Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, et al. 2011. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182.
- Taesun Moon and Jason Baldridge. 2007. Part-of-speech tagging for Middle English through alignment and projection of parallel diachronic texts. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pages 390–399.
- Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How transferable are neural networks in NLP applications? In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pages 479–489.
- Aditi Muralidharan and Marti A Hearst. 2013. Supporting exploratory text analysis in literature study. Literary and linguistic computing, 28(2):283–295.
- Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
- Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 7–14, Florence, Italy. Association for Computational Linguistics.
- Sanjay Purushotham, Wilka Carvalho, Tanachat Nilanon, and Yan Liu. 2017. Variational recurrent adversarial deep domain adaptation. In Proceedings of the International Conference on Learning Representations (ICLR).
- Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.
- Alexander Robertson and Sharon Goldwater. 2018. Evaluating historical text normalization systems: How well do they generalize? In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 720–725.
- Benjamin Strauss, Bethany Toma, Alan Ritter, Marie-Catherine de Marneffe, and Wei Xu. 2016. Results of the WNUT16 named entity recognition shared task. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), pages 138–144.
- Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Conference on Natural Language Learning (CoNLL), pages 142–147.
- Romain Vuillemot, Tanya Clement, Catherine Plaisant, and Amit Kumar. 2009. What’s being said near “Martha”? Exploring name entities in literary text collections. In Symposium on Visual Analytics Science and Technology, pages 107–114. IEEE.
- Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. 2019. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848.
- Yi Yang and Jacob Eisenstein. 2015. Unsupervised multi-domain adaptation with feature embeddings. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
- Yi Yang and Jacob Eisenstein. 2016. Part-of-speech tagging for historical English. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
- Zhilin Yang, Ruslan Salakhutdinov, and William W Cohen. 2016. Transfer learning for sequence tagging with hierarchical recurrent networks. In Proceedings of the International Conference on Learning Representations (ICLR).
- Dani Yogatama, Cyprien de Masson d’Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, et al. 2019. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.
- Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the International Conference on Computer Vision (ICCV), pages 19–27.
- Yftah Ziser and Roi Reichart. 2018. Pivot based language modeling for improved neural domain adaptation. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 1241–1251.