Semi-Supervised Sequence Modeling with Cross-View Training

    EMNLP, pp. 1914-1925, 2018.

    Keywords: Cross-View Training, Combinatory Categorial Grammar, pre-training, NIPS

    Abstract:

    Unsupervised representation learning algorithms such as word2vec and ELMo improve the accuracy of many supervised NLP models, mainly because they can take advantage of large amounts of unlabeled text. However, the supervised models only learn from task-specific labeled data during the main training phase. We therefore propose Cross-View Training (CVT), a semi-supervised learning algorithm that improves the representations of a Bi-LSTM sentence encoder using a mix of labeled and unlabeled data. On labeled examples, standard supervised learning is used. On unlabeled examples, CVT teaches auxiliary prediction modules that see restricted views of the input (e.g., only part of a sentence) to match the predictions of the full model seeing the whole input. Since the auxiliary modules and the full model share intermediate representations, this in turn improves the full model. Moreover, we show that CVT is particularly effective when combined with multi-task learning. We evaluate CVT on five sequence tagging tasks, machine translation, and dependency parsing, achieving state-of-the-art results.

    Introduction
    • Deep learning models work best when trained on large amounts of labeled data. However, acquiring labels is costly, motivating the need for effective semi-supervised learning techniques that leverage unlabeled examples.
    • More recent work trains a Bi-LSTM sentence encoder to do language modeling and incorporates its context-sensitive representations into supervised models (Dai and Le, 2015; Peters et al., 2018).
    • Such pre-training methods perform unsupervised representation learning on a large corpus of unlabeled data followed by supervised training (a rough sketch of this two-phase recipe follows this list).
    • This paper presents Cross-View Training (CVT), a new self-training algorithm that works well for neural sequence models.
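    The following is a rough, hedged sketch of this two-phase recipe (pre-train on unlabeled text, then train a supervised model on top of the learned representations). It is not ELMo's or TagLM's actual configuration; the module names, sizes, and random data are illustrative placeholders.

```python
# Hedged sketch: bidirectional-LSTM language-model pre-training followed by
# supervised reuse of the context-sensitive states. All names/sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, NUM_TAGS = 10_000, 128, 5

class BiLSTMEncoder(nn.Module):
    """Shared sentence encoder producing context-sensitive token representations."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.lstm = nn.LSTM(DIM, DIM, batch_first=True, bidirectional=True)

    def forward(self, tokens):                     # tokens: (batch, seq_len)
        states, _ = self.lstm(self.embed(tokens))  # (batch, seq_len, 2 * DIM)
        return states                              # [..., :DIM] forward, [..., DIM:] backward

encoder = BiLSTMEncoder()
lm_head_fwd = nn.Linear(DIM, VOCAB)      # predicts the next token from forward states
lm_head_bwd = nn.Linear(DIM, VOCAB)      # predicts the previous token from backward states
tag_head = nn.Linear(2 * DIM, NUM_TAGS)  # supervised prediction module

unlabeled = torch.randint(0, VOCAB, (8, 20))   # plentiful unlabeled text
labeled = torch.randint(0, VOCAB, (2, 20))     # scarce task-specific labeled text
gold_tags = torch.randint(0, NUM_TAGS, (2, 20))

# Phase 1: unsupervised representation learning via bidirectional language modeling.
states = encoder(unlabeled)
fwd, bwd = states[:, :-1, :DIM], states[:, 1:, DIM:]
lm_loss = (F.cross_entropy(lm_head_fwd(fwd).reshape(-1, VOCAB), unlabeled[:, 1:].reshape(-1)) +
           F.cross_entropy(lm_head_bwd(bwd).reshape(-1, VOCAB), unlabeled[:, :-1].reshape(-1)))

# Phase 2: supervised training, reusing the same context-sensitive representations.
tag_logits = tag_head(encoder(labeled))
sup_loss = F.cross_entropy(tag_logits.reshape(-1, NUM_TAGS), gold_tags.reshape(-1))
```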
    Highlights
    • Deep learning models work best when trained on large amounts of labeled data
    • This paper presents Cross-View Training (CVT), a new self-training algorithm that works well for neural sequence models
    • Cross-View Training can be applied to a variety of tasks and neural architectures, but we focus on sequence modeling tasks where the prediction modules are attached to a shared Bi-LSTM encoder (see the sketch after this list)
    • We explore how Cross-View Training scales with dataset size by varying the amount of training data the model has access to
    • We propose Cross-View Training, a new method for semi-supervised learning
    • We achieve excellent results across seven NLP tasks, especially when Cross-View Training is combined with multi-task learning
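    Concretely, the idea can be pictured with the minimal sketch below: a primary prediction module sees the full bidirectional representation, auxiliary modules see restricted views of the input (here, only the forward-direction or only the backward-direction states), and on unlabeled data the auxiliary modules are trained to match the primary module's prediction. The specific views, sizes, loss weighting, and data are illustrative simplifications, not the paper's exact configuration.

```python
# Hedged sketch of Cross-View Training for sequence tagging. The restricted views
# and hyperparameters below are illustrative, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, NUM_TAGS = 10_000, 128, 5

class CVTTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.encoder = nn.LSTM(DIM, DIM, batch_first=True, bidirectional=True)
        self.primary = nn.Linear(2 * DIM, NUM_TAGS)  # sees the full view
        self.aux_fwd = nn.Linear(DIM, NUM_TAGS)      # restricted view: forward states only
        self.aux_bwd = nn.Linear(DIM, NUM_TAGS)      # restricted view: backward states only

    def forward(self, tokens):
        states, _ = self.encoder(self.embed(tokens))  # (batch, seq, 2 * DIM)
        fwd, bwd = states[..., :DIM], states[..., DIM:]
        return self.primary(states), self.aux_fwd(fwd), self.aux_bwd(bwd)

model = CVTTagger()

def supervised_loss(tokens, tags):
    """Labeled batch: standard cross-entropy on the primary prediction module."""
    primary, _, _ = model(tokens)
    return F.cross_entropy(primary.reshape(-1, NUM_TAGS), tags.reshape(-1))

def cvt_loss(tokens):
    """Unlabeled batch: auxiliary modules learn to match the full model's prediction."""
    primary, aux_fwd, aux_bwd = model(tokens)
    teacher = F.softmax(primary, dim=-1).detach()     # no gradient to the teacher
    loss = 0.0
    for aux_logits in (aux_fwd, aux_bwd):
        loss = loss + F.kl_div(F.log_softmax(aux_logits, dim=-1), teacher,
                               reduction="batchmean")
    return loss

# Training alternates between labeled and unlabeled minibatches; gradients from the
# auxiliary modules flow into the shared encoder, improving its representations.
labeled, gold = torch.randint(0, VOCAB, (2, 20)), torch.randint(0, NUM_TAGS, (2, 20))
unlabeled = torch.randint(0, VOCAB, (16, 20))
(supervised_loss(labeled, gold) + cvt_loss(unlabeled)).backward()
```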
    Methods
    • Figure (learning on a labeled example): for the sentence "She lives in Washington.", the shared Bi-LSTM encoder feeds the primary prediction module pθ, which is trained with the standard supervised loss (e.g., predicting the LOCATION label).
    • Figure (prediction modules): a primary module pθ and auxiliary modules (Auxiliary 1, ...) are attached to the same shared encoder.
    • Translation baselines: Stanford (Luong and Manning, 2015) and Google (Luong et al., 2017).
    • Systems compared in Table 1: Supervised, Virtual Adversarial Training*, Word Dropout*, ELMo*, ELMo + Multi-task*†, CVT*, CVT + Multi-task*†, and CVT + Multi-task + Large*†.
    • When combining ELMo with multi-task learning, the authors allow each task to learn its own weights for the ELMo embeddings going into each prediction module, as illustrated in the sketch below.
    • The authors found that applying dropout to the ELMo embeddings was crucial for achieving good performance.
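    One way to picture these two points is the sketch below. It assumes the usual ELMo recipe of a per-task, softmax-weighted combination of the language model's layer outputs; the layer count, dimensions, dropout rate, and the class name PerTaskELMoMix are illustrative assumptions rather than the paper's implementation.

```python
# Hedged sketch: each task gets its own learned mix of (pre-computed) ELMo layer
# outputs, with dropout applied to the mixed embedding. Names/sizes are illustrative.
import torch
import torch.nn as nn

NUM_LAYERS, ELMO_DIM = 3, 1024

class PerTaskELMoMix(nn.Module):
    """Per-task softmax weights (and a scale) over the ELMo layers, then dropout."""
    def __init__(self, dropout=0.5):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(NUM_LAYERS))
        self.scale = nn.Parameter(torch.ones(1))
        self.dropout = nn.Dropout(dropout)   # dropout on the embeddings was important

    def forward(self, layer_outputs):        # (num_layers, batch, seq, ELMO_DIM)
        w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        mixed = self.scale * (w * layer_outputs).sum(dim=0)
        return self.dropout(mixed)           # (batch, seq, ELMO_DIM)

# In a multi-task model, each prediction module mixes the same layers differently.
elmo_layers = torch.randn(NUM_LAYERS, 4, 20, ELMO_DIM)  # stand-in for real ELMo output
ner_input = PerTaskELMoMix()(elmo_layers)
pos_input = PerTaskELMoMix()(elmo_layers)
```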
    Results
    • CVT on its own outperforms or is comparable to the best previously published results on all tasks.
    • Figure 3 shows an example win for CVT over supervised learning.
    • Of the prior results listed in Table 1, only TagLM and ELMo are semi-supervised.
    • These methods first train an enormous language model on unlabeled data and incorporate the representations produced by the language model into a supervised classifier.
    Conclusion
    • The authors propose Cross-View Training, a new method for semi-supervised learning.
    • The authors' approach allows models to effectively leverage their own predictions on unlabeled data, training them to produce effective representations that yield accurate predictions even when some of the input is not available.
    • The authors achieve excellent results across seven NLP tasks, especially when CVT is combined with multi-task learning
    Tables
    • Table1: Results on the test sets. We report the mean score over 5 runs. Standard deviations in score are around 0.1 for NER, FGN, and translation, 0.02 for POS, and 0.05 for the other tasks; see the supplementary materials for results with standard deviations included. The +Large model has four times as many hidden units as the others, making it similar in size to the models that include ELMo. * denotes semi-supervised and † denotes multi-task
    • Table2: Dev set performance of multi-task CVT with and without producing all-tasks-labeled examples
    • Table3: Ablation study on auxiliary prediction modules for sequence tagging
    • Table4: Comparison of single-task models on the dev sets. “CVT-MT frozen” means we pretrain a CVT + multi-task model on five tasks, and then train only the prediction module for the sixth. “ELMo frozen” means we train prediction modules (but no LSTMs) on top of ELMo embeddings; a sketch of this frozen-encoder protocol follows this list
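    The “frozen” settings in Table 4 can be pictured with the sketch below, under the assumption that the pretrained encoder is kept fixed and only a freshly initialized prediction module receives gradient updates; all names, sizes, and data are illustrative placeholders.

```python
# Hedged sketch of the "frozen" comparison: keep a pretrained encoder fixed and
# train only a new prediction module for the additional task.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, NUM_TAGS = 256, 5
pretrained_encoder = nn.LSTM(DIM, DIM, batch_first=True, bidirectional=True)

# Freeze the encoder so no gradients reach the pretrained representations.
for p in pretrained_encoder.parameters():
    p.requires_grad = False

new_head = nn.Linear(2 * DIM, NUM_TAGS)              # the only trainable component
optimizer = torch.optim.Adam(new_head.parameters(), lr=1e-3)

embedded = torch.randn(4, 20, DIM)                   # stand-in for embedded input text
gold = torch.randint(0, NUM_TAGS, (4, 20))
with torch.no_grad():                                # frozen, feature-extractor style
    states, _ = pretrained_encoder(embedded)
loss = F.cross_entropy(new_head(states).reshape(-1, NUM_TAGS), gold.reshape(-1))
loss.backward()
optimizer.step()
```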
    Funding
    • Kevin is supported by a Google PhD Fellowship
    References
    • Philip Bachman, Ouais Alsharif, and Doina Precup. 2014. Learning with pseudo-ensembles. In NIPS.
    • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
    • Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In COLT.
    • Rich Caruana. 1997. Multitask learning. Machine Learning, 28:41–75.
    • Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. 2015. The IWSLT 2015 evaluation campaign. In International Workshop on Spoken Language Translation.
    • Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In INTERSPEECH.
    • Jason P.C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics.
    • Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML.
    • Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research.
    • Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP.
    • Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In NIPS.
    • Carl Doersch and Andrew Zisserman. 2017. Multi-task self-supervised visual learning. arXiv preprint arXiv:1708.07860.
    • Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In ICLR.
    • Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610.
    • Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A joint many-task model: Growing a neural network for multiple NLP tasks. In EMNLP.
    • Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In HLT-NAACL.
    • Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
    • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
    • Julia Hockenmaier and Mark Steedman. 2007. CCGbank: a corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3):355–396.
    • Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: the 90% solution. In HLT-NAACL.
    • Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL.
    • Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. 2017. Reinforcement learning with unsupervised auxiliary tasks. In ICLR.
    • Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, et al. 2009. What is the best multi-stage architecture for object recognition? In IEEE Conference on Computer Vision.
    • Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In EMNLP.
    • James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America, 114(13):3521–3526.
    • Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In NIPS.
    • Samuli Laine and Timo Aila. 2017. Temporal ensembling for semi-supervised learning. In ICLR.
    • Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. 2010. Convolutional networks and applications in vision. In ISCAS.
    • Zhizhong Li and Derek Hoiem. 2016. Learning without forgetting. In ECCV.
    • Minh-Thang Luong, Eugene Brevdo, and Rui Zhao. 2017. Neural machine translation (seq2seq) tutorial. https://github.com/tensorflow/nmt.
    • Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In ICLR.
    • Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domains. In IWSLT.
    • Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP.
    • Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In ACL.
    • Xuezhe Ma, Zecong Hu, Jingzhou Liu, Nanyun Peng, Graham Neubig, and Eduard Hovy. 2018. Stack-pointer networks for dependency parsing. In ACL.
    • Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
    • Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.
    • David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In ACL.
    • Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.
    • Takeru Miyato, Andrew M. Dai, and Ian Goodfellow. 2017. Adversarial training methods for semi-supervised text classification. In ICLR.
    • Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. 2016. Distributional smoothing with virtual adversarial training. In ICLR.
    • Hao Peng, Sam Thomson, and Noah A. Smith. 2017. Deep multitask learning for semantic dependency parsing. In ACL.
    • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.
    • Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.
    • Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
    • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. https://blog.openai.com/language-unsupervised.
    • Prajit Ramachandran, Peter J. Liu, and Quoc V. Le. 2017. Unsupervised pretraining for sequence to sequence learning. In EMNLP.
    • Marek Rei. 2017. Semi-supervised multitask learning for sequence labeling. In ACL.
    • Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In EMNLP.
    • Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
    • Sebastian Ruder and Barbara Plank. 2018. Strong baselines for neural semi-supervised learning under domain shift. In ACL.
    • Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. 2016. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NIPS.
    • H. Scudder. 1965. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 11(3):363–371.
    • Vikas Sindhwani and Mikhail Belkin. 2005. A co-regularization approach to semi-supervised learning with multiple views. In ICML Workshop on Learning with Multiple Views.
    • Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In ACL.
    • Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. 2017. Fast and accurate sequence labeling with iterated dilated convolutions. In EMNLP.
    • Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. In ICLR.
    • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
    • Antti Tarvainen and Harri Valpola. 2017. Weight-averaged consistency targets improve semi-supervised deep learning results. In Workshop on Learning with Limited Labeled Data, NIPS.
    • Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In CoNLL.
    • Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In HLT-NAACL.
    • Xiang Wei, Zixia Liu, Liqiang Wang, and Boqing Gong. 2018. Improving the improved training of Wasserstein GANs. In ICLR.
    • Huijia Wu, Jiajun Zhang, and Chengqing Zong. 2017. Shortcut sequence tagging. arXiv preprint arXiv:1701.00576.
    • Chang Xu, Dacheng Tao, and Chao Xu. 2013. A survey on multi-view learning. arXiv preprint arXiv:1304.5634.
    • David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In ACL.
    • Yuan Zhang and David Weiss. 2016. Stack-propagation: Improved representation learning for syntax. In ACL.
    • Zhi-Hua Zhou and Ming Li. 2005. Tri-training: Exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering.