Investigating the Working of Text Classifiers

    COLING, pp. 2120-2131, 2018.

    Keywords:
    natural language, classification accuracy, distributed bag-of-words, Convolutional Neural Network, neural network model

    Abstract:

    Text classification is one of the most widely studied tasks in natural language processing. Recently, larger and larger multilayer neural network models have been employed for the task, motivated by the principle of compositionality. Almost all of the reported methods use discriminative approaches for the task. Discriminative approaches come with...

    Introduction
    • Text classification is one of the fundamental tasks in natural language processing (NLP) in which the objective is to categorize text documents into one of the predefined classes.
    • An LSTM network pre-trained using language model parameters or sequence autoencoder parameters is used by Dai and Le (2015) for various text classification tasks.
    • It was shown by Johnson and Zhang (2015a) that a CNN with dynamic max pooling layer can effectively use the word order structure when trained using one-hot encoding representation of input words.
    • They perform semi-supervised experiments using a simplified LSTM.
    Highlights
    • Text classification is one of the fundamental tasks in natural language processing (NLP) in which the objective is to categorize text documents into one of the predefined classes
    • We present empirical studies to establish that (i) many text classifiers resort to just identifying key lexicons and therefore perform poorly on specially crafted dataset splits (Section 5.3), and (ii) simple regularization techniques that disincentivize focusing on key lexicons can significantly boost performance (Section 5.4)
    • A potential drawback is that since all neural network approaches are discriminative, they tend to identify key signals in the training data which may not generalize to test data. We investigate whether these neural network models learn to compose the meaning of sentences or just use discriminative keywords
    • To test the generalization ability of different state-of-the-art text classifiers, we construct hard datasets where the training and test splits have no direct overlap of lexicons (a construction sketch follows this list)
    • Our experiments with popular text classifiers show that there is a large drop in test classification accuracy between random and lexicon splits of these datasets
    • We show that simple regularization techniques such as keyword anonymization can substantially improve the performance of text classifiers
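
    The page does not describe exactly how these lexicon splits are constructed. As a minimal illustration only, the sketch below enforces "no direct overlap of lexicons" by routing every document that contains a held-out class-specific keyword to the test side; the toy documents and the choice of held-out keywords are assumptions, not the authors' procedure.

      def lexicon_split(docs, labels, holdout_keywords):
          # Documents containing any held-out keyword go to the test side;
          # the remaining documents form the training side, so the training
          # set never sees the held-out lexicons.
          train, test = [], []
          for tokens, label in zip(docs, labels):
              if holdout_keywords & set(tokens):
                  test.append((tokens, label))
              else:
                  train.append((tokens, label))
          return train, test

      # Illustrative usage with toy sentiment data.
      docs = [["wonderful", "film"], ["awful", "acting"],
              ["great", "plot"], ["terrible", "pacing"]]
      labels = ["pos", "neg", "pos", "neg"]
      train, test = lexicon_split(docs, labels, {"wonderful", "awful"})
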
    Methods
    • The authors will first briefly describe a simple neural network for text classification on which the proposed regularization methods are based.

      Let the vocabulary size be V and embedding dimension be D.
    • A BiLSTM encoder applied to anonymized training data with random embedding substitution performs around 2% better on the Arxiv abstracts dataset, around 2.5% better on the IMDB reviews dataset, and around 1% better on the ACL IMDB lexicon dataset.
    • This shows that keyword anonymization with random embedding substitution can be a good regularization strategy in the case of a lexicon-based split; a rough sketch of this setup follows this list.
    • One reason this method is more effective is that word dropout partially masks some lexical terms in the training set, thereby lowering the variance of the fitted model
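
    As a rough sketch of the setup described in this section, the code below wires together a minimal embedding + BiLSTM + max-pooling classifier (vocabulary size V, embedding dimension D) and a keyword-anonymization step that substitutes random vocabulary ids for class-specific keywords. It is written in PyTorch (which the reference list suggests the paper uses); the hidden size, the keyword list, and the exact substitution scheme are illustrative assumptions rather than the authors' implementation.

      import torch
      import torch.nn as nn

      class BiLSTMClassifier(nn.Module):
          # Embedding -> bidirectional LSTM -> max pooling over time -> linear layer.
          def __init__(self, V, D, hidden_size, num_classes):
              super().__init__()
              self.embed = nn.Embedding(V, D)   # V: vocabulary size, D: embedding dimension
              self.encoder = nn.LSTM(D, hidden_size, batch_first=True, bidirectional=True)
              self.out = nn.Linear(2 * hidden_size, num_classes)

          def forward(self, token_ids):         # token_ids: (batch, seq_len) word ids
              states, _ = self.encoder(self.embed(token_ids))
              pooled, _ = states.max(dim=1)     # max pooling over the time dimension
              return self.out(pooled)

      def anonymize(token_ids, keyword_ids, V):
          # Replace occurrences of class-specific keywords with random vocabulary ids,
          # so their (random) embeddings carry no stable class-specific signal.
          out = token_ids.clone()
          mask = torch.zeros_like(out, dtype=torch.bool)
          for kid in keyword_ids:
              mask |= out == kid
          out[mask] = torch.randint(0, V, (int(mask.sum().item()),))
          return out

    In this sketch, anonymize would be applied to each training batch before the forward pass, while test batches are left untouched.
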
    Results
    • To estimate the difficulty of the lexicon and random split versions of each dataset, the authors run experiments with a wide range of popular approaches and report the results in Table 3.
    • The accuracy gap between the two dataset versions is largest for this method
    • The low scores of the above classifiers on the lexicon version can be attributed to the conditional independence assumption among the input features.
    • Bag-of-words models assign high weights to class-specific keywords (a baseline sketch follows this list)
    • When the trained model cannot spot such discriminative keywords in the test set because of the strict no-overlap condition, the performance of these methods degrades
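
    The bag-of-words baselines in Table 3 (tf-idf n-gram features followed by a Logistic Regression classifier) can be approximated in a few lines of scikit-learn, and inspecting the largest coefficients shows the reliance on class-specific keywords described above. The toy documents are placeholders; the paper uses the Arxiv abstracts and IMDB reviews datasets, among others.

      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression

      texts = ["a wonderful and moving film", "awful plot with terrible acting",
               "great performances throughout", "boring pacing and awful dialogue"]
      labels = ["pos", "neg", "pos", "neg"]

      vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # tf-idf weighted n-gram features
      X = vectorizer.fit_transform(texts)
      clf = LogisticRegression().fit(X, labels)

      # The highest-magnitude coefficients are exactly the class-specific keywords
      # that fail to transfer under the lexicon-based split.
      features = np.array(vectorizer.get_feature_names_out())
      order = np.argsort(clf.coef_[0])
      print("most negative:", features[order[:3]])
      print("most positive:", features[order[-3:]])
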
    Conclusion
    • Multilayer neural network models have gained wide popularity for text classification tasks due to their much better performance than traditional bag-of-words based approaches.
    • The authors' experiments with popular text classifiers show that there is a large drop in test classification accuracy between random and lexicon splits of these datasets.
    • The authors observe that an adaptive word dropout method based on the embedding layer’s gradient can further improve accuracy and reduce the gap between the two dataset splits (a speculative sketch follows)
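
    The summary only states that the adaptive dropout is "based on the embedding layer's gradient". One speculative reading, sketched below purely as an illustration and not as the authors' method, is to drop input words with probability proportional to the gradient magnitude their embedding rows accumulate, so that the words the model leans on most are masked most often.

      import torch

      def adaptive_dropout_probs(embedding_grad, max_p=0.5):
          # embedding_grad: (V, D) gradient of the embedding weight matrix after backward().
          # Words whose rows receive larger gradients get a higher dropout probability.
          norms = embedding_grad.norm(dim=1)
          return max_p * norms / (norms.max() + 1e-8)   # per-word probabilities in [0, max_p]

      def apply_word_dropout(token_ids, probs, unk_id=0):
          # Sample a per-token dropout mask from the per-word probabilities and replace
          # dropped tokens with a placeholder id (using id 0 here is an assumption).
          drop = torch.bernoulli(probs[token_ids]).bool()
          out = token_ids.clone()
          out[drop] = unk_id
          return out
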
    Tables
    • Table1: Example lexicons for training and test set of Arxiv abstracts and IMDB reviews datasets. The column headers in both the tables indicate the names of various classes in these datasets. For Arxiv abstracts dataset, the details of all the class names can be found in the URL: https://arxiv.org
    • Table2: This table shows the dataset summary statistics. c: Number of classes, l: Average length of a sentence, N : Dataset size, V : Vocabulary size
    • Table3: Classification accuracy of various models on the random and lexicon-based version of each dataset. We also show the accuracy difference (∆) between these two results. Naïve Bayes: n-gram feature extraction using tf-idf weighting followed by a Naïve Bayes classifier. Logistic Regression: n-gram feature extraction using tf-idf weighting followed by a Logistic Regression classifier. DocVec: Document vectors trained using the DBOW model (Le and Mikolov, 2014). FastText: Average of the word and subword embeddings (Joulin et al., 2017). Deep Sets (Zaheer et al., 2017): two-layer MLP on top of word embeddings with a max pooling layer. LSTM: Word-level LSTM as a document encoder in which the hidden state of the last time step is used for classification. BiLSTM: Word-level bidirectional LSTM as a document encoder in which the forward and backward LSTM hidden states are concatenated, followed by a max pooling layer (Conneau et al., 2017). CNN-MaxPool: CNN with a max pooling layer (Kim, 2014). CNN-DynMaxPool: CNN with a dynamic max pooling layer; for details on dynamic max pooling, we refer the reader to Johnson and Zhang (2015a). Adv-Training: Adversarial training of the encoder to fool the discriminator and make the representations of training and test instances domain invariant. Multi-task Learning: A shared BiLSTM encoder is used for joint training of the text classifier, a denoising autoencoder, and adversarial training. LSTM-Anon: LSTM encoder applied to the anonymized training data. BiLSTM-Anon: BiLSTM encoder applied to the anonymized training data. Adaptive Dropout: BiLSTM encoder applied after embedding-gradient-based adaptive word dropout
    Funding
    • This work was supported by generous research funding from CMU (MCDS students grant)
    Reference
    • [Bickel et al.2007] Steffen Bickel, Michael Brückner, and Tobias Scheffer. 2007. Discriminative learning for differing training and test distributions. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages 81–88, New York, NY, USA. ACM.
    • [Caruana1997] Rich Caruana. 1997. Multitask learning. Mach. Learn., 28(1):41–75, July.
    • [Conneau et al.2017] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680. Association for Computational Linguistics.
    • [Dai and Le2015] Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 3079–3087, Cambridge, MA, USA. MIT Press.
    • [Frege1948] Gottlob Frege. 1948. Sense and reference. The philosophical review, 57(3):209–230.
    • [Gal and Ghahramani2016] Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1019–1027. Curran Associates, Inc.
    • [Ganin et al.2016] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. J. Mach. Learn. Res., 17(1):2096–2030, January.
    • [Harris1954] Zellig S Harris. 1954. Distributional structure. Word, 10(2-3):146–162.
    • [Hermann et al.2015] Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada.
    • [Hill et al.2016] Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1367–1377. Association for Computational Linguistics.
    • [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780, November.
    • [Joachims1998] Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, ECML’98, pages 137–142, Berlin, Heidelberg. Springer-Verlag.
    • [Johnson and Zhang2015a] Rie Johnson and Tong Zhang. 2015a. Effective use of word order for text categorization with convolutional neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 103–112, Denver, Colorado, May–June. Association for Computational Linguistics.
    • [Johnson and Zhang2015b] Rie Johnson and Tong Zhang. 2015b. Semi-supervised convolutional neural networks for text categorization via region embedding. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pages 919–927, Cambridge, MA, USA. MIT Press.
    • [Johnson and Zhang2016] Rie Johnson and Tong Zhang. 2016. Supervised and semi-supervised text categorization using lstm for region embeddings. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 526–534. JMLR.org.
    • [Joulin et al.2017] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain, April. Association for Computational Linguistics.
    • [Kim2014] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October. Association for Computational Linguistics.
    • [Kingma and Ba2014] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
    • [Lample et al.2017] Guillaume Lample, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.
    • [Le and Mikolov2014] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pages II–1188–II–1196. JMLR.org.
    • [Lecun et al.1998] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov.
    • [Maas et al.2011] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.
    • [Manning et al.2008] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
    • [McCallum and Nigam1998] Andrew McCallum and Kamal Nigam. 1998. A comparison of event models for naive bayes text classification. In Proceedings of AAAI-98, Workshop on Learning for Text Categorization, pages 41–48. AAAI Press.
    • [Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, pages 3111–3119, USA. Curran Associates Inc.
    • [Pascanu et al.2013] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML’13, pages III–1310–III–1318. JMLR.org.
    • [Paszke et al.2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. In NIPS-W.
    • [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October. Association for Computational Linguistics.
    • [Peters et al.2018] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.
    • [Schuster and Paliwal1997] M. Schuster and K.K. Paliwal. 1997. Bidirectional recurrent neural networks. Trans. Sig. Proc., 45(11):2673–2681, November.
    • [Shimodaira2000] Hidetoshi Shimodaira. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227 – 244.
    • [Srivastava et al.2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January.
    • [Wang and Manning2012] Sida Wang and Christopher D. Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, ACL ’12, pages 90–94, Stroudsburg, PA, USA. Association for Computational Linguistics.
    • [Zaheer et al.2017] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. 2017. Deep sets. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3391– 3401. Curran Associates, Inc.
    • [Zhang et al.2017] Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2017. Aspect-augmented adversarial networks for domain adaptation. Transactions of the Association for Computational Linguistics, 5:515–528.