Large-Scale Hierarchical Text Classification with Recursively Regularized Deep Graph-CNN

    WWW '18: The Web Conference 2018, Lyon, France, April 2018, pp. 1063–1072.

    Keywords:
    different level, Deep Graph Convolutional Neural Networks, Hierarchically Regularized Support Vector Machines, Hierarchical Attention Network [45], Support Vector Machines
    Weibo:
    We present a deep graph convolutional neural network model to perform large-scale hierarchical text classification.

    Abstract:

    Text classification to a hierarchical taxonomy of topics is a common and practical problem. Traditional approaches simply use bag-of-words and have achieved good results. However, when there are a lot of labels with different topical granularities, bag-of-words representation may not be enough. Deep learning models have been proven to be …

    Introduction
    • Topical text classification is a fundamental text mining problem for many applications, such as news classification [18] and question answering.

      Recently, deep learning has been proven effective for end-to-end learning of hierarchical feature representations, and has made groundbreaking progress in object recognition in computer vision and in speech recognition [24].
    • A simple mechanism is to recursively convolve the nearby lower-level vectors in the sequence to compose higher-level vectors [8].
    • This way of using CNNs evaluates the semantic compositionality of consecutive words, which corresponds to the n-grams used in traditional text modeling [1] (a minimal sketch follows this list).
    • Similar to images, such convolution can naturally represent different levels of semantics in the text data.
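    To make the composition concrete, here is a minimal NumPy sketch (not the authors' implementation) of how a convolution with receptive field n over a sequence of word vectors acts as an n-gram feature detector; the filter shapes, ReLU, and max-pooling are illustrative assumptions.

```python
import numpy as np

def ngram_convolution(embeddings, filters, n=3):
    """Compose every window of n consecutive word vectors into a higher-level
    feature vector, the convolutional analogue of n-gram features.
    embeddings: (seq_len, dim); filters: (num_filters, n * dim)."""
    seq_len, _ = embeddings.shape
    outputs = []
    for start in range(seq_len - n + 1):
        window = embeddings[start:start + n].reshape(-1)    # concatenate n word vectors
        outputs.append(np.maximum(filters @ window, 0.0))   # linear filter + ReLU
    return np.stack(outputs)                                 # (seq_len - n + 1, num_filters)

# Toy usage: 10 words with 50-dim embeddings, 64 trigram filters.
emb = np.random.randn(10, 50)
filt = np.random.randn(64, 3 * 50) * 0.1
doc_features = ngram_convolution(emb, filt).max(axis=0)      # max-pool over positions -> (64,)
```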
    Highlights
    • Topical text classification is a fundamental text mining problem for many applications, such as news classification [18] and question answering.

      Recently, deep learning has been proven effective for end-to-end learning of hierarchical feature representations, and has made groundbreaking progress in object recognition in computer vision and in speech recognition [24].
    • Recurrent neural networks are more powerful on short messages and on word-level syntax or semantics [3].
    • Different from recurrent neural networks, convolutional neural networks use convolutional masks to sequentially convolve over the data.
    • We propose a Hierarchically Regularized Deep Graph-CNN (HR-DGCNN) framework to tackle the above problems (a sketch of the recursive regularization term follows this list).
    • We present a deep graph convolutional neural network model to perform large-scale hierarchical text classification.
    • Experiments against both traditional state-of-the-art text classification models and recently developed deep learning models show that our approach significantly improves the results on two datasets, RCV1 and NYTimes.
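    The recursive (hierarchical) regularization follows the spirit of Gopal and Yang's recursive regularization (cited in the reference list below): each label's classifier weights are encouraged to stay close to the weights of its parent in the taxonomy. A minimal sketch under that reading; the array shapes, the parent map, and the coefficient lam are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def recursive_regularization(W, parent, lam=1e-3):
    """Recursive (hierarchical) regularization: penalize the squared distance
    between each label's weight vector and its parent's, so labels that are
    close in the taxonomy learn similar classifiers.
    W: (num_labels, dim); parent: dict mapping child index -> parent index."""
    penalty = 0.0
    for child, par in parent.items():
        diff = W[child] - W[par]
        penalty += np.dot(diff, diff)
    return 0.5 * lam * penalty

# Toy taxonomy: node 0 is the root, 1 and 2 are its children, 3 is a child of 1.
W = np.random.randn(4, 8)
parent = {1: 0, 2: 0, 3: 1}
reg = recursive_regularization(W, parent)   # added to the classification loss during training
```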
    Methods
    • Methods for Comparison

      The authors compare both traditional hierarchical text classification baselines and modern deep learning based classification algorithms.

      Flat baselines: The authors used both Logistic Regression (LR) and Support Vector Machines (SVM) to learn from data.
    • The authors augment the simple CNN model [20] to three and six layers, which they call deep CNN (DCNN-3 and DCNN-6); one of the deeper layer configurations is conv5-64, maxpool2, conv5-128, maxpool2, conv5-256, maxpool2, conv5-512, maxpool2, conv5-512, maxpool2, conv5-1024, maxpool2, FC-2048, FC-1024, FC-1024 (a hedged layer-by-layer sketch follows this list).
    • Deep Graph-CNN Models.
    • The configurations of the different CNN layers are shown in Table 2.
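    For concreteness, a hedged PyTorch-style sketch of a sequential six-block stack matching the conv5-64 ... FC-1024 listing above (it illustrates the layer notation, not the graph convolution itself); the input length, embedding dimension, label count, and the final classification head are placeholder assumptions rather than the paper's reported configuration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One "conv5-<out_ch>" layer followed by "maxpool2", in the notation used above.
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=5, padding=2),
        nn.ReLU(inplace=True),
        nn.MaxPool1d(kernel_size=2),
    )

class DeepTextCNN(nn.Module):
    """Six conv blocks plus three fully connected layers, mirroring the
    conv5-64 ... FC-1024 listing; all sizes below are illustrative placeholders."""

    def __init__(self, embed_dim=50, num_labels=103, seq_len=1024):
        super().__init__()
        widths = [64, 128, 256, 512, 512, 1024]
        blocks, in_ch = [], embed_dim
        for w in widths:
            blocks.append(conv_block(in_ch, w))
            in_ch = w
        self.features = nn.Sequential(*blocks)
        flat = in_ch * (seq_len // 2 ** len(widths))        # length halves at every maxpool2
        self.classifier = nn.Sequential(
            nn.Linear(flat, 2048), nn.ReLU(inplace=True),   # FC-2048
            nn.Linear(2048, 1024), nn.ReLU(inplace=True),   # FC-1024
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),   # FC-1024
            nn.Linear(1024, num_labels),                    # per-label scores (assumed head)
        )

    def forward(self, x):  # x: (batch, embed_dim, seq_len) embedding "image" of a document
        return self.classifier(self.features(x).flatten(1))

# Toy forward pass: 2 documents, 50-dim embeddings, 1024 token positions.
scores = DeepTextCNN()(torch.randn(2, 50, 1024))            # shape (2, num_labels)
```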
    Conclusion
    • The authors present a deep graph CNN model to perform large-scale hierarchical text classification.
    • The authors leverage the convolutional power of semantic composition to generate document representations for topic classification.
    • Experiments against both traditional state-of-the-art text classification models and recently developed deep learning models show that the approach significantly improves the results on two datasets, RCV1 and NYTimes.
    • The authors plan to extend the deep graph CNN model to other complex text classification datasets and applications.
    Tables
    • Table1: Dataset statistics. The training/test split for RCV1 is done by [27]. The training/test split for NYTimes is done by ourselves: 90% for training and 10% for test. For both datasets, we randomly sample 10% of the training data as a development set
    • Table2: DGCNN Configurations. The convolutional layer parameters are denoted as "conv⟨receptive field size⟩-⟨number of channels⟩". The results shown in Table 7 lead to conclusions consistent with those on the RCV1 dataset; the difference for NYTimes is that it has much larger training data
    • Table3: Comparison between stemmed and original words on RCV1 dataset
    • Table4: Comparison among different sub-graph numbers (N ) and normalized sub-graph sizes (d) on RCV1 dataset
    • Table5: Comparison of results on RCV1 dataset
    • Table6: Comparison of training time based on GPU and CPU. (Test evaluations for all the models were performed by CPU.)
    • Table7: Comparison of results on NYtimes dataset
    • Table8: Comparison of training time and results on the NYTimes dataset. The evaluations for the stand-alone (Native) and recursive hierarchical segmentation (RHS) programs were performed with DGCNN-6 on GPU
    • Table9: Number of parameters (in millions)
    Related work
    • In this section, we briefly review the related work in the following two categories.

      2.1 Traditional Text Classification

      Traditional text classification uses feature engineering (e.g., extracting features beyond BOW) and feature selection to obtain good features for text classification [1]. Dimensionality reduction can also be used to shrink the feature space. For example, Latent Dirichlet Allocation [4] has been used to extract "topics" from a corpus and then represent documents in the topic space. It can be better than BOW when the number of features is small; however, as the vocabulary grows, it shows no advantage over BOW on text classification [4]. There is also existing work on converting texts to graphs [34, 42]. Similar to us, they used co-occurrence to construct graphs from texts, and then either applied similarity measures on the graphs to define new document similarities [42] or applied graph mining to find frequent sub-graphs in the corpus as new text features [34]. Both showed positive results for small label spaces, but graph mining is more costly than our approach, which simply performs breadth-first search (a sketch follows this paragraph).
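    As a rough illustration of the graph-of-words idea and the breadth-first-search extraction mentioned above (not the authors' code): the sketch below builds a co-occurrence graph with a small sliding window and collects a fixed-size sub-graph around a word; the window size, node ordering, and sub-graph size are illustrative assumptions.

```python
from collections import defaultdict, deque

def cooccurrence_graph(tokens, window=3):
    """Graph of words: link two words whenever they co-occur inside a sliding window."""
    graph = defaultdict(set)
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            if v != w:
                graph[w].add(v)
                graph[v].add(w)
    return graph

def bfs_subgraph(graph, start, max_nodes=5):
    """Collect a fixed-size sub-graph around a start word by breadth-first search."""
    visited, queue = [start], deque([start])
    while queue and len(visited) < max_nodes:
        for nbr in sorted(graph[queue.popleft()]):
            if nbr not in visited and len(visited) < max_nodes:
                visited.append(nbr)
                queue.append(nbr)
    return visited

tokens = "hierarchical text classification with deep graph cnn for text".split()
g = cooccurrence_graph(tokens)
print(bfs_subgraph(g, "text"))   # 'text' plus a few of the words it co-occurs with
```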
    Funding
    • This work is supported by the NSFC program (Nos. 61472022, 61772151, 61421003) and partly by the Beijing Advanced Innovation Center for Big Data and Brain Computing.
    • Yangqiu Song is supported by the China 973 Fundamental R&D Program (No. 2014CB340304) and the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. 26206717).
    • Qiang Yang is supported by the China 973 Fundamental R&D Program (No. 2014CB340304) and Hong Kong CERG projects 16211214, 16209715, and 16244616.
    Reference
    • Charu C. Aggarwal and ChengXiang Zhai. 2012. A Survey of Text Classification Algorithms. In Mining Text Data. 163–222.
    • Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. 2013. Multi-label Learning with Millions of Labels: Recommending Advertiser Bid Phrases for Web Pages. In WWW. 13–24.
    • Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research 3 (2003), 1137–1155.
    • David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
    • Lijuan Cai and Thomas Hofmann. 2004. Hierarchical Document Categorization with Support Vector Machines. In CIKM. 78–87.
    • Hao Chen and Susan Dumais. 2000. Bringing Order to the Web: Automatically Categorizing Search Results. In CHI. 145–152.
    • Huimin Chen, Maosong Sun, Cunchao Tu, Yankai Lin, and Zhiyuan Liu. 2016. Neural Sentiment Classification with User and Product Attention. In EMNLP. 1650–1659.
    • Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research 12 (2011), 2493–2537.
    • Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2016. Very Deep Convolutional Networks for Text Classification. (2016).
    • Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In NIPS. 3837–3845.
    • Susan Dumais and Hao Chen. 2000. Hierarchical Classification of Web Content. In SIGIR. 256–263.
    • Eva Gibaja and Sebastián Ventura. 2015. A Tutorial on Multilabel Learning. ACM Computing Surveys 47, 3 (2015), 52.
    • Siddharth Gopal and Yiming Yang. 2013. Recursive Regularization for Large-Scale Classification with Hierarchical and Graphical Dependencies. In KDD. 257–265.
    • Siddharth Gopal and Yiming Yang. 2015. Hierarchical Bayesian Inference and Recursive Regularization for Large-Scale Classification. ACM Transactions on Knowledge Discovery from Data 9, 3 (2015), 18.
    • Siddharth Gopal, Yiming Yang, Bing Bai, and Alexandru Niculescu-Mizil. 2012. Bayesian Models for Large-Scale Hierarchical Classification. In NIPS. 2411–2419.
    • Mikael Henaff, Joan Bruna, and Yann LeCun. 2015. Deep Convolutional Networks on Graph-Structured Data. CoRR abs/1506.05163 (2015). http://arxiv.org/abs/1506.05163
    • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.
    • Thorsten Joachims. 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In ECML. 137–142.
    • Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A Convolutional Neural Network for Modelling Sentences. In ACL. 655–665.
    • Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In EMNLP. 1746–1751.
    • Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR.
    • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS. 1097–1105.
    • Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent Convolutional Neural Networks for Text Classification. In AAAI. 2267–2273.
    • Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep Learning. Nature 521 (2015), 436–444.
    • Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE (1998), 2278–2324.
    • Omer Levy and Yoav Goldberg. 2014. Neural Word Embedding as Implicit Matrix Factorization. In NIPS. 2177–2185.
    • David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research 5 (2004), 361–397.
    • Xin Li and Dan Roth. 2002. Learning Question Classifiers. In ACL. 1–7.
    • Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. 2017. Deep Learning for Extreme Multi-label Text Classification. In SIGIR.
    • Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma. 2005. Support Vector Machines Classification with a Very Large-Scale Taxonomy. ACM SIGKDD Explorations Newsletter 7, 1 (2005), 36–43.
    • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013).
    • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS. 3111–3119.
    • Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. 2016. Learning Convolutional Neural Networks for Graphs. In ICML. 2014–2023.
    • François Rousseau, Emmanouil Kiagias, and Michalis Vazirgiannis. 2015. Text Categorization as a Graph Classification Problem. In ACL.
    • Evan Sandhaus. 2008. The New York Times Annotated Corpus LDC2008T19. Linguistic Data Consortium.
    • Sam Scott and Stan Matwin. 1999. Feature Engineering for Text Classification. In ICML. 379–388.
    • Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR.
    • Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. 2011. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In NIPS. 801–809.
    • Aixin Sun and Ee-Peng Lim. 2001. Hierarchical Text Classification and Evaluation. In ICDM. 521–528.
    • Duyu Tang, Bing Qin, and Ting Liu. 2015. Document Modeling with Gated Recurrent Neural Network for Sentiment Classification. In EMNLP. 1422–1432.
    • Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. 2005. Large Margin Methods for Structured and Interdependent Output Variables. Journal of Machine Learning Research 6 (2005), 1453–1484.
    • Wei Wang, Diep Bich Do, and Xuemin Lin. 2005. Term Graph Model for Text Classification. In International Conference on Advanced Data Mining and Applications. 19–30.
    • Lin Xiao, Dengyong Zhou, and Mingrui Wu. 2011. Hierarchical Classification via Orthogonal Transfer. In ICML. 801–808.
    • Gui-Rong Xue, Dikan Xing, Qiang Yang, and Yong Yu. 2008. Deep Classification in Large-Scale Text Hierarchies. In SIGIR. 619–626.
    • Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical Attention Networks for Document Classification. In NAACL-HLT. 1480–1489.
    • Min-Ling Zhang and Zhi-Hua Zhou. 2014. A Review on Multi-Label Learning Algorithms. IEEE Transactions on Knowledge and Data Engineering 26, 8 (2014), 1819–1837.
    • Wenjie Zhang, Liwei Wang, Junchi Yan, Xiangfeng Wang, and Hongyuan Zha. 2017. Deep Extreme Multi-label Learning. CoRR abs/1704.03718 (2017).
    • Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In NIPS. 649–657.