Maximum Reconstruction Estimation for Generative Latent-Variable Models

AAAI, pp. 3173-3179, 2017.

Keywords:
latent-variable probabilistic context-free grammars, common correlation, hidden Markov model, variation of information, maximum likelihood estimation
We develop tractable algorithms to directly learn hidden Markov models and IBM translation models using the MRE criterion, without the need to introduce a separate reconstruction model to facilitate efficient inference.

Abstract:

Generative latent-variable models are important for natural language processing due to their capability of providing compact representations of data. As conventional maximum likelihood estimation (MLE) is prone to focus on explaining irrelevant but common correlations in data, we apply maximum reconstruction estimation (MRE) to learning generative latent-variable models ...
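
To make the contrast concrete, the following minimal sketch (ours, not the authors' code) computes both objectives for a toy HMM by brute-force enumeration. The model sizes, random parameters, and the example sentence are illustrative assumptions; the reconstruction probability P(x|x; θ) = Σ_y P(y|x; θ) P(x|y; θ) follows the MRE criterion as described in the abstract.

```python
# Minimal sketch (not the paper's code): contrast the MLE objective P(x; theta)
# with the MRE objective P(x|x; theta) = sum_y P(y|x; theta) P(x|y; theta)
# for a tiny HMM, using brute-force enumeration of latent tag sequences.
import itertools
import numpy as np

K, V = 3, 4                               # hidden states, vocabulary size (assumed)
rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(K))            # initial state distribution
A = rng.dirichlet(np.ones(K), size=K)     # transition probabilities p(z'|z)
B = rng.dirichlet(np.ones(V), size=K)     # emission probabilities p(x|z)

x = [0, 2, 1]                             # an observed sentence (word ids, made up)

def joint(y, x):
    """P(x, y) under the HMM."""
    p = pi[y[0]] * B[y[0], x[0]]
    for n in range(1, len(x)):
        p *= A[y[n - 1], y[n]] * B[y[n], x[n]]
    return p

def emit_only(y, x):
    """P(x | y): reconstruct each word from its latent tag."""
    p = 1.0
    for n in range(len(x)):
        p *= B[y[n], x[n]]
    return p

seqs = list(itertools.product(range(K), repeat=len(x)))

# MLE objective: marginal likelihood P(x) = sum_y P(x, y)
p_x = sum(joint(y, x) for y in seqs)

# MRE objective: reconstruction probability
# P(x|x) = sum_y P(y|x) P(x|y), with P(y|x) = P(x, y) / P(x)
p_x_given_x = sum(joint(y, x) / p_x * emit_only(y, x) for y in seqs)

print("MLE objective  P(x)   =", p_x)
print("MRE objective  P(x|x) =", p_x_given_x)
```

In the paper the brute-force sum over latent sequences is replaced by dynamic programming (see the backward recursion noted under Conclusion).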

Introduction
  • The need to learn latent structures from unlabeled data arises in many different problems in natural language processing (NLP), including part-of-speech (POS) induction (Merialdo 1994; Johnson 2007), word alignment (Brown et al. 1993; Vogel, Ney, and Tillmann 1996), syntactic parsing (Klein and Manning 2004; Smith and Eisner 2005), and semantic parsing (Poon and Domingos 2009).
  • Generative latent-variable models such as hidden Markov models (HMMs) and latent-variable probabilistic context-free grammars (LPCFGs) have been widely used for unsupervised structured prediction due to their capability of providing compact representations of data.
  • While previous work has to maintain separate sets of model parameters for encoding and reconstruction to facilitate efficient inference (Ammar, Dyer, and Smith 2014), this approach directly learns the intended model parameters.
Highlights
  • The need to learn latent structures from unlabeled data arises in many different problems in natural language processing (NLP), including part-of-speech (POS) induction (Merialdo 1994; Johnson 2007), word alignment (Brown et al. 1993; Vogel, Ney, and Tillmann 1996), syntactic parsing (Klein and Manning 2004; Smith and Eisner 2005), and semantic parsing (Poon and Domingos 2009)
  • The Expectation-Maximization (EM) algorithm for maximum likelihood estimation runs for 100 iterations, and the exponentiated gradient (EG) algorithm with adaptive learning rate runs for 50 iterations, initialized from a basic hidden Markov model (Ammar, Dyer, and Smith 2014); a sketch of the EG update appears after this list
  • Comparison with maximum likelihood estimation: Table 1 compares maximum likelihood estimation and maximum reconstruction estimation
  • We find that maximum reconstruction estimation outperforms maximum likelihood estimation for 50-state hidden Markov models in terms of both many-to-one accuracy and variation of information, suggesting that our approach is capable of guiding the hidden Markov models to use latent structures to find intended correlations in the data
  • We have presented maximum reconstruction estimation for training generative latent-variable models such as hidden Markov models and IBM translation models
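
The EG update mentioned above multiplies each probability by the exponential of a scaled gradient and renormalizes, which keeps parameters on the probability simplex (Kivinen and Warmuth 1997). The sketch below is a generic illustration under that reading; the function names and the grow/shrink learning-rate rule are our assumptions, not the paper's exact schedule.

```python
# Generic sketch of exponentiated-gradient (EG) ascent on a probability
# distribution (e.g., one row of HMM emission probabilities).  The adaptive
# learning-rate rule (grow on improvement, shrink otherwise) is illustrative.
import numpy as np

def eg_step(theta, grad, lr):
    """Multiplicative update followed by renormalization onto the simplex."""
    new = theta * np.exp(lr * grad)
    return new / new.sum()

def eg_train(theta, objective_and_grad, lr=0.1, iters=50):
    best = -np.inf
    for _ in range(iters):
        obj, grad = objective_and_grad(theta)   # objective value and its gradient
        if obj > best:
            best, lr = obj, lr * 1.1            # objective improved: grow the step
        else:
            lr *= 0.5                           # objective dropped: shrink the step
        theta = eg_step(theta, grad, lr)
    return theta

# Usage (hypothetical): theta = eg_train(theta0, my_objective_and_grad)
```

Because the update is multiplicative and renormalized, parameters stay non-negative and sum to one, which is why EG is a natural fit for the probability parameters of HMMs and translation models.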
Methods
  • The authors evaluated the approach on two unsupervised NLP tasks: part-of-speech induction and word alignment.
Results
  • Evaluation on Part-of-Speech Induction. Setting: The authors split the English Penn Treebank into two parts: 46K sentences for training and testing, and 1K sentences for optimizing the hyper-parameters of the exponentiated gradient (EG) algorithm with adaptive learning rate.
  • The reconstruction probability of training examples under model parameters learned by MLE, i.e., P(x|x; θ_MLE), is e^{-105}.
  • The evaluation metric is alignment error rate (AER) (Och and Ney 2003); a sketch of AER appears after this list.
  • Both MLE and MRE use the following training scheme: 5 iterations for IBM Model 1 and 5 iterations for IBM Model 2.
  • The authors distinguish between two translation directions: Chinese-to-English (C → E) and English-to-Chinese (E → C)
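
For reference, alignment error rate (Och and Ney 2003) scores a predicted alignment A against sure links S and possible links P (with S ⊆ P) as AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The snippet below is a straightforward implementation of this standard formula; the example links are made up.

```python
# Alignment error rate (Och and Ney 2003).
# A: predicted links, S: sure gold links, P: possible gold links (S ⊆ P);
# each link is a (source_position, target_position) pair.
def aer(A, S, P):
    A, S, P = set(A), set(S), set(P)
    return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))

# Example: three predicted links against two sure and three possible links.
A = {(0, 0), (1, 2), (2, 1)}
S = {(0, 0), (2, 1)}
P = {(0, 0), (1, 1), (2, 1)}
print(aer(A, S, P))   # 0.2
```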
Conclusion
  • The authors have presented maximum reconstruction estimation for training generative latent-variable models such as hidden Markov models and IBM translation models.
  • The authors plan to apply the approach to more generative latent-variable models such as probabilistic context-free grammars and explore the possibility of developing new training algorithms that minimize reconstruction errors.
  • Calculating Expectations for MRE Training of Hidden Markov Models: the backward probability β_n(z) is defined recursively as β_n(z) = Σ_{z'} p(z'|z) p(x_{n+1}|z')² β_{n+1}(z') for n < N, with a base case at n = N (Eq. 33); a sketch of this recursion appears after this list.
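
Under the reading above, where each emission probability appears squared because a word is both encoded and reconstructed, the recursion can be implemented as a standard backward pass. The base case β_N(z) = 1 and the exact indexing below are our assumptions for illustration.

```python
# Dynamic-programming sketch of the backward pass suggested by Eq. (33):
# beta_n(z) accumulates transition probabilities times squared emission
# probabilities p(x_{n+1}|z')^2 under the MRE objective.  The base case
# beta_N(z) = 1 is an assumption.
import numpy as np

def backward_mre(A, B, x):
    """A: K x K transitions p(z'|z); B: K x V emissions p(x|z); x: word ids."""
    N, K = len(x), A.shape[0]
    beta = np.zeros((N, K))
    beta[N - 1] = 1.0                                        # assumed base case
    for n in range(N - 2, -1, -1):
        # beta_n(z) = sum_{z'} p(z'|z) * p(x_{n+1}|z')^2 * beta_{n+1}(z')
        beta[n] = A @ (B[:, x[n + 1]] ** 2 * beta[n + 1])
    return beta
```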
Tables
  • Table 1: Comparison of MLE and MRE on HMMs for unsupervised part-of-speech induction. The evaluation metrics are many-to-one accuracy (accuracy) and variation of information (VI); a sketch of both metrics appears after this list
  • Table 2: Effect of training corpus size
  • Table 3: Example emission probabilities for the POS tag "VBD" (verb, past tense)
  • Table 4: Comparison between CRF Autoencoders and MRE on unsupervised part-of-speech induction
  • Table 5: Comparison between MLE and MRE on IBM translation models for unsupervised word alignment. The evaluation metric is alignment error rate (AER)
  • Table 6: Example translation probabilities of the Chinese word "wenzhang"
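
For completeness, the two part-of-speech induction metrics in Table 1 can be computed as follows. This is a generic sketch of the standard definitions (many-to-one accuracy maps each induced cluster to its most frequent gold tag; VI = H(gold|pred) + H(pred|gold), lower is better), not the authors' evaluation script, and the tiny example at the end is made up.

```python
# Many-to-one accuracy and variation of information for POS induction.
from collections import Counter
import math

def many_to_one(gold, pred):
    """Map each induced cluster to its most frequent gold tag, then score accuracy."""
    best = {}
    for c in set(pred):
        counts = Counter(g for g, p in zip(gold, pred) if p == c)
        best[c] = counts.most_common(1)[0][0]
    return sum(g == best[p] for g, p in zip(gold, pred)) / len(gold)

def vi(gold, pred):
    """Variation of information: H(G|P) + H(P|G) = 2 H(G,P) - H(G) - H(P)."""
    n = len(gold)
    joint, pg, pp = Counter(zip(gold, pred)), Counter(gold), Counter(pred)
    h = lambda counts: -sum(c / n * math.log(c / n) for c in counts.values())
    return 2 * h(joint) - h(pg) - h(pp)

gold = ["DT", "NN", "VBD", "DT", "NN"]    # gold tags (toy example)
pred = [1, 2, 3, 1, 2]                    # induced cluster ids
print(many_to_one(gold, pred), vi(gold, pred))   # 1.0 0.0
```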
Funding
  • This research is supported by the 863 Program (2015AA015407), the National Natural Science Foundation of China (Nos. 61522204, 61361136003, and 61532001), a 1000 Talent Plan grant, Tsinghua Initiative Research Program grant 20151080475, a Google Faculty Research Award, and a fund from the Online Education Research Center, Ministry of Education (No. 2016ZD102)
Reference
  • Alain, G.; Bengio, Y.; Yao, L.; Yosinski, J.; Thibodeau-Laufer, E.; Zhang, S.; and Vincent, P. 2015. GSNs: Generative stochastic networks. arXiv:1503.05571.
  • Ammar, W.; Dyer, C.; and Smith, N. 2014. Conditional random field autoencoders for unsupervised structured prediction. In Proceedings of NIPS 2014.
  • Bagos, P.; Liakopoulos, T.; and Hamodrakas, S. 2004. Faster gradient descent training of hidden Markov models, using individual learning rate adaptation. In Grammatical Inference: Algorithms and Applications. Springer.
  • Beal, M. J. 2003. Variational Algorithms for Approximate Bayesian Inference. Ph.D. thesis, University of London.
  • Bengio, Y. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning.
  • Brown, P.; Della Pietra, S.; Della Pietra, V.; and Mercer, R. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics.
  • Ganchev, K.; Graça, J.; Gillenwater, J.; and Taskar, B. 2010. Posterior regularization for structured latent variable models. Journal of Machine Learning Research.
  • Hinton, G.; Osindero, S.; and Teh, Y. 2006. Reducing the dimensionality of data with neural networks. Science.
  • Johnson, M. 2007. Why doesn't EM find good HMM POS-taggers? In Proceedings of EMNLP 2007.
  • Kivinen, J., and Warmuth, M. 1997. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation.
  • Klein, D., and Manning, C. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of ACL 2004.
  • Liu, Y., and Sun, M. 2015. Contrastive unsupervised word alignment with non-local features. In Proceedings of AAAI 2015.
  • Merialdo, B. 1994. Tagging English text with a probabilistic model. Computational Linguistics.
  • Och, F., and Ney, H. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics.
  • Poon, H., and Domingos, P. 2009. Unsupervised semantic parsing. In Proceedings of EMNLP 2009.
  • Smith, N., and Eisner, J. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of ACL 2005.
  • Socher, R.; Huang, E.; Pennington, J.; Ng, A.; and Manning, C. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of NIPS 2011.
  • Vincent, P.; Larochelle, H.; Bengio, Y.; and Manzagol, P. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of ICML 2008.
  • Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; and Manzagol, P. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research.
  • Vogel, S.; Ney, H.; and Tillmann, C. 1996. HMM-based word alignment in statistical translation. In Proceedings of COLING 1996.