Adversarial Mutual Information for Text Generation

ICML 2020.

Keywords:
sequence transduction model, generative adversarial network, text generation, maximum likelihood estimation, maximum mutual information

Abstract:

Recent advances in maximizing mutual information (MI) between the source and target have demonstrated its effectiveness in text generation. However, previous works paid little attention to modeling the backward network of MI (i.e., the dependency from the target to the source), which is crucial to the tightness of the variational information maximization lower bound. …

Introduction
  • Generating diverse and meaningful text is one of the coveted goals in machine learning research.
  • Most sequence transduction models for text generation can be trained efficiently by maximum likelihood estimation (MLE) and have demonstrated strong performance in various tasks, such as dialog generation (Serban et al., 2016; Park et al., 2018), machine translation (Bahdanau et al., 2014; Vaswani et al., 2017), and document summarization (See et al., 2017).
Highlights
  • Generating diverse and meaningful text is one of the coveted goals in machine learning research
  • Compared to the mutual information (MMI) baseline, our method performs significantly better on both relevance and diversity metrics, which indicates that our adversarial training scheme improves the quality of the backward model and thereby provides more credible rewards to the forward network
  • In ablation experiments, replacing our proposed latent noise sampling with beam search degrades both metrics, which suggests that this sampling strategy enlarges the search space at the level of high-level structure while still maximizing the mutual information between the output and the source text (a rough sketch of this sampling strategy follows this list)
  • We introduced Adversarial Mutual Information (AMI), a novel text generation framework that addresses a minimax game to iteratively learn and optimize the mutual information between the source and target text
  • The forward network in the framework is trained by playing against the backward network that aims to reconstruct the source text only if its input is in the real target distribution
  • The experimental results on two popular text generation tasks demonstrated the effectiveness of our framework, and we show that our method has the potential to yield a tighter lower bound on the mutual information objective
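The latent noise sampling ablation mentioned above can be pictured with the following minimal sketch, which perturbs the encoder's latent representation with Gaussian noise before decoding instead of running beam search over tokens. The `encoder`/`decoder` interfaces, the `greedy_decode` method, and the noise scale are illustrative assumptions, not the paper's released implementation.

    import torch

    def latent_noise_sampling(encoder, decoder, src, noise_std=0.1, n_samples=4):
        """Hypothetical sketch of latent noise sampling: draw diverse candidates
        by perturbing the latent summary of the source before decoding."""
        with torch.no_grad():
            latent = encoder(src)                    # high-level summary of the source text
            candidates = []
            for _ in range(n_samples):
                # Gaussian noise enlarges the search space at the latent
                # (high-level) structure rather than at the token level.
                noisy = latent + noise_std * torch.randn_like(latent)
                candidates.append(decoder.greedy_decode(noisy))
        return candidates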
Methods
  • To show the effectiveness of the approach, the authors use the bidirectional LSTM and the Transformer (Vaswani et al., 2017) as the base architectures.
  • The authors apply the AMI framework to dialog generation (§4.1) and neural machine translation (§4.2), two representative text generation tasks, and conduct comprehensive analyses on both.
  • The authors first verify the effectiveness of the method on the dialog generation task, which requires generating a coherent and meaningful response given a conversation history.
  • The dataset consists of conversations between crowdworkers who were randomly paired and asked to act the part of a given persona and chat naturally.
  • There are around 160,000 utterances in around 11,000 dialogues, with 2,000 dialogues reserved for validation and test, which use non-overlapping personas.
Results
  • Compared to MMI with a fixed backward model φ, the performance of the regularly trained MMI backward model drops, and its output text is more relevant to the ground truth when it is fed synthetic data rather than real data
  • This means that such a training scheme misleads the backward model into giving lower rewards to generated targets that look more like real ones, which has a negative effect on the optimization of the forward model.
Conclusion
  • The authors introduced Adversarial Mutual Information (AMI), a novel text generation framework that addresses a minimax game to iteratively learn and optimize the mutual information between the source and target text.
  • The forward network in the framework is trained by playing against the backward network that aims to reconstruct the source text only if its input is in the real target distribution.
  • The experimental results on two popular text generation tasks demonstrated the effectiveness of the framework, and the authors show that the method has the potential to yield a tighter lower bound for the MMI problem.
  • In future work, the authors will explore a lower-variance, less biased gradient estimator for the text generator in this framework and apply AMI to multi-modal settings (a rough sketch of one AMI training round follows this list)
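The adversarial objective described in these conclusions can be sketched roughly as a single training round in which the backward network is pushed to reconstruct the source well from real targets but poorly from synthetic ones, while the forward network is rewarded when the backward network can reconstruct the source from its samples. The method names (`sample`, `sample_with_log_prob`, `nll`) and the exact loss weighting below are illustrative assumptions, not the authors' implementation.

    import torch

    def ami_round(forward_net, backward_net, opt_fwd, opt_bwd, src, tgt):
        """One hypothetical AMI-style minimax round (sketch only)."""
        # --- backward network q_phi(x | y): adversarial reconstruction ---
        y_fake = forward_net.sample(src).detach()        # synthetic targets
        loss_bwd = backward_net.nll(src, given=tgt) - backward_net.nll(src, given=y_fake)
        opt_bwd.zero_grad()
        loss_bwd.backward()
        opt_bwd.step()

        # --- forward network p_theta(y | x): MLE plus a policy-gradient reward ---
        y_sample, log_prob = forward_net.sample_with_log_prob(src)
        reward = -backward_net.nll(src, given=y_sample).detach()  # reconstruction reward
        loss_fwd = forward_net.nll(tgt, given=src) - (reward * log_prob).mean()
        opt_fwd.zero_grad()
        loss_fwd.backward()
        opt_fwd.step()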
Tables
  • Table 1: Quantitative evaluation for dialog generation on the PersonaChat dataset. “LNS” denotes latent noise sampling. The top part presents the baselines; the middle and bottom parts show the performance and ablation results of our AMI framework based on the LSTM and the Transformer. Our AMI significantly improves both the LSTM and the Transformer and achieves state-of-the-art results
  • Table 2: Embedding average for the output of the backward network, where the input utterance is either (1) generated by the forward network or (2) the human-created utterance from the dataset (a sketch of this metric follows the table list)
  • Table 3: Human evaluation results for dialog generation. “TF” means the Transformer; “Human” means the original human-created dialog utterances in the dataset. Both “Wins” and “Losses” refer to the models on the left
  • Table 4: Examples of dialog generation. “Human” denotes the original response in the dataset, and all the generation models represent the role [A]
  • Table 5: Machine translation BLEU scores on WMT English-German (newstest2014)
  • Table 6: An example of German-to-English machine translation. “Human” denotes the original sentence in the dataset
  • Table 7: Two examples of dialog generation from the PersonaChat dataset. “Human” denotes the original response in the dataset, and all the generation models represent the role [A]
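The embedding-average score reported in Table 2 is, presumably, the standard sentence-similarity metric in the style of Rus & Lintean (2012): average each sentence's word vectors and take the cosine similarity. A minimal sketch, assuming a pre-trained word-vector dictionary `word_vecs` and a placeholder dimensionality:

    import numpy as np

    def embedding_average(sentence, word_vecs, dim=300):
        """Mean word vector over all in-vocabulary tokens (zero vector if none)."""
        vecs = [word_vecs[w] for w in sentence.split() if w in word_vecs]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def embedding_average_score(hyp, ref, word_vecs):
        """Cosine similarity between the averaged embeddings of two sentences."""
        a, b = embedding_average(hyp, word_vecs), embedding_average(ref, word_vecs)
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom > 0 else 0.0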
Related work
  • Estimating mutual information (Bahl et al., 1986; Brown, 1987) has been comprehensively studied for many tasks such as Bayesian optimal experimental design (Ryan et al., 2016; Foster et al., 2018), image caption retrieval (Mao et al., 2015), and neural network explanation (Tishby et al., 2000; Tishby & Zaslavsky, 2015; Gabrié et al., 2018). However, adapting MMI to sequence modeling such as text generation is empirically nontrivial, as we typically have access to discrete samples but not the underlying distributions (Poole et al., 2019). Li et al. (2016a) proposed using MMI as the objective function to address the issue of output diversity in the neural generation framework; however, they use the MI-promoting objective only at test time, while training remains standard MLE. Li et al. (2016b) addressed this problem by using deep reinforcement learning with mutual information as the future reward. Zhang et al. (2018b) learned a dual objective that simultaneously optimizes the mutual information of the forward and backward models. Ye et al. (2019) proposed dual information maximization to jointly model the dual information of two tasks. However, they optimize the backward model in the same direction as the forward model, which limits how closely it can approach the true posterior distribution and thus results in an unreliable reward for the forward model (the variational bound below makes this dependence explicit).
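For context, the variational information maximization bound underlying this line of work (Barber & Agakov, 2003) can be written, in generic notation not copied from the paper, as

    I(X;Y) = H(X) - H(X \mid Y) \;\ge\; H(X) + \mathbb{E}_{p(x,y)}\left[\log q_\phi(x \mid y)\right],

with equality exactly when the backward (variational) network q_\phi(x \mid y) matches the true posterior p(x \mid y); this is why the quality of the backward model governs the tightness of the bound.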
Funding
  • This work was supported in part by the National Key Research and Development Program of China (Grant No. 2018AAA0101400), in part by the National Natural Science Foundation of China (Grant No. 61936006), in part by the Alibaba-Zhejiang University Joint Institute of Frontier Technologies, in part by the China Scholarship Council, and in part by the U.S. Department of Energy through Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344
Reference
  • Akoury, N., Krishna, K., and Iyyer, M. Syntactically supervised transformers for faster neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1269–1281, 2019.
  • Arjovsky, M. and Bottou, L. Towards principled methods for training generative adversarial networks. In 5th International Conference on Learning Representations, 2017.
  • Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International conference on machine learning, pp. 214–223, 2017.
  • Artetxe, M., Labaka, G., Agirre, E., and Cho, K. Unsupervised neural machine translation. In International Conference on Learning Representations, 2018.
  • Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • Bahl, L., Brown, P., De Souza, P., and Mercer, R. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In ICASSP '86 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 11, pp. 49–52. IEEE, 1986.
  • Barber, D. and Agakov, F. V. The IM algorithm: a variational approach to information maximization. In Advances in Neural Information Processing Systems, 2003.
  • Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and Bengio, S. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 10–21, 2016.
  • Brown, P. F. The acoustic-modeling problem in automatic speech recognition. Technical report, Carnegie Mellon University, Department of Computer Science, Pittsburgh, PA, 1987.
  • Che, T., Li, Y., Zhang, R., Hjelm, R. D., Li, W., Song, Y., and Bengio, Y. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983, 2017.
  • Chen, L., Dai, S., Tao, C., Zhang, H., Gan, Z., Shen, D., Zhang, Y., Wang, G., Zhang, R., and Carin, L. Adversarial text generation via feature-mover’s distance. In Advances in Neural Information Processing Systems, pp. 4666–4677, 2018.
  • Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180, 2016.
  • Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1724–1734, 2014.
  • Foster, A., Jankowiak, M., Bingham, E., Teh, Y. W., Rainforth, T., and Goodman, N. Variational optimal experiment design: Efficient automation of adaptive experiments. NeurIPS Bayesian Deep Learning Workshop, 2018.
  • Gabrie, M., Manoel, A., Luneau, C., Macris, N., Krzakala, F., Zdeborova, L., et al. Entropy and mutual information in models of deep neural networks. In Advances in Neural Information Processing Systems, pp. 1821–1831, 2018.
  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
  • Guo, J., Lu, S., Cai, H., Zhang, W., Yu, Y., and Wang, J. Long text generation via adversarial training with leaked information. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • He, D., Xia, Y., Qin, T., Wang, L., Yu, N., Liu, T.-Y., and Ma, W.-Y. Dual learning for machine translation. In Advances in neural information processing systems, pp. 820–828, 2016.
  • Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma, D. P. and Welling, M. Auto-encoding variational bayes. ICLR, 2014.
  • Klein, G., Kim, Y., Deng, Y., Senellart, J., and Rush, A. M. Opennmt: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810, 2017.
  • Li, J., Galley, M., Brockett, C., Gao, J., and Dolan, B. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119, 2016a.
  • Li, J., Monroe, W., Ritter, A., Jurafsky, D., Galley, M., and Gao, J. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1192–1202, 2016b.
  • Li, J., Monroe, W., Shi, T., Jean, S., Ritter, A., and Jurafsky, D. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2157–2169, 2017.
  • Lin, K., Li, D., He, X., Zhang, Z., and Sun, M.-T. Adversarial ranking for language generation. In Advances in Neural Information Processing Systems, pp. 3155–3165, 2017.
  • Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., and Yuille, A. Deep captioning with multimodal recurrent neural networks (m-rnn). ICLR, 2015.
  • Pan, B., Yang, Y., Li, H., Zhao, Z., Zhuang, Y., Cai, D., and He, X. Macnet: Transferring knowledge from machine comprehension to sequence-to-sequence models. In Advances in Neural Information Processing Systems, pp. 6092–6102, 2018.
  • Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Association for Computational Linguistics, 2002.
  • Park, Y., Cho, J., and Kim, G. A hierarchical latent structure for variational conversation modeling. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1792–1801, 2018.
  • Paulus, R., Xiong, C., and Socher, R. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations, 2018.
  • Poole, B., Ozair, S., Van Den Oord, A., Alemi, A., and Tucker, G. On variational bounds of mutual information. In International Conference on Machine Learning, pp. 5171–5180, 2019.
  • Rus, V. and Lintean, M. A comparison of greedy and optimal assessment of natural language student input using word-to-word similarity metrics. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pp. 157–162, 2012.
  • Ryan, E. G., Drovandi, C. C., McGree, J. M., and Pettitt, A. N. A review of modern computational algorithms for bayesian optimal design. International Statistical Review, 84(1):128–154, 2016.
  • See, A., Liu, P. J., and Manning, C. D. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1073–1083, 2017.
  • Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725, 2016.
  • Serban, I. V., Sordoni, A., Bengio, Y., Courville, A., and Pineau, J. Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • Serban, I. V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A., and Bengio, Y. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
  • Tishby, N. and Zaslavsky, N. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1–5. IEEE, 2015.
  • Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. arXiv preprint physics/0004057, 2000.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
  • Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
  • Wu, C.-S., Socher, R., and Xiong, C. Global-to-local memory pointer networks for task-oriented dialogue. In International Conference on Learning Representations, 2019.
  • Ye, H., Li, W., and Wang, L. Jointly learning semantic parser and natural language generator via dual information maximization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2090–2101, 2019.
  • Yi, Z., Zhang, H., Tan, P., and Gong, M. Dualgan: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE international conference on computer vision, pp. 2849–2857, 2017.
  • Yu, L., Zhang, W., Wang, J., and Yu, Y. SeqGAN: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Zeng, M., Wang, Y., and Luo, Y. Dirichlet latent variable hierarchical recurrent encoder-decoder in dialogue generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pp. 1267–1272, 2019.
  • Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., and Weston, J. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 2204–2213, 2018a.
  • Zhang, Y., Gan, Z., Fan, K., Chen, Z., Henao, R., Shen, D., and Carin, L. Adversarial feature matching for text generation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 4006– 4015, 2017.
  • Zhang, Y., Galley, M., Gao, J., Gan, Z., Li, X., Brockett, C., and Dolan, B. Generating informative and diverse conversational responses via adversarial information maximization. In Advances in Neural Information Processing Systems, pp. 1815–1825, 2018b.
  • Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232, 2017.
  • where Π(Pr, Pθ) is the set of all joint distributions γ(x, y) whose marginals are respectively Pr and Pθ. Intuitively, γ(x, y) indicates how much mass must be transported from x to y in order to transform the distribution Pr into the distribution Pθ. The Wasserstein distance is then the “cost” of the optimal transport plan, which is also called the Earth-Mover distance. The Wasserstein distance is proven to be much weaker than many other common distances (e.g., the JS distance), so simple sequences of probability distributions are more likely to converge under this distance (Arjovsky et al., 2017). In this paper, we prove that our proposed objective function is equivalent to minimizing the Wasserstein distance between the synthetic data distribution and the real data distribution.
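The definition referred to above is the standard first Wasserstein (Earth-Mover) distance; in the usual notation (not copied verbatim from the paper),

    W(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x, y) \sim \gamma}\left[\, \lVert x - y \rVert \,\right],

where the infimum ranges over the set Π(Pr, Pθ) of couplings described in the preceding passage.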
  • We implement our models based on the OpenNMT framework (Klein et al., 2017).
  • When training our NMT systems, we split the data into subword units using BPE (Sennrich et al., 2016). We train 4-layer LSTMs of 1024 units with a bidirectional encoder, and the embedding dimension is 1024. The model is trained with …
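As a rough illustration of the stated setup (4-layer bidirectional LSTM encoder, 1024 units, 1024-dimensional embeddings), the following PyTorch sketch shows one possible configuration; the vocabulary size, dropout rate, and the split of units across directions are assumptions, not details from the paper or its OpenNMT configuration.

    import torch.nn as nn

    class NMTEncoder(nn.Module):
        """Illustrative 4-layer bidirectional LSTM encoder for BPE subword inputs."""
        def __init__(self, vocab_size=32000, emb_dim=1024, hidden=1024, layers=4, dropout=0.3):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            # hidden // 2 per direction so the concatenated state is `hidden`-dimensional
            # (whether 1024 is per direction or in total is an assumption here).
            self.lstm = nn.LSTM(emb_dim, hidden // 2, num_layers=layers,
                                bidirectional=True, dropout=dropout, batch_first=True)

        def forward(self, src_tokens):
            # src_tokens: (batch, src_len) integer ids of BPE subword units
            return self.lstm(self.embed(src_tokens))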