DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation

ACL, pp. 270-278, 2020.

Keywords:
mutual information, real world, pre-training, conversational system, training pipeline (14+ more)

Abstract:

We present a large, tunable neural conversational response generation model, DialoGPT (dialogue generative pre-trained transformer). Trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning from 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain a performance…

Introduction
  • The authors introduce DIALOGPT, a tunable gigaword-scale neural network model for generation of conversational responses, trained on Reddit data.

    Recent advances in large-scale pre-training using transformer-based architectures (Radford et al., 2018; Devlin et al., 2019; Raffel et al., 2019) have achieved great empirical success.
  • OpenAI's GPT-2 (Radford et al., 2018), for example, has demonstrated that transformer models trained on very large datasets can capture long-term dependencies in textual data and generate text that is fluent, lexically diverse, and rich in content.
  • Human conversations are generally more informal and noisy, and, when in the form of textual chat, often contain informal abbreviations or syntactic/lexical errors.
Highlights
  • We introduce DIALOGPT, a tunable gigaword-scale neural network model for generation of conversational responses, trained on Reddit data.
  • OpenAI's GPT-2 (Radford et al., 2018), for example, has demonstrated that transformer models trained on very large datasets can capture long-term dependencies in textual data and generate text that is fluent, lexically diverse, and rich in content. Such models have the capacity to capture textual data with fine granularity and produce high-resolution output that closely emulates real-world text written by humans.
  • Neural response generation is a subcategory of text generation that shares the objective of generating natural-looking text that is relevant to the prompt.
  • The package consists of a distributed training pipeline and several pre-trained models that can be fine-tuned to obtain a conversation model on a moderately-sized customized dataset in a few hours (a loading-and-generation sketch follows this list).
  • We will investigate leveraging reinforcement learning to further improve the relevance of the generated responses and prevent the model from generating egregious responses
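    The released checkpoints are distributed through the Hugging Face model hub. Below is a minimal sketch of loading a released checkpoint and generating responses interactively. It uses the current transformers AutoModel API, which postdates the paper's "PyTorch transformer repository"; the checkpoint name microsoft/DialoGPT-medium is the released 345M model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

history = None  # running token ids of the conversation so far
for _ in range(3):  # a short interactive multi-turn chat
    user_text = input(">> User: ")
    # Each turn ends with the end-of-text token, mirroring how Reddit
    # comment chains were flattened into long texts for training.
    new_ids = tokenizer.encode(user_text + tokenizer.eos_token,
                               return_tensors="pt")
    input_ids = new_ids if history is None else torch.cat([history, new_ids], dim=-1)
    history = model.generate(input_ids, max_length=1000,
                             pad_token_id=tokenizer.eos_token_id)
    reply_ids = history[0, input_ids.shape[-1]:]
    print("DialoGPT:", tokenizer.decode(reply_ids, skip_special_tokens=True))
```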
Methods
  • 3.1 Model Architecture

    The authors trained the DIALOGPT model on the basis of the GPT-2 (Radford et al., 2018) architecture. The GPT-2 transformer model adopts the generic transformer language model (Vaswani et al., 2017) and leverages a stack of masked multi-head self-attention layers to train on massive web-text data.
  • The authors follow OpenAI GPT-2 in modeling a multi-turn dialogue session as a long text and framing the generation task as language modeling (a sketch of this framing follows this list).
  • Table 3 reports the 6K Reddit multi-reference evaluation, comparing DIALOGPT (345M) and DIALOGPT (345M, beam search) against human responses on NIST (N-2, N-4), BLEU (B-2, B-4), METEOR, Entropy (E-4), and Dist (D-1, D-2); a sketch of the diversity metrics also follows this list.
  • The authors' observations suggest that the system is able to handle multi-turn generation better than an RNN counterpart and tends to be more consistent with respect to context (Table 5). Example source prompts include: who is the first president of the United States? what is the boiling point of water? which one is bigger, sun or moon? which animal has black and white stripes?
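    To make the "dialogue session as a long text" framing concrete, here is a minimal sketch of the training objective under that framing: turns are joined by the end-of-text token, and the model is trained with the standard causal language-modeling loss. The gpt2 base checkpoint and the example session are illustrative; this is not the released training pipeline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One Reddit-style session, oldest turn first.
session = [
    "Does money buy happiness?",
    "Depends how much money you spend on it.",
    "What is the best way to buy happiness?",
]

# Flatten to: x_1 <|endoftext|> x_2 <|endoftext|> ... x_N <|endoftext|>
flat = tokenizer.eos_token.join(session) + tokenizer.eos_token
enc = tokenizer(flat, return_tensors="pt")

# With labels == input_ids, the model shifts targets internally and
# returns the mean negative log-likelihood over the whole session.
out = model(**enc, labels=enc["input_ids"])
out.loss.backward()  # one gradient step of the language-modeling objective
print(float(out.loss))
```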
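    Table 3's Entropy and Dist columns are corpus-level diversity statistics. The sketch below computes Dist-n (distinct n-grams divided by total n-grams, per Li et al., 2016a) and the frequency-weighted n-gram entropy of Zhang et al. (2018); whitespace tokenization and the sample responses are illustrative assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct_n(responses, n):
    # Dist-n: number of distinct n-grams / total n-grams across responses.
    counts = Counter(g for r in responses for g in ngrams(r.split(), n))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0

def entropy_n(responses, n):
    # Entropy: Shannon entropy of the empirical n-gram distribution.
    counts = Counter(g for r in responses for g in ngrams(r.split(), n))
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

hyps = ["i do not know", "i do not think so", "that is a great question"]
print(distinct_n(hyps, 1), distinct_n(hyps, 2), entropy_n(hyps, 4))
```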
Conclusion
  • The authors have released an open-domain pre-trained model, DIALOGPT, trained on a massive real-world Reddit dataset.
  • DIALOGPT is fully open-sourced and easy to deploy, allowing users to extend the pre-trained conversational system to bootstrap training on various datasets.
  • Detection and control of toxic output will be a major focus of future investigation.
  • The authors will investigate leveraging reinforcement learning to further improve the relevance of the generated responses and prevent the model from generating egregious responses (a schematic sketch follows).
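    As a rough illustration of that future-work direction, the sketch below applies a REINFORCE-style update (Williams, 1992): sample a response, score it with a reward, and scale the response's log-likelihood gradient by that reward. The reward_fn is a hypothetical placeholder (e.g., a learned relevance or non-egregiousness classifier); nothing here comes from the released pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def reward_fn(source: str, response: str) -> float:
    """Hypothetical scorer; plug in a learned relevance classifier here."""
    return 1.0  # placeholder

source = "which animal has black and white stripes?" + tokenizer.eos_token
src_ids = tokenizer.encode(source, return_tensors="pt")

# Sample a response (generate() runs without gradients).
sampled = model.generate(src_ids, do_sample=True, top_k=10, max_length=60,
                         pad_token_id=tokenizer.eos_token_id)
response = tokenizer.decode(sampled[0, src_ids.shape[-1]:],
                            skip_special_tokens=True)

# Re-score the sampled tokens with gradients, masking source positions so
# only response tokens contribute to log p(response | source).
labels = sampled.clone()
labels[:, :src_ids.shape[-1]] = -100
out = model(sampled, labels=labels)
log_prob = -out.loss * (labels != -100).sum()  # sum of response log-probs

# REINFORCE: raise the log-probability of high-reward samples.
optimizer.zero_grad()
(-reward_fn(source, response) * log_prob).backward()
optimizer.step()
```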
Tables
  • Table1: Model configurations. “B” denotes batch size per GPU
  • Table2: DSTC evaluation. “Team B” is the winner system of the DSTC-7 challenge. “Beam” denotes beam search. “Human” represents the held-out ground truth reference
  • Table3: 6K Reddit multi-reference evaluation
  • Table4: Addressing commonsense questions
  • Table5: An interactive example of multi-turn dialogue
  • Table6: An example of multi-turn self-playing dialogue with user prompt. Human judges rated outputs for relevance, informativeness and how human-like the generation is, using a 3-point Likert-like scale. Judges were required to pass a qualification test, and a regime of spam detection was imposed. Overall judge preferences for relevance, informativeness and human-likeness, presented as raw numbers and a percentage of the total, are shown in Table 7. A strong preference can be observed for DialoGPT over PersonalityChat
  • Table7: Results of Human Evaluation for relevance, informativeness and human-response possibility, showing preferences (%) for our model (DialoGPT) vis-à-vis its variants and real human responses. Distributions skew towards DialoGPT with MMI (a sketch of MMI reranking follows this list), even when compared with human outputs. Numbers in bold indicate the preferred systems. Statistically significant results are indicated: * p ≤ 0.01, ** p ≤ 0.001, *** p ≤ 0.0001, **** p ≤ 0.00001
  • Table8: Human evaluation significance test. Bold results represent differences that are NOT statistically significant. Notation: 1 - Human response; 2 - DIALOGPT 345M; 3 - PersonalityChat; 4 - DIALOGPT 345M w/ MMI; 5 - DIALOGPT 345M Beam search; 6 - DIALOGPT 762M
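    Tables 7 and 8 include a "w/ MMI" variant: candidate responses are generated with top-K sampling and then reranked by the backward probability P(source | response) under a reverse model. The sketch below follows that recipe under stated assumptions: reusing the forward checkpoint as the backward scorer is a stand-in (the paper trains a dedicated reverse model, not released under this name), and the hyperparameters are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
forward_model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
backward_model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")  # stand-in

def log_prob(model, context_ids, target_ids):
    """Sum of log P(target | context) under `model`."""
    ids = torch.cat([context_ids, target_ids], dim=-1)
    labels = ids.clone()
    labels[:, :context_ids.shape[-1]] = -100  # score only the target span
    with torch.no_grad():
        out = model(ids, labels=labels)
    return -out.loss.item() * target_ids.shape[-1]

source = "which one is bigger, sun or moon?" + tokenizer.eos_token
src = tokenizer.encode(source, return_tensors="pt")

# Sample 16 hypotheses with top-K sampling, as in the paper.
hyps = forward_model.generate(src, do_sample=True, top_k=10,
                              num_return_sequences=16, max_length=60,
                              pad_token_id=tokenizer.eos_token_id)

# Rerank by the backward score P(source | response); trailing eos padding
# is included in the score here for simplicity.
best, best_score = None, float("-inf")
for h in hyps:
    resp = h[src.shape[-1]:].unsqueeze(0)
    score = log_prob(backward_model, resp, src)
    if score > best_score:
        best, best_score = resp, score
print(tokenizer.decode(best[0], skip_special_tokens=True))
```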
Related work
  • There are several open-sourced toolkits for large-scale pre-trained transformer models. The Hugging Face Conv-AI transfer-learning repository (Wolf et al., 2019) contains code for training conversational AI systems with transfer learning based on the GPT-2 transformer language model, and achieves state-of-the-art performance in the ConvAI-2 dialogue competition. DLGnet (Olabiyi and Mueller, 2019) is a large transformer model trained on dialogue data that achieves good performance in multi-turn dialogue generation. AllenNLP (Gardner et al., 2018) is a toolkit for many natural language processing tasks, including the large-scale pre-trained bi-LSTM sentence-representation framework ELMo (Peters et al., 2018). Texar (Hu et al., 2018) focuses on text generation, including style transfer and controllable generation, and offers reinforcement-learning capabilities along with its sequence-modeling tools. DeepPavlov (Burtsev et al., 2018) is a popular framework focusing on task-oriented dialogue; its public repository contains several demos and pre-trained models for question answering and sentiment classification. Icecaps (Shiv et al., 2019) is a response-generation toolkit with techniques such as grounding on personalities or external knowledge and multi-task training. The ConvAI2 challenge (Dinan et al., 2019) focuses on personalized conversations. ParlAI (Miller et al., 2017) is another library for developing task-oriented dialogue systems; it contains pre-trained models for a knowledge-grounded chatbot trained with crowdsourced data. The Text-to-Text Transformer (Raffel et al., 2019) unifies multiple text-modeling tasks and achieves state-of-the-art results on various natural language generation and understanding benchmarks.
References
  • M. Burtsev, A. Seliverstov, R. Airapetyan, M. Arkhipov, D. Baymurzina, N. Bushkov, O. Gureenkova, T. Khakhulin, Y. Kuratov, D. Kuznetsov, A. Litinsky, V. Logacheva, A. Lymar, V. Malykh, M. Petrov, V. Polulyakh, L. Pugachev, A. Sorokin, M. Vikhreva, and M. Zaynutdinov. 2018. DeepPavlov: Open-source library for dialogue systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics-System Demonstrations.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL 2019.
  • E. Dinan, V. Logacheva, V. Malykh, A. Miller, K. Shuster, J. Urbanek, D. Kiela, A. Szlam, I. Serban, R. Lowe, S. Prabhumoye, A. W. Black, A. Rudnicky, J. Williams, J. Pineau, M. Burtsev, and J. Weston. 2019. The second conversational intelligence challenge (ConvAI2).
  • George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram cooccurrence statistics. In Proceedings of the second international conference on Human Language Technology Research. Morgan Kaufmann Publishers Inc.
  • Michel Galley, Chris Brockett, Xiang Gao, Jianfeng Gao, and Bill Dolan. 2019. Grounded response generation task at DSTC7. In AAAI Dialog System Technology Challenges Workshop.
  • J. Gao, M. Galley, and L. Li. 2019a. Neural approaches to conversational AI. Foundations and Trends in Information Retrieval.
  • Xiang Gao, Sungjin Lee, Yizhe Zhang, Chris Brockett, Michel Galley, Jianfeng Gao, and Bill Dolan. 2019b. Jointly optimizing diversity and relevance in neural response generation. NAACL-HLT 2019.
  • Xiang Gao, Yizhe Zhang, Sungjin Lee, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2019c. Structuring latent spaces for stylized response generation. EMNLP-IJCNLP.
  • M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. S. Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software.
  • Prakhar Gupta, Shikib Mehri, Tiancheng Zhao, Amy Pavel, Maxine Eskenazi, and Jeffrey P Bigham. 2019. Investigating evaluation of open-domain dialogue systems with human generated multiple references. arXiv preprint arXiv:1907.10568.
  • Z. Hu, H. Shi, Z. Yang, B. Tan, T. Zhao, J. He, W. Wang, L. Qin, D. Wang, et al. 2018. Texar: A modularized, versatile, and extensible toolkit for text generation. ACL.
  • HuggingFace. 2019. PyTorch transformer repository. https://github.com/huggingface/pytorch-transformers.
  • Alon Lavie and Abhaya Agarwal. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 228–231. Association for Computational Linguistics.
  • Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models. NAACL.
  • Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. 2016b. A persona-based neural conversation model. ACL.
  • A. H. Miller, W. Feng, A. Fisch, J. Lu, D. Batra, A. Bordes, D. Parikh, and J. Weston. 2017. ParlAI: A dialog research software platform. In Proceedings of the 2017 EMNLP System Demonstration.
  • Oluwatobi Olabiyi and Erik T Mueller. 2019. Multi-turn dialogue response generation with autoregressive transformer models. arXiv preprint arXiv:1908.01841.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. ACL.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. 2018. Deep contextualized word representations. NAACL.
  • Lianhui Qin, Michel Galley, Chris Brockett, Xiaodong Liu, Xiang Gao, Bill Dolan, Yejin Choi, and Jianfeng Gao. 2019. Conversing by reading: Contentful neural conversation with on-demand machine reading. ACL.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. 2018. Language models are unsupervised multitask learners. Technical report, OpenAI.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint:1910.10683.
  • R. Sennrich, B. Haddow, and A. Birch. 2016. Neural machine translation of rare words with subword units. ACL.
  • Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. AAAI.
  • Vighnesh Leonardo Shiv, Chris Quirk, Anshuman Suri, Xiang Gao, Khuram Shahid, Nithya Govindarajan, Yizhe Zhang, Jianfeng Gao, Michel Galley, Chris Brockett, et al. 2019. Microsoft icecaps: An opensource toolkit for conversation modeling. ACL.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. NeurIPS.
  • Ronald J Williams. 1992. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine learning.
  • Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A transfer learning approach for neural network based conversational agents. CoRR, abs/1901.08149.
  • Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, and Bill Dolan. 2018. Generating informative and diverse conversational responses via adversarial information maximization. NeurIPS.
  • Yizhe Zhang, Xiang Gao, Sungjin Lee, Chris Brockett, Michel Galley, Jianfeng Gao, and Bill Dolan. 2019. Consistent dialogue generation with self-supervised feature learning. arXiv preprint:1903.05759.