Text-to-Text Pre-Training for Data-to-Text Tasks

Mihir Kale
Keywords:
text task, fine tune, language model, state-of-the-art result, automatic metric

Abstract:

We study the pre-train + fine-tune strategy for data-to-text tasks. Fine-tuning T5 achieves state-of-the-art results on the WebNLG, MultiWoz and ToTTo benchmarks. Moreover, the models are fully end-to-end and do not rely on any intermediate planning steps, delexicalization or copy mechanisms. T5 pre-training also enables stronger generalization to out-of-domain inputs.

Introduction
  • Natural language generation from structured data, or data-to-text (Kukich, 1983; McKeown, 1985), is the task of generating a textual description conditioned on source content provided in the form of structured data such as a table or graph.
  • In this work the authors study the applicability of large-scale transfer learning to this task.
  • The authors use the term "pre-train + fine-tune" to refer to the paradigm of first pre-training a high-capacity model on massive text corpora before fine-tuning it on a downstream task.
  • The authors' study shows that this form of transfer learning, which is ubiquitous in many areas of NLP (Devlin et al., 2018), works well for text generation from structured data as well.
  • The authors focus on pre-training in the form of the "Text-to-Text Transfer Transformer" (T5) models released by Raffel et al. (2019); a sketch of how structured data can be cast as a text-to-text input follows below.
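  • As an illustration of the text-to-text framing, the sketch below flattens WebNLG-style RDF triples into a single source string that a model like T5 can consume. The separator tokens, field order and example strings are assumptions for illustration, not the authors' exact input format.

    # Hypothetical linearization of WebNLG-style RDF triples into a flat string
    # for a text-to-text model such as T5. Separators and ordering are
    # illustrative assumptions, not the paper's exact scheme.
    from typing import List, Tuple

    Triple = Tuple[str, str, str]  # (subject, predicate, object)

    def linearize_triples(triples: List[Triple]) -> str:
        """Flatten a set of RDF triples into one source string."""
        parts = [f"<subject> {s} <predicate> {p} <object> {o}" for s, p, o in triples]
        return " ".join(parts)

    source = linearize_triples([("Aarhus_Airport", "cityServed", "Aarhus, Denmark")])
    # source: "<subject> Aarhus_Airport <predicate> cityServed <object> Aarhus, Denmark"
    # target (reference): "Aarhus Airport serves the city of Aarhus, Denmark."
    print(source)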
Highlights
  • Natural language generation from structured data, or data-to-text (Kukich, 1983; McKeown, 1985), is the task of generating a textual description conditioned on source content provided in the form of structured data such as a table or graph
  • Our study shows that this form of transfer learning, which is ubiquitous in many areas of NLP (Devlin et al., 2018), works well for text generation from structured data as well
  • In this study we evaluated pre-training in the form of T5 for the data-to-text task
  • We found that it leads to state-of-the-art results, while greatly improving robustness to out-of-domain inputs
  • Though we focused on automatic metrics, corroborating our findings via human evaluation is an important step
Methods
  • The T5 vocabulary consists of 32,000 SentencePieces.
  • Following Raffel et al. (2019), models are fine-tuned with a constant learning rate of 0.001.
  • The best checkpoint is chosen based on the BLEU score on the development set.
  • Decoding is done via greedy search.
  • The authors compute BLEU (Papineni et al., 2002) scores using sacrebleu (Post, 2018).
  • For each dataset the authors rely on the metrics used by prior work; a sketch of this decoding and scoring setup follows below.
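  • To make this recipe concrete, the sketch below shows greedy decoding with a T5 checkpoint and corpus BLEU computed with sacrebleu. It assumes the Hugging Face transformers API and uses a placeholder checkpoint name and example strings; the authors fine-tuned the released T5 checkpoints in the original codebase, so this is an illustrative approximation rather than their exact setup.

    # Sketch only: greedy decoding with a (fine-tuned) T5 checkpoint and BLEU
    # scoring via sacrebleu. Checkpoint name and example strings are placeholders.
    import sacrebleu
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    model_name = "t5-small"  # in practice, a checkpoint fine-tuned on the data-to-text task
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    def generate(source: str) -> str:
        inputs = tokenizer(source, return_tensors="pt", truncation=True)
        # Greedy search: beam size 1, no sampling, matching the decoding setup above.
        output_ids = model.generate(**inputs, max_length=128, num_beams=1, do_sample=False)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    sources = ["<subject> Aarhus_Airport <predicate> cityServed <object> Aarhus, Denmark"]
    references = ["Aarhus Airport serves the city of Aarhus, Denmark."]
    hypotheses = [generate(s) for s in sources]

    # corpus_bleu takes a list of hypotheses and a list of reference streams.
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"BLEU: {bleu.score:.2f}")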
Results
  • Results and Discussion

    7.1 WebNLG

    The evaluation is done using BLEU and METEOR (Lavie and Agarwal, 2007), as in Ferreira et al. (2019).
  • T5-Large performs the best on both BLEU and METEOR.
  • It improves over PlanEnc by 4.3 BLEU on the overall test set.
  • It displays excellent generalization to new domains and relations, with a 14 BLEU improvement on the unseen test set.
  • On the Unseen test set, T5-Small scores substantially lower, indicating that pre-training with large capacity models is required for out-of-domain generalization.
  • On ToTTo, the model is more robust to out-of-domain tables, with larger improvements of 6.6 BLEU and 7.5 PARENT on the Non-Overlap test set.
  • On MultiWoz, while the SER scores are slightly worse, upon manual inspection the authors found that the difference can largely be attributed to false positives, arising from annotation inconsistencies in the dataset coupled with the exact-match constraint, which does not account for paraphrases (a sketch of such a check follows below).
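  • To illustrate the exact-match issue, the small hypothetical sketch below checks whether a slot value appears verbatim in a generated sentence. The function and example values are invented for illustration and are not the benchmark's actual SER implementation.

    # Hypothetical exact-match slot check showing how SER-style metrics can flag
    # paraphrases as errors (false positives). Not the actual MultiWoz SER code.
    def slot_missing(slot_value: str, generated: str) -> bool:
        """Return True if the slot value does not appear verbatim in the output."""
        return slot_value.lower() not in generated.lower()

    # The price-range value "cheap" is realized as the paraphrase "inexpensive",
    # so the exact-match check reports it as missing even though the meaning is conveyed.
    print(slot_missing("cheap", "The restaurant is inexpensive and in the city centre."))  # True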
Conclusion
  • In this study the authors evaluated pre-training in the form of T5 for the data-to-text task.
  • The authors found that it leads to state-of-the-art results, while greatly improving robustness to out-of-domain inputs.
  • Though the authors focused on automatic metrics, corroborating the findings via human evaluation is an important step.
  • The authors hope to design unsupervised pre-training objectives that are tailored for the data-to-text task.
Tables
  • Table1: Dataset sizes
  • Table2: Results on WebNLG. Metrics as reported in Zhao et al (2020)
  • Table3: Results on the ToTTo test set
  • Table4: Results on the ToTTo development set for different variants of T5
  • Table5: Results on Multiwoz
Related work
  • Transfer Learning: Devlin et al. (2018) and Howard and Ruder (2018) showed that unsupervised pre-training can greatly benefit tasks like text classification, question answering and summarization. In particular, Raffel et al. (2019) perform a large-scale study of different training objectives, model capacity and size of data. Peng et al. (2020) and Chen et al. (2019b) show that pre-training in the form of GPT-2 can indeed improve performance on data-to-text tasks as well. Our experiments show that pre-training with T5, where both encoder and decoder are trained using a span-masking objective (sketched below), performs significantly better than alternatives such as BERT (encoder-only) and GPT-2 (decoder-only). Some works have also studied pre-training via supervised objectives, such as machine translation (Siddhant et al., 2019; Kale and Roy, 2020) and reading comprehension (Khashabi et al., 2020).

    Data-to-Text: Early work on data-to-text focused on rule-based, pipelined methods, while recent work has adopted neural approaches. Wen et al. (2015) proposed the Semantically Conditioned LSTM and were among the first to show that neural networks can be successfully applied to this problem. Liu et al. (2018) generate text by conditioning language models on tables, Puduppully et al. (2019) explicitly model entities, and Marcheggiani and Perez-Beltrachini (2018) encode structured data using graph convolutional networks. Ferreira et al. (2019) find that neural pipelined approaches perform better than end-to-end models. This notion is echoed by Moryossef et al. (2019), who show the effectiveness of adding an explicit planning stage prior to generation.
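  • The sketch below is a simplified illustration of the span-masking (span-corruption) objective mentioned above: contiguous spans are replaced by sentinel tokens in the input, and the target reproduces the dropped spans, each preceded by its sentinel. The spans here are hand-picked for clarity, whereas T5 samples them randomly (roughly 15% of tokens, mean span length 3) and operates on SentencePiece ids rather than whitespace tokens.

    # Simplified sketch of T5-style span corruption on whitespace tokens.
    # Real pre-training samples spans randomly and works on subword ids.
    from typing import List, Tuple

    def span_corrupt(tokens: List[str], spans: List[Tuple[int, int]]) -> Tuple[str, str]:
        """Replace each (start, end) span with a sentinel in the input and emit
        the dropped spans, each preceded by its sentinel, as the target."""
        input_parts, target_parts = [], []
        last = 0
        for i, (start, end) in enumerate(spans):
            sentinel = f"<extra_id_{i}>"
            input_parts.extend(tokens[last:start])
            input_parts.append(sentinel)
            target_parts.append(sentinel)
            target_parts.extend(tokens[start:end])
            last = end
        input_parts.extend(tokens[last:])
        target_parts.append(f"<extra_id_{len(spans)}>")  # closing sentinel
        return " ".join(input_parts), " ".join(target_parts)

    tokens = "Thank you for inviting me to your party last week .".split()
    inp, tgt = span_corrupt(tokens, [(2, 4), (8, 9)])
    # inp: "Thank you <extra_id_0> me to your party <extra_id_1> week ."
    # tgt: "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
    print(inp)
    print(tgt)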
Reference
  • Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. Multiwoz-a largescale multi-domain wizard-of-oz dataset for taskoriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026.
  • Wenhu Chen, Jianshu Chen, Pengda Qin, Xifeng Yan, and William Yang Wang. 2019a. Semantically conditioned dialog response generation via hierarchical disentangled self-attention. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3696–3709.
  • Zhiyu Chen, Harini Eavani, Yinyin Liu, and William Yang Wang. 2019b. Few-shot nlg with pre-trained language model. arXiv preprint arXiv:1904.09521.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh, Ming-Wei Chang, Dipanjan Das, and William Cohen. 2019. Handling divergent reference texts when evaluating table-to-text generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4884–4895.
  • Bayu Distiawan, Jianzhong Qi, Rui Zhang, and Wei Wang. 2018. Gtr-lstm: A triple encoder for sentence generation from rdf data. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1627–1637.
  • Thiago Castro Ferreira, Chris van der Lee, Emiel van Miltenburg, and Emiel Krahmer. 2019. Neural datato-text generation: A comparison between pipeline and end-to-end architectures. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 552–562.
  • Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. The webnlg challenge: Generating text from rdf data. In Proceedings of the 10th International Conference on Natural Language Generation, pages 124–133.
  • Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339.
  • Mihir Kale and Scott Roy. 2020. Machine translation pre-training for data-to-text generation–a case study in czech. arXiv preprint arXiv:2004.02077.
  • Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. Unifiedqa: Crossing format boundaries with a single qa system. arXiv preprint arXiv:2005.00700.
  • Karen Kukich. 1983. Design of a knowledge-based report generator. In Proceedings of the 21st annual meeting on Association for Computational Linguistics, pages 145–150. Association for Computational Linguistics.
  • Alon Lavie and Abhaya Agarwal. 2007. Meteor: An automatic metric for mt evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 228–231. Association for Computational Linguistics.
  • Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
  • Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. 2018. Table-to-text generation by structure-aware seq2seq learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Diego Marcheggiani and Laura Perez-Beltrachini. 2018. Deep graph convolutional encoders for structured data to text generation. In Proceedings of the 11th International Conference on Natural Language Generation, pages 1–9.
  • Kathleen R McKeown. 1985. Text generation: using discourse strategies and focus constraints to generate natural language text.
  • Amit Moryossef, Yoav Goldberg, and Ido Dagan. 2019. Step-by-step: Separating planning from realization in neural data-to-text generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2267–2277.
  • Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
  • Ankur P Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. Totto: A controlled table-to-text generation dataset. arXiv preprint arXiv:2004.14373.
  • Baolin Peng, Chenguang Zhu, Chunyuan Li, Xiujun Li, Jinchao Li, Michael Zeng, and Jianfeng Gao. 2020. Few-shot natural language generation for task-oriented dialog. arXiv preprint arXiv:2002.12328.
  • Matt Post. 2018. A call for clarity in reporting bleu scores. arXiv preprint arXiv:1804.08771.
  • Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-text generation with content selection and planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6908–6915.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  • Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. 2019. Leveraging pre-trained checkpoints for sequence generation tasks. arXiv preprint arXiv:1907.12461.
  • Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointergenerator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073– 1083.
  • Aditya Siddhant, Melvin Johnson, Henry Tsai, Naveen Arivazhagan, Jason Riesa, Ankur Bapna, Orhan Firat, and Karthik Raman. 2019. Evaluating the cross-lingual effectiveness of massively multilingual neural machine translation. arXiv preprint arXiv:1909.00437.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, PeiHao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned lstm-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745.
  • Chao Zhao, Marilyn Walker, and Snigdha Chaturvedi. 2020. Bridging the structural gap between encoding and decoding for data-to-text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).