Pre-training for Abstractive Document Summarization by Reinstating Source Text

In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3646–3660, 2020.

Keywords:
model pre-training, abstractive document summarization, abstractive summarization, relative order, next sentence generation

Abstract:

Abstractive document summarization is usually modeled as a sequence-to-sequence (SEQ2SEQ) learning problem. Unfortunately, training large SEQ2SEQ based summarization models on limited supervised summarization data is challenging. This paper presents three sequence-to-sequence pre-training (in shorthand, STEP) objectives which allow us to pre-train a SEQ2SEQ based abstractive summarization model on unlabeled text.

Introduction
  • Automatic document summarization is the task of condensing a document into a shorter form while preserving its important content, which requires wide-coverage understanding of the whole document rather than of specific words or phrases.
  • This task is typically classified into two categories: extractive and abstractive document summarization.
  • As far as the authors know, relatively little prior work has studied this (i.e., pre-training) for abstractive summarization
Highlights
  • Automatic document summarization is the task of condensing a document into a shorter form while preserving its important content, which requires wide-coverage understanding of the whole document rather than of specific words or phrases
  • This task is typically classified into two categories: extractive and abstractive document summarization
  • Abstractive summarization (Nallapati et al, 2016; See et al, 2017; Paulus et al, 2018) rewrites the source text and generates a corresponding summary, which may contain novel words and phrases not featured in the input
  • The output summary is closely related to the input document
  • Summary sentences, paraphrased from the input by abstractive summarizers, might appear in a different relative order than their corresponding sentences in the source text
  • We proposed three sequence-to-sequence pre-training objectives: sentence reordering, next sentence generation, and masked document generation (a minimal construction sketch follows this list)
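
The following is a minimal sketch (not the authors' released code) of how (input, target) pairs for these three pre-training objectives might be constructed from an unlabeled document. The [MASK] token, the 15% masking rate, single-token masking, and the midpoint split used for next sentence generation are illustrative assumptions; the paper's exact construction may differ.

```python
import random

MASK = "[MASK]"  # placeholder mask token (assumption; the actual token may differ)

def sentence_reordering_pair(sentences):
    """SR: input is the shuffled document, target is the document in its original order."""
    shuffled = sentences[:]
    random.shuffle(shuffled)
    return " ".join(shuffled), " ".join(sentences)

def next_sentence_generation_pair(sentences):
    """NSG: input is one part of the document, target is the part that follows.
    Splitting at the midpoint is an illustrative choice."""
    split = max(1, len(sentences) // 2)
    return " ".join(sentences[:split]), " ".join(sentences[split:])

def masked_document_generation_pair(tokens, mask_rate=0.15):
    """MDG: input is the document with some tokens masked, target is the original document.
    The 15% rate and single-token masking are assumptions."""
    masked = [MASK if random.random() < mask_rate else tok for tok in tokens]
    return " ".join(masked), " ".join(tokens)

# Usage: each (source, target) pair is fed to the same SEQ2SEQ model during pre-training.
doc_sents = ["Sentence one .", "Sentence two .", "Sentence three .", "Sentence four ."]
print(sentence_reordering_pair(doc_sents))
print(next_sentence_generation_pair(doc_sents))
print(masked_document_generation_pair(" ".join(doc_sents).split()))
```

All three objectives share the same trait: the decoder must reinstate (part of) the original source text, which is what makes them a natural fit for later summarization fine-tuning.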
Methods
  • 3.1 Sequence-to-Sequence Learning

    In this work, the task of abstractive document summarization is modeled as a SEQ2SEQ learning problem.
  • Given the whole training set $(\mathcal{X}, \mathcal{Y})$, the model can be trained by maximizing the log-likelihood of the training document-summary pairs, $\mathcal{L}(\theta; \mathcal{X}, \mathcal{Y}) = \sum_{(X, Y) \in (\mathcal{X}, \mathcal{Y})} \log P(Y \mid X; \theta)$ (a minimal teacher-forcing training sketch in code follows this list).
  • The authors first pre-train the SEQ2SEQ Transformer model on the unlabeled text using the proposed pre-training objectives and fine-tune it on the document-summary dataset.
  • If the relative order of sentences in the summary differs from the relative order of their mapped sentences in the original document, the authors count this as one content reordering.
  • According to statistics on the training split of the summarization dataset, the content of the original documents is reordered in their summaries in approximately 40% of cases
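
As a concrete illustration of the maximum-likelihood training step above, here is a minimal teacher-forcing sketch in PyTorch. The tiny model dimensions, vocabulary size, padding id, and random toy batch are assumptions for illustration, not the paper's (much larger) Transformer configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D_MODEL, PAD_ID = 1000, 64, 0  # illustrative sizes, not the paper's settings

class TinySeq2Seq(nn.Module):
    """A toy Transformer encoder-decoder with an LM head over the vocabulary."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL, padding_idx=PAD_ID)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, dim_feedforward=128, batch_first=True
        )
        self.lm_head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, src, tgt_in):
        # Causal mask so decoder position t only attends to positions <= t (teacher forcing).
        causal = self.transformer.generate_square_subsequent_mask(tgt_in.size(1))
        hidden = self.transformer(self.embed(src), self.embed(tgt_in), tgt_mask=causal)
        return self.lm_head(hidden)  # (batch, tgt_len, vocab)

def nll_loss(model, src, tgt):
    """Negative log-likelihood of the target given the source, i.e. -log P(Y | X; theta)."""
    tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]          # shift for next-token prediction
    logits = model(src, tgt_in)
    return F.cross_entropy(
        logits.reshape(-1, VOCAB), tgt_out.reshape(-1), ignore_index=PAD_ID
    )

model = TinySeq2Seq()
src = torch.randint(1, VOCAB, (2, 20))   # a batch of 2 "documents"
tgt = torch.randint(1, VOCAB, (2, 8))    # their targets (summaries, or pre-training outputs)
loss = nll_loss(model, src, tgt)
loss.backward()  # maximizing the log-likelihood == minimizing this loss
```

The same loss is used in both stages; only the (source, target) pairs change between pre-training on unlabeled text and fine-tuning on document-summary pairs.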
Results
Conclusion
  • The authors proposed three sequence-to-sequence pre-training objectives: sentence reordering, next sentence generation, and masked document generation.
  • All of these objectives are closely related to the abstractive summarization task and are designed around reinstating the source text.
  • A SEQ2SEQ model for abstractive document summarization can be pre-trained using such objectives and fine-tuned on the summarization dataset.
  • The authors would like to investigate other objectives to pre-train SEQ2SEQ models for abstractive summarization
Summary
  • Introduction:

    Automatic document summarization is the task of condensing a document into a shorter form while preserving its important content, which requires wide-coverage understanding of the whole document rather than of specific words or phrases.
  • This task is typically classified into two categories: extractive and abstractive document summarization.
  • As far as the authors know, relatively little prior work has studied this (i.e., pre-training) for abstractive summarization
  • Objectives:

    Song et al (2019) tested their model on sentence-level tasks, while the authors aim to solve document-level tasks.
  • Given a document $X = (x_1, x_2, \ldots, x_{|X|})$ paired with its summary $Y = (y_1, y_2, \ldots, y_{|Y|})$, the authors aim to learn the model parameters $\theta$ and estimate the conditional probability $P(Y \mid X; \theta)$ (written out in full just below).
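
Written out with the notation above, the conditional probability factorizes auto-regressively and training maximizes the log-likelihood over the training set. This is the standard SEQ2SEQ maximum-likelihood formulation; the exact presentation in the paper may differ slightly.

```latex
P(Y \mid X; \theta) = \prod_{t=1}^{|Y|} P\left(y_t \mid y_{<t}, X; \theta\right),
\qquad
\mathcal{L}(\theta; \mathcal{X}, \mathcal{Y})
  = \sum_{(X, Y) \in (\mathcal{X}, \mathcal{Y})} \log P(Y \mid X; \theta)
  = \sum_{(X, Y) \in (\mathcal{X}, \mathcal{Y})} \sum_{t=1}^{|Y|}
      \log P\left(y_t \mid y_{<t}, X; \theta\right)
```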
  • Methods:

    3.1 Sequence-to-Sequence Learning

    In this work, the task of abstractive document summarization is modeled as a SEQ2SEQ learning problem.
  • Given the whole training set $(\mathcal{X}, \mathcal{Y})$, the model can be trained by maximizing the log-likelihood of the training document-summary pairs, $\mathcal{L}(\theta; \mathcal{X}, \mathcal{Y}) = \sum_{(X, Y) \in (\mathcal{X}, \mathcal{Y})} \log P(Y \mid X; \theta)$.
  • The authors first pre-train the SEQ2SEQ Transformer model on the unlabeled text using the proposed pre-training objectives and fine-tune it on the document-summary dataset.
  • If the relative order of sentences in the summary differs from the relative order of their mapped sentences in the original document, the authors count this as one content reordering (see the counting sketch after this list).
  • According to statistics on the training split of the summarization dataset, the content of the original documents is reordered in their summaries in approximately 40% of cases
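
A minimal sketch of how such a reordering statistic could be computed is shown below. The greedy mapping of each summary sentence to its highest-unigram-overlap source sentence is an assumption for illustration, not necessarily the authors' exact matching procedure.

```python
def _overlap(a, b):
    """Simple unigram-overlap score used to map a summary sentence to a source sentence."""
    a_set, b_set = set(a.lower().split()), set(b.lower().split())
    return len(a_set & b_set) / max(1, len(a_set))

def is_content_reordered(doc_sentences, summary_sentences):
    """Map each summary sentence to its best-matching source sentence, then check
    whether the mapped source indices appear in non-decreasing order."""
    mapped = [
        max(range(len(doc_sentences)), key=lambda i: _overlap(s, doc_sentences[i]))
        for s in summary_sentences
    ]
    return any(earlier > later for earlier, later in zip(mapped, mapped[1:]))

# Usage: count the proportion of (document, summary) pairs whose content is reordered.
pairs = [(["A b c .", "D e f .", "G h i ."], ["G h i .", "A b c ."])]
reordered = sum(is_content_reordered(d, s) for d, s in pairs) / len(pairs)
print(f"content reordered in {reordered:.0%} of pairs")
```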
  • Results:

    5.1 Automatic Evaluation

    The authors used ROUGE (Lin, 2004) to measure the quality of different summarization model outputs (a minimal ROUGE computation sketch follows this list).
  • Compared systems include PTGen (See et al, 2017), DRM (Paulus et al, 2018), BottomUp (Gehrmann et al, 2018), DCA (Celikyilmaz et al, 2018), BERTAbs (Liu and Lapata, 2019), UniLM (Dong et al, 2019), TRANSFORMER-S2S, ROBERTABASE-S2S, ROBERTA-S2S, ROBERTACONT-S2S, and the authors' models.
  • The first and second blocks show results of previous extractive and abstractive models, respectively.
  • Similar to the trends in CNNDM, the method leads to significant performance gains.
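
For reference, a minimal sketch of computing ROUGE with the Python rouge-score package is given below. Note that the paper reports scores from the original ROUGE script (full-length F1 for CNNDM, limited-length recall for NYT), so numbers from this sketch are only indicative.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "police arrested five climate protesters ."
candidate = "five protesters were arrested by police ."

# score(target, prediction) returns precision/recall/F1 for each requested metric.
scores = scorer.score(reference, candidate)
for name, s in scores.items():
    print(f"{name}: R={s.recall:.3f} P={s.precision:.3f} F1={s.fmeasure:.3f}")
```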
  • Conclusion:

    The authors proposed three sequence-to-sequence pre-training objectives: sentence reordering, next sentence generation, and masked document generation.
  • All of these objectives are closely related to the abstractive summarization task and are designed around reinstating the source text.
  • A SEQ2SEQ model for abstractive document summarization can be pre-trained using such objectives and fine-tuned on the summarization dataset.
  • The authors would like to investigate other objectives to pre-train SEQ2SEQ models for abstractive summarization
Tables
  • Table1: The number of document-summary pairs (for CNNDM and NYT) and unlabeled documents (for GIGA-CM)
  • Table2: Results on the test split of CNNDM using fulllength F1 based ROUGE-1 (R-1), ROUGE-2 (R-2) and ROUGE-L (R-L). ∗ indicates significant improvements (p < 0.05 measured with the ROUGE script) compared to models in the first two blocks
  • Table3: Results on the test set of NYT dataset using limited-length recall based ROUGE. ∗ indicates significant improvements (p < 0.05 measured with the ROUGE script) to models in the first two blocks
  • Table4: Results on the CNNDM test split of models pre-trained on different corpora. ∗ indicates significant differences from our model
  • Table5: Human evaluation results: proportions of system rankings. MR: mean rank (the lower the better)
Related work
Funding
  • Experiments show that, even when pre-training only on documents from the training split of a summarization dataset, our method improves performance by a large margin over a heavily tuned large SEQ2SEQ Transformer model that already includes a strong pre-trained encoder
  • Even though we merely use the in-domain training split (around 1GB), our method still significantly outperforms UniLM (Dong et al, 2019), which is pre-trained on 16GB of data
  • As listed in Table 4 (bottom part), our model significantly outperforms these two models, even though we only use 19GB of data for pre-training
Study subjects and analysis
document-summary pairs: 287,226
Following previous work (See et al, 2017; Liu and Lapata, 2019), we use the non-anonymized version of CNNDM. Specifically, we preprocessed the dataset with the publicly available scripts provided by See et al (2017) and obtained 287,226 document-summary pairs for training, 13,368 for validation and 11,490 for test. The NYT dataset (Sandhaus, 2008) is a collection of articles along with multi-sentence summaries written by library scientists

articles: 9,076
The NYT dataset (Sandhaus, 2008) is a collection of articles along with multi-sentence summaries written by library scientists. Following the preprocessing procedures described in (Durrett et al, 2016; Liu and Lapata, 2019), the test set is constructed by including all articles published on January 1, 2007 or later, which contains 9,076 articles. The remaining 100,834 articles are split into a training set of 96,834 examples and a validation set of 4,000 examples

articles: 100,834
Following the preprocessing procedures described in (Durrett et al, 2016; Liu and Lapata, 2019), the test set is constructed by including all articles published on January 1, 2007 or later, which contains 9,076 articles. The remaining 100,834 articles are split into a training set of 96,834 examples and a validation set of 4,000 examples. Following (Durrett et al, 2016), we also removed articles whose summaries contain fewer than 50 words from the test set, and the resulting test set contains 3,452 examples (a minimal filtering sketch is given below)
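
As an illustration only, the date and length filters described above might look like the following; the record field names (pub_date, summary) are hypothetical, since the actual NYT corpus format and preprocessing scripts are not shown here.

```python
from datetime import date

def build_nyt_test_set(articles):
    """Keep articles published on or after 2007-01-01 whose summaries have >= 50 words.
    Each article is assumed (hypothetically) to be a dict with 'pub_date' (datetime.date)
    and 'summary' (str) fields."""
    cutoff = date(2007, 1, 1)
    test = [a for a in articles if a["pub_date"] >= cutoff]
    return [a for a in test if len(a["summary"].split()) >= 50]

# Usage with toy records:
articles = [
    {"pub_date": date(2007, 3, 5), "summary": "word " * 60},
    {"pub_date": date(2006, 12, 31), "summary": "word " * 60},   # too early -> excluded
    {"pub_date": date(2007, 2, 1), "summary": "too short"},      # < 50 words -> excluded
]
print(len(build_nyt_test_set(articles)))  # 1
```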

documents: 6,521,658
To pre-train our model with the objectives introduced in Section 3.2, following the procedures in Zhang et al (2019b), we created the GIGA-CM dataset, which contains only unlabeled documents. The training set of GIGA-CM is composed of 6,521,658 documents sampled from the English Gigaword dataset and the training documents in CNNDM, resulting in 19GB of text for pre-training. We used the 13,368 documents in the validation split of CNNDM as the validation set

documents: 13,368
The training set of GIGA-CM is composed of 6,521,658 documents sampled from the English Gigaword dataset4 and the training documents in CNNDM, resulting in 19GB text for pretraining. We used the 13,368 documents in the validation split of CNNDM as the validation set. Note that the Gigaword dataset overlaps with the NYT dataset and we therefore excluded the test set of NYT from the training set of GIGA-CM

documents: 50
Since abstractive models may produce disfluent or ungrammatical outputs, we also evaluated abstractive systems by eliciting human judgements. We compared our best performing model (i.e., pre-training on the GIGA-CM dataset using the SR objective) with human references (denoted as Gold), as well as several strong baselines whose system outputs are available to us, including RoBERTa-S2S and two pre-training based models, i.e., BERTAbs (Liu and Lapata, 2019) and UniLM (Dong et al, 2019). 50 documents are randomly sampled from the test split of CNNDM. 10 participants are presented with a document and a list of outputs generated by different abstractive summarization systems. They are then asked to rank the outputs of these systems from best to worst according to informativeness (does the summary capture the informative part of the document?), fluency (is the summary grammatical?), and succinctness (does the summary express the document clearly in a few words?). We report the proportions of system rankings and the mean rank (lower is better) in Table 5 (a small sketch of these rank statistics follows)
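
The rank statistics reported in Table 5 could be computed along the following lines; this is a sketch with made-up rankings, not the study's data.

```python
from collections import Counter

def rank_stats(rankings, system):
    """Given per-annotator-per-document rankings (lists ordered best -> worst),
    return the proportion of times `system` was placed at each rank and its mean rank."""
    positions = [r.index(system) + 1 for r in rankings]   # 1 = best
    counts = Counter(positions)
    n = len(positions)
    proportions = {rank: counts.get(rank, 0) / n for rank in range(1, len(rankings[0]) + 1)}
    mean_rank = sum(positions) / n
    return proportions, mean_rank

# Toy example with three systems ranked by two annotators:
rankings = [["Ours", "UniLM", "BERTAbs"], ["UniLM", "Ours", "BERTAbs"]]
print(rank_stats(rankings, "Ours"))   # ({1: 0.5, 2: 0.5, 3: 0.0}, 1.5)
```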

Reference
  • Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR.
  • Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proc. of NAACL.
  • Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proc. of ACL.
  • Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proc. of ACL.
  • John M. Conroy and Dianne P. O'Leary. 2001. Text summarization via hidden Markov models. In Proc. of SIGIR.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL.
  • Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Proc. of NIPS.
  • Greg Durrett, Taylor Berg-Kirkpatrick, and Dan Klein. 2016. Learning-based single-document summarization with compression and anaphoricity constraints. In Proc. of ACL.
  • Elena Filatova and Vasileios Hatzivassiloglou. 2004. Event-based extractive summarization. In Text Summarization Branches Out.
  • Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proc. of EMNLP.
  • Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proc. of ACL.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proc. of NIPS.
  • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, and Min Sun. 2018. A unified model for extractive and abstractive summarization using inconsistency loss. In Proc. of ACL.
  • Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.
  • Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In Proc. of ICLR.
  • Julian Kupiec, Jan Pedersen, and Francine Chen. 1995. A trainable document summarizer. In Proc. of SIGIR.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proc. of ACL Workshop.
  • Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proc. of EMNLP.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proc. of ACL: System Demonstrations.
  • Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proc. of AAAI.
  • Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proc. of SIGNLL.
  • Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018a. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proc. of EMNLP.
  • Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018b. Ranking sentences for extractive summarization with reinforcement learning. In Proc. of NAACL.
  • Ani Nenkova, Lucy Vanderwende, and Kathleen McKeown. 2006. A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization. In Proc. of SIGIR.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proc. of NAACL: Demonstrations.
  • Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In Proc. of ICLR.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
  • Dragomir Radev, Timothy Allison, Sasha Blair-Goldensohn, John Blitzer, Arda Celebi, Stanko Dimitrov, Elliott Drabek, Ali Hakim, Wai Lam, Danyu Liu, Jahna Otterbacher, Hong Qi, Horacio Saggion, Simone Teufel, Michael Topper, Adam Winkel, and Zhu Zhang. 2004. MEAD - a platform for multidocument multilingual text summarization. In Proc. of LREC.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  • Evan Sandhaus. 2008. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia, 6(12):e26752.
  • Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proc. of ACL.
  • Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In Proc. of ICML.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. of NIPS.
  • Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou. 2020. ProphetNet: Predicting future n-gram for sequence-to-sequence pre-training. arXiv preprint arXiv:2001.04063.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
  • Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019a. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. arXiv preprint arXiv:1912.08777.
  • Xingxing Zhang, Mirella Lapata, Furu Wei, and Ming Zhou. 2018. Neural latent extractive document summarization. In Proc. of ACL.
  • Xingxing Zhang, Furu Wei, and Ming Zhou. 2019b. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In Proc. of ACL.