Pre-training for Abstractive Document Summarization by Reinstating Source Text
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3646-3660, 2020.
Keywords:
model pre-training, abstractive document summarization, abstractive summarization, relative order, Next Sentence Generation
Abstract:
Abstractive document summarization is usually modeled as a sequence-to-sequence (SEQ2SEQ) learning problem. Unfortunately, training large SEQ2SEQ based summarization models on limited supervised summarization data is challenging. This paper presents three sequence-to-sequence pre-training (in shorthand, STEP) objectives which allow us to pre-train a SEQ2SEQ based abstractive summarization model on unlabeled text.
Introduction
- Automatic document summarization is the task of condensing a document into a shorter form with important content preserved, which requires a wide-coverage understanding of the document rather than of specific words or phrases.
- This task can typically be classified into two categories: extractive and abstractive document summarization.
- As far as the authors know, relatively little prior work has studied pre-training for abstractive summarization.
Highlights
- Automatic document summarization is the task of condensing a document into a shorter form with important content preserved, which requires a wide-coverage understanding of the document rather than of specific words or phrases.
- This task can typically be classified into two categories: extractive and abstractive document summarization.
- Abstractive summarization (Nallapati et al, 2016; See et al, 2017; Paulus et al, 2018) rewrites the source text and generates the corresponding summary which may contain novel words and phrases not featured in the input
- The output summary is closely related to the input document
- Summary sentences, paraphrased from the input by the abstractive summarizers, might have a different relative order compared to the source text
- We proposed three sequence-to-sequence pre-training objectives: sentence reordering, next sentence generation, and masked document generation
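A minimal sketch of how such (source, target) pre-training pairs could be constructed from an unlabeled document. The sentence splitting, mask token, 50/50 split for next sentence generation, and token-level mask rate below are illustrative assumptions, not the authors' exact settings.

```python
import random

MASK = "<mask>"  # placeholder mask token; the real token depends on the tokenizer (assumption)

def sentence_reordering(sents):
    """SR: the input is the shuffled document; the target is the original document."""
    shuffled = sents[:]
    random.shuffle(shuffled)
    return " ".join(shuffled), " ".join(sents)

def next_sentence_generation(sents):
    """NSG: the input is the first part of the document; the target is the remainder."""
    cut = max(1, len(sents) // 2)  # the split point here is an assumption
    return " ".join(sents[:cut]), " ".join(sents[cut:])

def masked_document_generation(sents, mask_rate=0.25):
    """MDG: the input is the document with some tokens masked; the target is the original document.
    Token-level masking and the rate used here are simplifications."""
    tokens = " ".join(sents).split()
    masked = [MASK if random.random() < mask_rate else tok for tok in tokens]
    return " ".join(masked), " ".join(tokens)

# Each function yields one (source, target) pair for SEQ2SEQ pre-training.
doc = ["The first sentence .", "The second sentence .", "The third sentence .", "The fourth sentence ."]
for build in (sentence_reordering, next_sentence_generation, masked_document_generation):
    src, tgt = build(doc)
    print(build.__name__, "| source:", src, "| target:", tgt)
```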
Methods
- 3.1 Sequence-to-Sequence Learning
In this work, the task of abstractive document summarization is modeled as a SEQ2SEQ learning problem.
- Given the whole training set (X, Y), this model can be trained by maximizing the log-likelihood of the training document-summary pairs: L(θ; X, Y) = Σ log p(Y | X; θ), where the sum runs over all document-summary pairs (X, Y) in the training set (a loss-computation sketch follows this list).
- The authors first pre-train the SEQ2SEQ Transformer model on the unlabeled text using the proposed pre-training objectives and fine-tune it on the document-summary dataset.
- If the relative order of sentences in the summary differs from the relative order of their mapped sentences in the original document, the authors count this as one content reordering.
- According to statistics on the training split of the summarization dataset, the content of the original documents is reordered in their summaries in approximately 40% of cases.
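A minimal sketch of the maximum-likelihood training step referenced above, assuming a generic encoder-decoder model that returns per-token logits under teacher forcing; the model signature is a placeholder, not the authors' implementation.

```python
import torch.nn.functional as F

def seq2seq_nll(model, src_ids, tgt_ids, pad_id):
    """Negative log-likelihood for a batch of document-summary pairs.

    src_ids: (batch, src_len) token ids of documents X
    tgt_ids: (batch, tgt_len) token ids of summaries Y
    `model` is assumed to return logits of shape (batch, tgt_len - 1, vocab)
    for p(y_t | y_<t, X; theta) under teacher forcing (placeholder signature).
    """
    decoder_input = tgt_ids[:, :-1]            # y_<t
    labels = tgt_ids[:, 1:]                    # y_t
    logits = model(src_ids, decoder_input)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=pad_id,                   # padding positions do not contribute
    )
    return loss  # minimizing this loss maximizes L(theta; X, Y)
```

The same loss applies both during pre-training, where the target is the reinstated source text, and during fine-tuning on document-summary pairs.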
Results
- 5.1 Automatic Evaluation
The authors used ROUGE (Lin, 2004) to measure the quality of different summarization model outputs (an evaluation sketch follows this list).
- The compared systems include PTGen (See et al, 2017), DRM (Paulus et al, 2018), BottomUp (Gehrmann et al, 2018), DCA (Celikyilmaz et al, 2018), BERTAbs (Liu and Lapata, 2019), UniLM (Dong et al, 2019), TRANSFORMER-S2S, ROBERTABASE-S2S, ROBERTA-S2S, ROBERTACONT-S2S, and the authors' own models.
- The first and second blocks show results of previous extractive and abstractive models, respectively.
- Similar to the trends on CNNDM, the method leads to significant performance gains on NYT.
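The paper reports scores from the original ROUGE script; purely as an illustration, here is a sketch using the rouge-score Python package (my substitution, not the authors' evaluation tooling) to compute F1-based ROUGE-1/2/L.

```python
# pip install rouge-score  -- illustrative package choice, not the authors' evaluation script
from rouge_score import rouge_scorer

def average_rouge(references, predictions):
    """Average F1-based ROUGE-1/2/L over paired reference and system summaries."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for ref, pred in zip(references, predictions):
        scores = scorer.score(ref, pred)      # signature is score(target, prediction)
        for key in totals:
            totals[key] += scores[key].fmeasure
    return {key: value / len(predictions) for key, value in totals.items()}

print(average_rouge(["the cat sat on the mat ."], ["a cat was sitting on the mat ."]))
```

Scores from this package can differ slightly from the original script, so it serves only as a sanity check, not a replacement for the reported evaluation.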
Conclusion
- The authors proposed three sequence-to-sequence pre-training objectives: sentence reordering, next sentence generation, and masked document generation.
- All of these objectives are closely related to the abstractive summarization task and are designed around reinstating the source text.
- A SEQ2SEQ model for abstractive document summarization can be pre-trained using such objectives and fine-tuned on the summarization dataset.
- The authors would like to investigate other objectives to pre-train SEQ2SEQ models for abstractive summarization
Summary
Introduction:
Automatic document summarization is the task of condensing a document into a shorter form with important content preserved, which requires a wide-coverage understanding of the document rather than of specific words or phrases.
- This task can typically be classified into two categories: extractive and abstractive document summarization.
- As far as the authors know, relatively little prior work has studied pre-training for abstractive summarization.
Objectives:
Song et al (2019) tested their model on sentence-level tasks, while the authors aim to solve document-level tasks.
- Given a document X = (x1, x2, ..., x|X|) paired with its summary Y = (y1, y2, ..., y|Y|), the authors aim to learn the model parameters θ and estimate the conditional probability P(Y | X; θ) = ∏_{t=1}^{|Y|} p(yt | y<t, X; θ), where y<t denotes the summary tokens generated before position t.
Methods:
3.1 Sequence-to-Sequence Learning
In this work, the task of abstractive document summarization is modeled as a SEQ2SEQ learning problem.
- Given the whole training set (X, Y), this model can be trained by maximizing the log-likelihood of the training document-summary pairs: L(θ; X, Y) = Σ log p(Y | X; θ), where the sum runs over all document-summary pairs (X, Y) in the training set.
- The authors first pre-train the SEQ2SEQ Transformer model on the unlabeled text using the proposed pre-training objectives and fine-tune it on the document-summary dataset.
- If the relative order of sentences in the summary differs from the relative order of their mapped sentences in the original document, the authors count this as one content reordering (a counting sketch follows this list).
- According to statistics on the training split of the summarization dataset, the content of the original documents is reordered in their summaries in approximately 40% of cases.
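A minimal sketch of how this counting could be implemented, assuming each summary sentence has already been mapped to the index of a source sentence; the mapping itself (e.g., by lexical overlap) is outside this sketch and is an assumption, not the authors' exact procedure.

```python
def is_content_reordered(mapped_indices):
    """mapped_indices[i] = index of the source sentence that summary sentence i is mapped to.

    The document-summary pair counts as one content reordering if these indices
    are not non-decreasing, i.e., the summary presents content in a different
    relative order than the source document.
    """
    return any(a > b for a, b in zip(mapped_indices, mapped_indices[1:]))

# Toy mappings: the second summary reorders the source content.
print(is_content_reordered([0, 2, 5]))   # False: summary follows the source order
print(is_content_reordered([3, 1, 4]))   # True: counted as one content reordering

def reordering_rate(all_mapped_indices):
    """Fraction of document-summary pairs whose content is reordered (~40% on the training split)."""
    flags = [is_content_reordered(m) for m in all_mapped_indices]
    return sum(flags) / len(flags)
```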
Results:
5.1 Automatic Evaluation
The authors used ROUGE (Lin, 2004) to measure the quality of different summarization model outputs.
- The compared systems include PTGen (See et al, 2017), DRM (Paulus et al, 2018), BottomUp (Gehrmann et al, 2018), DCA (Celikyilmaz et al, 2018), BERTAbs (Liu and Lapata, 2019), UniLM (Dong et al, 2019), TRANSFORMER-S2S, ROBERTABASE-S2S, ROBERTA-S2S, ROBERTACONT-S2S, and the authors' own models.
- The first and second blocks show results of previous extractive and abstractive models, respectively.
- Similar to the trends on CNNDM, the method leads to significant performance gains on NYT.
Conclusion:
The authors proposed three sequence-to-sequence pre-training objectives: sentence reordering, next sentence generation, and masked document generation.
- All of these objectives are closely related to the abstractive summarization task and are designed around reinstating the source text.
- A SEQ2SEQ model for abstractive document summarization can be pre-trained using such objectives and fine-tuned on the summarization dataset.
- The authors would like to investigate other objectives to pre-train SEQ2SEQ models for abstractive summarization
Tables
- Table1: The number of document-summary pairs (for CNNDM and NYT) and unlabeled documents (for GIGA-CM)
- Table2: Results on the test split of CNNDM using full-length F1 based ROUGE-1 (R-1), ROUGE-2 (R-2) and ROUGE-L (R-L). ∗ indicates significant improvements (p < 0.05 measured with the ROUGE script) compared to models in the first two blocks
- Table3: Results on the test set of NYT dataset using limited-length recall based ROUGE. ∗ indicates significant improvements (p < 0.05 measured with the ROUGE script) to models in the first two blocks
- Table4: Results on the CNNDM test split of models pre-trained on different corpora. ∗ indicates significant differences from our model
- Table5: Human evaluation results: proportions of system rankings. MR: mean rank (the lower the better)
Related work
- Extractive Summarization This task aims to find the informative sentences in a document as its summary. This task is usually viewed as a sentence ranking problem (Kupiec et al, 1995; Conroy and O’leary, 2001) using scores from a binary (sequence) classification model, which predicts whether a sentence is in the summary or not. Extractive neural models (Cheng and Lapata, 2016; Nallapati et al, 2017; Narayan et al, 2018b; Zhang et al, 2018) employ hierarchical LSTMs/CNNs as the feature learning part of the binary (sequence) classifier, which largely outperform discrete feature based models (Radev et al, 2004; Filatova and Hatzivassiloglou, 2004; Nenkova et al, 2006). Very recently, the feature learning part was replaced again with pre-trained Transformer encoders (Zhang et al, 2019b; Liu and Lapata, 2019) that lead to another huge performance gain. However, extractive models have their own limitations. For example, the extracted sentences might be too long and redundant. Besides, manually written summaries in their nature are abstractive. Therefore, we focus on abstractive summarization in this paper.
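To make the sentence-ranking view concrete, a toy sketch (my illustration, not any of the cited systems): a binary classifier assigns each sentence an "in-summary" score, and the top-scoring sentences, kept in document order, form the extractive summary.

```python
def extract_summary(sentences, scores, k=3):
    """Rank sentences by a classifier's 'in-summary' score and keep the top-k in document order."""
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

# Made-up scores standing in for a trained binary (sequence) classifier.
sents = ["Sentence A.", "Sentence B.", "Sentence C.", "Sentence D."]
print(extract_summary(sents, scores=[0.9, 0.2, 0.7, 0.4], k=2))  # ['Sentence A.', 'Sentence C.']
```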
Findings
- Experiments show that, even when pre-training only on documents from the training split of a summarization dataset, our method improves performance by a large margin over a heavily tuned large SEQ2SEQ Transformer model that already includes a strong pre-trained encoder
- Even though we use only the in-domain training split (around 1GB), our method still significantly outperforms UniLM (Dong et al, 2019), which is pre-trained on 16GB of data
- As listed in Table 4 (bottom part), our model significantly outperforms these two models, even though we use only 19GB of data for pre-training
Study subjects and analysis
document-summary pairs: 287,226
Following previous work (See et al, 2017; Liu and Lapata, 2019), we use the non-anonymized version of CNNDM. Specifically, we preprocessed the dataset with the publicly available scripts provided by See et al (2017) and obtained 287,226 document-summary pairs for training, 13,368 for validation and 11,490 for test. NYT The NYT dataset (Sandhaus, 2008) is a collection of articles along with multi-sentence summaries written by library scientists
articles: 9,076
NYT The NYT dataset (Sandhaus, 2008) is a collection of articles along with multi-sentence summaries written by library scientists. Following the preprocessing procedures described in (Durrett et al, 2016; Liu and Lapata, 2019), the test set is constructed by including all articles published on January 1, 2007 or later, which contains 9,076 articles. The remaining 100,834 articles are split into a training set of 96,834 examples and a validation set of 4,000 examples
articles: 100,834
Following the preprocessing procedures described in (Durrett et al, 2016; Liu and Lapata, 2019), the test set is constructed by including all articles published on January 1, 2007 or later, which contains 9,076 articles. The remaining 100,834 articles are split into a training set of 96,834 examples and a validation set of 4,000 examples. Following (Durrett et al, 2016), we also removed articles whose summaries contain less than 50 words from the test set, and the resulting test set contains 3,452 examples
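A sketch of the date and length filters described above, assuming each article is represented as a dictionary with a publication date and a summary string; the field names are illustrative, not the dataset's actual schema.

```python
from datetime import date

def build_nyt_test_set(articles):
    """Keep articles published on or after 2007-01-01 whose summaries have at least 50 words.

    Each article is assumed to look like {"published": date(...), "summary": "..."} (illustrative schema).
    """
    cutoff = date(2007, 1, 1)
    test = [a for a in articles if a["published"] >= cutoff]
    # Following Durrett et al. (2016), drop articles whose summaries are shorter than 50 words.
    return [a for a in test if len(a["summary"].split()) >= 50]
```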
documents: 6,521,658
GIGA-CM To pre-train our model with the objectives introduced in Section 3.2, following the procedures in Zhang et al (2019b), we created the GIGA-CM dataset, which contains only unlabeled documents. The training set of GIGA-CM is composed of 6,521,658 documents sampled from the English Gigaword dataset and the training documents in CNNDM, resulting in 19GB of text for pre-training. We used the 13,368 documents in the validation split of CNNDM as the validation set
documents: 13,368
The training set of GIGA-CM is composed of 6,521,658 documents sampled from the English Gigaword dataset4 and the training documents in CNNDM, resulting in 19GB text for pretraining. We used the 13,368 documents in the validation split of CNNDM as the validation set. Note that the Gigaword dataset overlaps with the NYT dataset and we therefore excluded the test set of NYT from the training set of GIGA-CM
documents: 50
Since abstractive models may produce disfluent or ungrammatical outputs, we also evaluated abstractive systems by eliciting human judgements. We compared our best performing model (i.e., pre-training on the GIGA-CM dataset using the SR objective) with human references (denoted as Gold), as well as several strong baselines whose system outputs are available to us, including RoBERTa-S2S and two pre-training based models, i.e., BERTAbs (Liu and Lapata, 2019) and UniLM (Dong et al, 2019). 50 documents are randomly sampled from the test split of CNNDM. 10 participants are presented with a document and a list of outputs generated by different abstractive summarization systems. Then they are asked to rank the outputs of these systems from best to worst according to informativeness (does the summary capture the informative part of the document?), fluency (is the summary grammatical?), and succinctness (does the summary express the document clearly in a few words?). We report the proportions of system rankings and mean rank (lower is better) in Table 5
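A small sketch of how the reported statistics could be computed from the collected judgements, assuming each judgement is stored as a list of system names ordered from best (rank 1) to worst; this storage format is my assumption.

```python
from collections import defaultdict

def ranking_stats(judgements):
    """judgements: one list per (participant, document), ordering system names from best to worst.

    Returns each system's proportion of times at each rank and its mean rank
    (lower is better), as reported in Table 5.
    """
    rank_counts = defaultdict(lambda: defaultdict(int))
    ranks = defaultdict(list)
    for ranking in judgements:
        for position, system in enumerate(ranking, start=1):
            rank_counts[system][position] += 1
            ranks[system].append(position)
    n = len(judgements)
    proportions = {s: {r: c / n for r, c in by_rank.items()} for s, by_rank in rank_counts.items()}
    mean_rank = {s: sum(r) / len(r) for s, r in ranks.items()}
    return proportions, mean_rank

# Two toy judgements over three systems.
props, mr = ranking_stats([["Ours", "UniLM", "BERTAbs"], ["Ours", "BERTAbs", "UniLM"]])
print(mr)  # {'Ours': 1.0, 'UniLM': 2.5, 'BERTAbs': 2.5}
```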
Reference
- Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR.
- Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proc. of NAACL.
- Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proc. of ACL.
- Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proc. of ACL.
- John M Conroy and Dianne P O’leary. 2001. Text summarization via hidden markov models. In Proc. of SIGIR.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL.
- Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Proc. of NIPS.
- Greg Durrett, Taylor Berg-Kirkpatrick, and Dan Klein. 2016. Learning-based single-document summarization with compression and anaphoricity constraints. In Proc. of ACL.
- Elena Filatova and Vasileios Hatzivassiloglou. 2004. Event-based extractive summarization. In Text Summarization Branches Out.
- Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proc. of EMNLP.
- Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proc. of ACL.
- Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proc. of NIPS.
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, and Min Sun. 2018. A unified model for extractive and abstractive summarization using inconsistency loss. In Proc. of ACL.
- Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. 2020. Spanbert: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.
- Diederik P Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In Proc. of ICLR.
- Julian Kupiec, Jan Pedersen, and Francine Chen. 1995. A trainable document summarizer. In Proc. of SIGIR.
- Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
- Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Proc. of ACLWorkshop.
- Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proc. of EMNLP.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The stanford corenlp natural language processing toolkit. In Proc. of ACL: System Demonstrations.
- Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. In Proc. of AAAI.
- Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. In Proc. of SIGNLL.
- Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018a. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Proc. of EMNLP.
- Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018b. Ranking sentences for extractive summarization with reinforcement learning. In Proc. of NAACL.
- Ani Nenkova, Lucy Vanderwende, and Kathleen McKeown. 2006. A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization. In Proc. of SIGIR.
- Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proc. of NAACL: Demonstrations.
- Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In Proc. of ICLR.
- Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
- Dragomir Radev, Timothy Allison, Sasha Blair-Goldensohn, John Blitzer, Arda Celebi, Stanko Dimitrov, Elliott Drabek, Ali Hakim, Wai Lam, Danyu Liu, Jahna Otterbacher, Hong Qi, Horacio Saggion, Simone Teufel, Michael Topper, Adam Winkel, and Zhu Zhang. 2004. MEAD - a platform for multidocument multilingual text summarization. In Proc. of LREC.
- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
- Evan Sandhaus. 2008. The new york times annotated corpus. Linguistic Data Consortium, Philadelphia, 6(12):e26752.
- Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointergenerator networks. In Proc. of ACL.
- Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. Mass: Masked sequence to sequence pre-training for language generation. In Proc. of ICML.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. of NIPS.
- Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou. 2020. Prophetnet: Predicting future ngram for sequence-to-sequence pre-training. arXiv preprint arXiv:2001.04063.
- Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
- Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J Liu. 2019a. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. arXiv preprint arXiv:1912.08777.
- Xingxing Zhang, Mirella Lapata, Furu Wei, and Ming Zhou. 2018. Neural latent extractive document summarization. In Proc. of ACL.
- Xingxing Zhang, Furu Wei, and Ming Zhou. 2019b. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In Proc. of ACL.