Few-Shot Learning for Opinion Summarization

Arthur Bražinskas

EMNLP 2020, pp. 4119–4135.

Keywords: automatic evaluation, opinion summarization, multi-task learning, unsupervised opinion summarization, Amazon Mechanical Turk

Abstract:

Opinion summarization is the automatic creation of text reflecting subjective information expressed in multiple documents, such as user reviews of a product. The task is practically important and has attracted a lot of attention. However, due to the high cost of summary production, datasets large enough for training supervised models are …

Introduction
  • Summarization of user opinions expressed in online resources, such as blogs, reviews, social media, or internet forums, has drawn much attention due to its potential for various information access applications, such as creating digests, search, and reports.
  • These shoes run true to size, do a good job supporting the arch of the foot and are well-suited for exercise.
  • They’re good looking, comfortable, and the sole feels soft and cushioned
  • Overall they are a nice, light-weight pair of shoes and come in a variety of stylish colors.
  • They run a little on the narrow side, so make sure to order a half size larger than normal
Highlights
  • We introduce the first few-shot learning framework for abstractive opinion summarization (for simplicity, we use the term ‘product’ to refer to both Amazon products and Yelp businesses)
  • In addition to using the ROUGE scores, as explained previously, we introduce a novelty reduction technique, which is similar to label smoothing (Pereyra et al., 2017)
  • We partitioned business/product reviews into groups of 9 reviews by sampling without replacement
  • We introduce the first, to our knowledge, few-shot framework for abstractive opinion summarization
  • We demonstrate that our approach substantially outperforms competitive ones, both abstractive and extractive, in human and automatic evaluation
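The partitioning step above (groups of 9 reviews per product, sampled without replacement) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name, the fixed seed, and the choice to drop leftover reviews are our assumptions.

```python
import random

def partition_reviews(reviews, group_size=9, seed=0):
    """Randomly partition reviews into disjoint groups of `group_size`
    by sampling without replacement; leftover reviews are dropped."""
    rng = random.Random(seed)          # fixed seed for reproducibility (an assumption)
    pool = list(reviews)
    rng.shuffle(pool)                  # shuffling then slicing == sampling w/o replacement
    n_groups = len(pool) // group_size
    return [pool[i * group_size:(i + 1) * group_size] for i in range(n_groups)]

# 30 reviews for one product yield 3 disjoint groups of 9; 3 reviews are dropped.
groups = partition_reviews([f"review_{i}" for i in range(30)])
```

Because each review appears in at most one group, no review is used twice as a target within an epoch.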
Methods
  • 4.1 Dataset

    For training, the authors used customer reviews from Amazon (He and McAuley, 2016) and Yelp. From the Amazon reviews, the authors selected 4 categories: Electronics; Clothing, Shoes and Jewelry; Home and Kitchen; Health and Personal Care.
  • To speed up the training phase, the authors trained an unconditional language model for 13 epochs on the Amazon reviews with the learning rate (LR) set to 5 × 10⁻⁴.
  • The authors trained the model using Eq. 2 for 9 epochs on the Amazon reviews with a 6 × 10⁻⁵ LR, and then for 57 epochs with the LR set to 5 × 10⁻⁵.
  • On Yelp, the authors trained for 87 epochs with a 1 × 10⁻⁵ LR. Lastly, the authors fine-tuned the plug-in network on the human-written summaries by output matching with the oracle: 23 epochs with a 1 × 10⁻⁴ LR on Yelp.
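The staged training described above can be summarized as a schedule of (stage, epochs, learning rate) triples. This is only an organizational sketch: `run_schedule` and `train_step` are hypothetical names, and the actual objectives (the LM loss, Eq. 2, oracle output matching) and optimizer are not shown.

```python
# Stages paraphrased from the bullets above: epochs and learning rates as reported.
SCHEDULE = [
    ("unconditional LM pre-training (Amazon)", 13, 5e-4),
    ("Eq. 2 training (Amazon)",                 9, 6e-5),
    ("Eq. 2 training, reduced LR (Amazon)",    57, 5e-5),
    ("Eq. 2 training (Yelp)",                  87, 1e-5),
    ("plug-in fine-tuning on summaries (Yelp)", 23, 1e-4),
]

def run_schedule(schedule, train_step=None):
    """Iterate over the stages, calling `train_step(lr)` once per epoch.
    Returns the total number of epochs executed."""
    total = 0
    for name, epochs, lr in schedule:
        for _ in range(epochs):
            if train_step is not None:
                train_step(lr)  # placeholder for one epoch of optimization
            total += 1
    return total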
Results
  • Automatic Evaluation: The authors report ROUGE F1 scores (Lin, 2004) on the Amazon and Yelp test sets in Tables 3 and 4, respectively.
  • Best-Worst Scaling: The authors performed human evaluation with Best-Worst scaling (Louviere and Woodworth, 1991; Louviere et al., 2015; Kiritchenko and Mohammad, 2016) on the Amazon and Yelp test sets using the AMT platform.
  • The authors assigned multiple workers to each tuple containing summaries from COPYCAT, their model, LEXRANK, and human annotators.
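Best-Worst scaling scores are conventionally computed as the fraction of times a system is chosen best minus the fraction it is chosen worst, giving a value in [-1, 1]. A minimal sketch (the system names and judgment tallies below are made up for illustration, not results from the paper):

```python
from collections import Counter

def best_worst_scores(judgments, systems):
    """Each judgment is a (best_system, worst_system) pair from one worker.
    score(s) = (#times s chosen best - #times s chosen worst) / #judgments."""
    best = Counter(b for b, _ in judgments)
    worst = Counter(w for _, w in judgments)
    n = len(judgments)
    return {s: (best[s] - worst[s]) / n for s in systems}

# Four hypothetical worker judgments over four systems:
scores = best_worst_scores(
    [("ours", "lexrank"), ("ours", "copycat"), ("gold", "lexrank"), ("ours", "lexrank")],
    ["ours", "copycat", "lexrank", "gold"],
)
```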
Conclusion
  • The authors introduce the first, to their knowledge, few-shot framework for abstractive opinion summarization.
  • The authors show that it can efficiently utilize even a handful of annotated review–summary pairs to train models that generate fluent, informative summaries reflecting the overall sentiment.
  • The authors propose to exploit summary-related properties of unannotated reviews, which are used for unsupervised training of a generator.
  • The authors show that this allows for successful cross-domain adaptation.
Summary
  • Objectives:

    The authors' goal is to estimate the conditional distribution p(r_i | r_{-i}) by optimizing the parameters θ, as shown in Eq. 1.
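The leave-one-out conditioning p(r_i | r_{-i}) can be illustrated by turning one group of reviews into (target, sources) training pairs, where each review in turn is the target and the remaining reviews are the sources. A minimal sketch (the function name is ours):

```python
def leave_one_out_pairs(reviews):
    """For a group of reviews r_1..r_k, build the k (target r_i, sources r_{-i})
    pairs used to train the conditional model p(r_i | r_{-i})."""
    return [(reviews[i], reviews[:i] + reviews[i + 1:]) for i in range(len(reviews))]

# A group of 3 reviews yields 3 training pairs.
pairs = leave_one_out_pairs(["r1", "r2", "r3"])
```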
Tables
  • Table1: Example summaries produced by our system and an annotator; colors encode its alignment to the input reviews. The reviews are truncated, and delimited with the symbol ‘||’
  • Table2: Data statistics after pre-processing. The format in the cells is Businesses/Reviews and Products/Reviews for Yelp and Amazon, respectively
  • Table3: ROUGE scores on the Amazon test set
  • Table4: ROUGE scores on the Yelp test set
  • Table5: Human evaluation results in terms of the Best-Worst scaling on the Amazon test set
  • Table6: Human evaluation results in terms of the Best-Worst scaling on the Yelp test set
  • Table7: Content support on the Amazon test set
  • Table8: ROUGE scores on the Amazon test set for alternative summary adaptation strategies
  • Table9: Text characteristics of generated summaries by different models on the Amazon test set
  • Table10: Example summaries produced by models with different adaptation approaches
  • Table11: In-domain and cross-domain experiments on the Amazon dataset; ROUGE-L scores are reported
  • Table12: Examples of review sentences that contain only pronouns belonging to a specific class
  • Table13: Example summaries produced by different systems on Yelp data
  • Table14: Table 14
  • Table15: Example summaries produced by different systems on Amazon data
Related work
  • Extractive weakly-supervised opinion summarization has been an active area of research. LEXRANK (Erkan and Radev, 2004) is an unsupervised extractive model. OPINOSIS (Ganesan et al., 2010) does not use any supervision and relies on POS tags and redundancies to generate short opinions. However, this approach is not well suited to generating coherent long summaries and, although it can recombine fragments of the input text, it cannot generate novel words and phrases. Other earlier approaches (Gerani et al., 2014; Di Fabbrizio et al., 2014) relied on text planners and templates, which restrict the output text. A more recent extractive method by Angelidis and Lapata (2018) frames the problem as a pipeline of steps with a different model for each step. Isonuma et al. (2019) introduce an unsupervised approach for single product review summarization that relies on latent discourse trees. The unsupervised approach most related to this work is our own COPYCAT (Brazinskas et al., 2020). Unlike that work, we rely on a powerful generator to learn conditional spaces of text without hierarchical latent variables. Finally, in contrast to MEANSUM (Chu and Liu, 2019), our model relies on inductive biases without explicit modeling of summaries. A concurrent model, DENOISESUM (Amplayo and Lapata, 2020), uses a synthetically generated dataset of source reviews to train a generator to denoise and distill common information. Another parallel work, OPINIONDIGEST (Suhara et al., 2020), considers controllable opinion aggregation and is a pipeline framework for abstractive summary generation. Our approach of conditioning on text properties is similar to Ficler and Goldberg (2017), yet we rely on automatically derived properties that associate a target with its sources, and learn a separate module to generate their combinations. Moreover, their method has not been studied in the context of summarization.
Funding
  • We gratefully acknowledge the support of the European Research Council (Titov: ERC StG BroadSem 678254; Lapata: ERC CoG TransModal 681760) and the Dutch National Science Foundation (NWO VIDI 639.022.518).
Study subjects and analysis
AMT workers: 3
We split summaries generated by our model and COPYCAT into sentences. Then, for each summary sentence, we hired 3 AMT workers to judge how well the content of the sentence is supported by the reviews. Three options were available.
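With 3 workers per sentence, per-sentence judgments are typically aggregated by majority vote. A minimal sketch; the option labels ('full', 'partial', 'no' support) and the tie-breaking rule (falling back to 'partial' when no label wins a majority) are our assumptions, not details from the paper:

```python
from collections import Counter

def aggregate_support(worker_labels):
    """Majority vote over one sentence's judgments from 3 workers.
    Labels: 'full', 'partial', or 'no' support (assumed names).
    If no label has a strict majority, fall back to 'partial'."""
    counts = Counter(worker_labels)
    label, freq = counts.most_common(1)[0]
    return label if freq > len(worker_labels) // 2 else "partial"

# Two of three workers judge the sentence fully supported:
verdict = aggregate_support(["full", "full", "partial"])
```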

Reference
  • Reinald Kim Amplayo and Mirella Lapata. 2020. Unsupervised opinion summarization with noising and denoising. Proceedings of Association for Computational Linguistics (ACL).
  • Stefanos Angelidis and Mirella Lapata. 2018. Summarizing opinions: Aspect extraction meets sentiment prediction and they are both weakly supervised. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3675–3686.
  • Julian Besag. 1975. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society: Series D (The Statistician), 24(3):179–195.
  • John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 440–447.
  • Arthur Brazinskas, Mirella Lapata, and Ivan Titov. 2020. Unsupervised opinion summarization as copycat-review generation. In Proceedings of Association for Computational Linguistics (ACL).
  • Eric Chu and Peter Liu. 2019. Meansum: a neural model for unsupervised multi-document abstractive summarization. In Proceedings of International Conference on Machine Learning (ICML), pages 1223–1232.
  • Hoa Trang Dang. 2005. Overview of DUC 2005. In Proceedings of the Document Understanding Conference, volume 2005, pages 1–12.
  • Giuseppe Di Fabbrizio, Amanda Stent, and Robert Gaizauskas. 2014. A hybrid approach to multidocument summarization of opinions in reviews. pages 54–63.
  • Gunes Erkan and Dragomir R Radev. 2004. Lexrank: Graph-based lexical centrality as salience in text summarization. Journal of artificial intelligence research, 22:457–479.
  • Tobias Falke, Leonardo FR Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220.
  • Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pages 94–104, Copenhagen, Denmark. Association for Computational Linguistics.
  • Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org.
  • Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: A graph based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 340–348.
  • Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond T Ng, and Bita Nejat. 2014. Abstractive summarization of product reviews using discourse structure. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1602–1613.
  • Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256.
  • Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on world wide web, pages 507–517.
  • Andrew Hoang, Antoine Bosselut, Asli Celikyilmaz, and Yejin Choi. 2019. Efficient adaptation of pretrained transformers for abstractive summarization. arXiv preprint arXiv:1906.00138.
  • Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177. ACM.
  • Masaru Isonuma, Toru Fujino, Junichiro Mori, Yutaka Matsuo, and Ichiro Sakata. 2017. Extractive summarization using multi-task learning with document classification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2101–2110.
  • Masaru Isonuma, Junichiro Mori, and Ichiro Sakata. 2019. Unsupervised neural single-document summarization of reviews via learning latent discourse structure and its ranking. In Proceedings of Association for Computational Linguistics (ACL).
  • Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Svetlana Kiritchenko and Saif M Mohammad. 2016. Capturing reliable fine-grained sentiment associations by crowdsourcing and best–worst scaling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 811–817.
  • Daphne Koller and Nir Friedman. 2009. Probabilistic graphical models: principles and techniques. MIT press.
  • Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81.
  • Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating wikipedia by summarizing long sequences. In Proceedings of International Conference on Learning Representations (ICLR).
  • Jordan J Louviere, Terry N Flynn, and Anthony Alfred John Marley. 2015. Best-worst scaling: Theory, methods and applications. Cambridge University Press.
  • Jordan J Louviere and George G Woodworth. 1991. Best-worst scaling: A model for the largest difference judgments. University of Alberta: Working Paper.
  • Walaa Medhat, Ahmed Hassan, and Hoda Korashy. 2014. Sentiment analysis algorithms and applications: A survey. Ain Shams engineering journal, 5(4):1093–1113.
  • Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290.
  • Bryan Orme. 2009. Maxdiff analysis: Simple counting, individual-level logit, and hb. Sequim, WA: Sawtooth Software.
  • Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
  • Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548.
  • Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 157–163.
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  • Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389.
  • Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointergenerator networks. In Proceedings of Association for Computational Linguistics (ACL).
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. Proceedings of Association for Computational Linguistics (ACL).
  • Yoshihiko Suhara, Xiaolan Wang, Stefanos Angelidis, and Wang-Chiew Tan. 2020. Opiniondigest: A simple framework for opinion summarization. Proceedings of Association for Computational Linguistics (ACL).
  • Wenyi Tay, Aditya Joshi, Xiuzhen Jenny Zhang, Sarvnaz Karimi, and Stephen Wan. 2019. Red-faced rouge: Examining the suitability of rouge for opinion summary evaluation. In Proceedings of the The 17th Annual Workshop of the Australasian Language Technology Association, pages 52–60.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638.
  • Ronald J Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280.
  • Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. 2017. Deep sets. In Advances in Neural Information Processing Systems, pages 3391–3401.