Multi-Stage Influence Function

NeurIPS 2020.

Keywords:
pretraining data, inverse Hessian-vector product, multi-stage influence function, multi-stage training, conjugate gradients
One-sentence summary:
We develop a multi-stage influence function score to track predictions from a finetuned model all the way back to the pretraining data.

Abstract:

Multi-stage training and knowledge transfer, from a large-scale pretraining task to various finetuning tasks, have revolutionized natural language processing and computer vision, resulting in state-of-the-art performance improvements. In this paper, we develop a multi-stage influence function score to track predictions from a finetuned model all the way back to the pretraining data.

Introduction
  • Multi-stage training has become increasingly important and has achieved state-of-the-art results in many tasks.
  • The successes of these multi-stage learning paradigms are due to knowledge transfer from pretraining tasks to the end task.
  • Which part of the pretraining data/task contributes most to the end task? How can one detect “false transfer,” where some pretraining data/task could be harmful for the end task? If a test point is wrongly predicted by the finetuned model, can we trace back to the problematic examples in the pretraining data?
  • Answering these questions requires a quantitative measurement of how the data and loss function in the pretraining stage influence the end model, which has not been studied in the past and is the main focus of this paper (the classic single-stage form that this work extends is recalled below).
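
For background, the single-stage influence function of Koh and Liang [13] measures how infinitesimally upweighting a training point z by ε changes the loss at a test point. In standard notation (a well-known result, stated here for reference rather than taken from this page):

```latex
\mathcal{I}(z, z_{\mathrm{test}})
  = \left.\frac{d\,\ell\bigl(z_{\mathrm{test}},\hat{\theta}_{\varepsilon,z}\bigr)}{d\varepsilon}\right|_{\varepsilon=0}
  = -\nabla_{\theta}\,\ell\bigl(z_{\mathrm{test}},\hat{\theta}\bigr)^{\top}
     H_{\hat{\theta}}^{-1}\,
     \nabla_{\theta}\,\ell\bigl(z,\hat{\theta}\bigr),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n}\nabla_{\theta}^{2}\,\ell\bigl(z_{i},\hat{\theta}\bigr).
```

The multi-stage question is what replaces this formula when z lives in the pretraining set while the test prediction comes from the finetuned model.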
Highlights
  • Similar ideas in transfer learning have been widely used in many different tasks.
  • We show that the influence of the pretraining data on the finetuned model consists of two parts: the influence of the pretraining data on the pretrained model, and the influence of the pretrained model on the finetuned model (a hedged reconstruction of this decomposition is sketched after this list).
  • Our experimental results on computer vision (CV) and natural language processing (NLP) tasks show a strong correlation between the score of an example, computed from the proposed multi-stage influence function, and the true loss difference when the example is removed from the pretraining data.
  • We believe our multi-stage influence function is a promising approach to connect the performance of a finetuned model with the pretraining data.
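
A hedged reconstruction of that two-part decomposition, in notation of my own choosing (W for pretrained parameters, Θ for finetuned parameters, ℓ_p and L_f for the pretraining and finetuning losses; the paper's exact formula may differ). Upweighting a pretraining example z by ε perturbs the pretrained optimum W*, which in turn moves the finetuned optimum Θ*(W*); chaining the two implicit-function steps gives

```latex
\frac{dW^{*}}{d\varepsilon} = -H_{W^{*}}^{-1}\,\nabla_{W}\,\ell_{p}(z, W^{*}),
\qquad
\frac{d\Theta^{*}}{dW} = -H_{\Theta^{*}}^{-1}\,\nabla_{\Theta W}^{2} L_{f}(W^{*}, \Theta^{*}),
\qquad
\mathcal{I}(z, z_{\mathrm{test}})
  = \nabla_{\Theta}\,\ell\bigl(z_{\mathrm{test}}, \Theta^{*}\bigr)^{\top}
    \frac{d\Theta^{*}}{dW}\,\frac{dW^{*}}{d\varepsilon}.
```

The first factor captures stage one (pretraining data → pretrained model) and the second captures stage two (pretrained model → finetuned model); both inverse Hessian-vector products can be approximated with conjugate gradients, consistent with the keywords above.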
Methods
  • The authors conduct experiments on real datasets in both vision and NLP tasks to show the effectiveness of the proposed method.
Results
  • The authors show the application of the proposed method on an NLP task.
  • In this experiment, the pretraining task is training an ELMo [20] model on the One Billion Word (OBW) dataset [3], which contains 30 million sentences and 8 million unique words.
  • The final pretrained ELMo model contains 93.6 million parameters.
  • Test examples are from a binary sentiment classification task on Twitter.
Conclusion
  • The authors introduce a multi-stage influence function for two multi-stage training setups: 1) the pretrained embedding is fixed during finetuning, and 2) the pretrained embedding is updated during finetuning.
  • The authors' experimental results on CV and NLP tasks show a strong correlation between the score of an example, computed from the proposed multi-stage influence function, and the true loss difference when the example is removed from the pretraining data.
  • The authors believe the multi-stage influence function is a promising approach to connect the performance of a finetuned model with the pretraining data (a minimal computational sketch follows this list).
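
A minimal, runnable numpy sketch of how such a score could be assembled, following the hedged decomposition sketched after the Highlights. Everything here is illustrative, not the paper's code: the toy positive-definite Hessians, the dimensions, and the variable names are assumptions. The inverse Hessian-vector products are solved with conjugate gradients (matching the paper's keywords), so no Hessian inverse is ever formed explicitly.

```python
# Illustrative sketch only: toy quadratic stand-ins, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
dW, dT = 8, 5  # dims of pretrained params W and finetuned params Theta (toy)

# Toy symmetric positive-definite Hessians standing in for H_W (pretraining
# loss at W*) and H_Theta (finetuning loss at Theta*).
A = rng.normal(size=(dW, dW)); H_W = A @ A.T + dW * np.eye(dW)
B = rng.normal(size=(dT, dT)); H_T = B @ B.T + dT * np.eye(dT)

# Mixed second derivative d^2 L_f / dTheta dW (how the finetuned optimum
# moves when the pretrained params move) and the two gradients we need.
H_TW    = rng.normal(size=(dT, dW))
g_test  = rng.normal(size=dT)   # grad of the test loss w.r.t. Theta
g_pre_z = rng.normal(size=dW)   # grad of pretraining loss at example z w.r.t. W

def cg_solve(hvp, b, iters=100, tol=1e-10):
    """Solve H x = b using only Hessian-vector products hvp(v) = H v."""
    x, r = np.zeros_like(b), b.copy()
    p, rs = r.copy(), r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Stage 2: v = H_Theta^{-1} g_test (the standard single-stage iHVP).
v = cg_solve(lambda t: H_T @ t, g_test)
# Chain through the mixed term, then stage 1: u = H_W^{-1} (H_TW^T v).
u = cg_solve(lambda w: H_W @ w, H_TW.T @ v)
# Influence of upweighting pretraining example z on the test loss; the two
# minus signs from the two implicit-function steps cancel each other.
score = u @ g_pre_z
print("multi-stage influence score:", score)
```

In a real model the `hvp` callbacks would be autodiff Hessian-vector products over minibatches rather than explicit matrices; conjugate gradients only ever needs those products, which is what makes the approach tractable at the 93.6-million-parameter scale mentioned above.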
Tables
  • Table 1: Examples of test sentences and pretraining sentences with the largest and the smallest absolute influence function score values in our subset of the pretraining data.
Related work
  • Multi-stage model training, which trains models in many stages on different tasks to improve the end task, has been widely used in many machine learning areas. For example, transfer learning has been widely used to transfer knowledge from a source task to a target task [18]. More recently, researchers have shown that training a computer vision or NLP encoder on a source task with a huge amount of data can often benefit the performance of small end tasks; these techniques, including BERT [7], ELMo [20], and large ResNet pretraining [15], have achieved state-of-the-art results on many tasks.

    Although multi-stage models have been widely used, there are few works on understanding multi-stage models and exploiting the influence of the training data in the pretraining step to benefit the finetuning task. In contrast, there are many works that focus on understanding single-stage machine learning models and explaining model predictions. Algorithms developed along this line of research can be categorized into feature-based and data-based approaches. Feature-based approaches aim to explain predictions with respect to model variables, tracing the contribution of variables back to the prediction [17, 9, 21, 23, 22, 25, 8, 6, 1]. However, they do not aim to attribute the prediction back to the training data.
Funding
  • After removing the examples with the top 10% of influence scores (positive values) from the pretraining (source) data, the accuracy on the target data improves from 58.15% to 58.36% (a minimal filtering sketch follows).
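
A minimal sketch of that filtering step, assuming a `scores` array of per-example multi-stage influence values has already been computed; the function name and data layout are illustrative, not from the paper's code:

```python
import numpy as np

def drop_most_harmful(pretrain_examples, scores, frac=0.10):
    """Drop the `frac` of pretraining examples with the most positive
    influence scores (those predicted to increase the target loss);
    pretraining and finetuning are then rerun on the remainder."""
    scores = np.asarray(scores)
    k = int(len(scores) * frac)
    keep = np.argsort(scores)[: len(scores) - k]  # drop the k largest scores
    return [pretrain_examples[i] for i in keep]

# Toy usage with made-up scores: drops "sent_c", the largest score.
examples = ["sent_a", "sent_b", "sent_c", "sent_d", "sent_e",
            "sent_f", "sent_g", "sent_h", "sent_i", "sent_j"]
scores = [0.1, -0.5, 2.3, 0.0, -1.2, 0.7, 0.2, -0.1, 1.5, 0.05]
print(drop_most_harmful(examples, scores))
```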
Study subjects and analysis
tweets: 16,654
The final pretrained ELMo model contains 93.6 million parameters. The finetuning task is a binary sentiment classification task on the First GOP Debate Twitter Sentiment dataset¹, containing 16,654 tweets about the early August GOP debate in Ohio.
¹ https://www.kaggle.com/crowdflower/first-gop-debate-twitter-sentiment

References
  • [1] Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. A unified view of gradient-based attribution methods for deep neural networks. In International Conference on Learning Representations, 2018.
  • [2] Marc-Etienne Brunet, Colleen Alkalay-Houlihan, Ashton Anderson, and Richard Zemel. Understanding the origins of bias in word embeddings. In ICML, pages 803–811, 2019.
  • [3] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.
  • [4] Bruce Christianson. Automatic Hessians by reverse accumulation. IMA Journal of Numerical Analysis, 12(2):135–150, 1992.
  • [5] R. Dennis Cook and Sanford Weisberg. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics, 22(4):495–508, 1980.
  • [6] Piotr Dabkowski and Yarin Gal. Real time image saliency for black box classifiers. In NIPS, 2017.
  • [7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [8] Ruth C. Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 3449–3457, 2017.
  • [9] Tian Guo, Tao Lin, and Nino Antulov-Fantulin. Exploring interpretable LSTM neural networks over multi-variable data. In Proceedings of the 36th International Conference on Machine Learning, pages 2494–2504, 2019.
  • [10] Satoshi Hara, Atsushi Nitanda, and Takanori Maehara. Data cleansing for models trained with SGD. In Advances in Neural Information Processing Systems 32, pages 4213–4222, 2019.
  • [11] Rajiv Khanna, Been Kim, Joydeep Ghosh, and Sanmi Koyejo. Interpreting black box predictions using Fisher kernels. In The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS 2019), volume 89, pages 3382–3390, 2019.
  • [12] Pang Wei Koh, Kai-Siang Ang, Hubert H. K. Teo, and Percy Liang. On the accuracy of influence functions for measuring group effects. In NeurIPS, 2019.
  • [13] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1885–1894. JMLR.org, 2017.
  • [14] Hugo Larochelle, Dumitru Erhan, and Yoshua Bengio. Zero-data learning of new tasks. In Proceedings of the 23rd National Conference on Artificial Intelligence, Volume 2 (AAAI'08), pages 646–651, 2008.
  • [15] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 181–196, 2018.
  • [16] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
  • [17] Jose Oramas, Kaili Wang, and Tinne Tuytelaars. Visual explanation by interpretation: Improving visual feedback capabilities of deep neural networks. In International Conference on Learning Representations, 2019.
  • [18] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2009.
  • [19] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
  • [20] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
  • [21] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3145–3153. JMLR.org, 2017.
  • [22] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
  • [23] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
  • [24] Jacob Steinhardt, Pang Wei Koh, and Percy Liang. Certified defenses for data poisoning attacks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 3520–3532, 2017.
  • [25] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Gradients of counterfactuals. CoRR, abs/1611.02639, 2016.
  • [26] Hao Wang, Berk Ustun, and Flávio P. Calmon. Repairing without retraining: Avoiding disparate impact with counterfactual distributions. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), pages 6618–6627, 2019.
  • [27] Chih-Kuan Yeh, Joon Kim, Ian En-Hsu Yen, and Pradeep K. Ravikumar. Representer point selection for explaining deep neural networks. In Advances in Neural Information Processing Systems 31, pages 9291–9301, 2018.