Less Is Better: Unweighted Data Subsampling via Influence Function

Zifeng Wang
Hong Zhu
Zhenhua Dong

National Conference on Artificial Intelligence (AAAI), 2020.

Keywords:
empirical distribution, influence function, data subsampling, probabilistic sampling, unweighted data subsampling
Weibo:
We theoretically study unweighted subsampling with the influence function, propose a novel unweighted subsampling framework, and design a family of probabilistic sampling methods

Abstract:

In the time of Big Data, training complex models on large-scale data sets is challenging, making it appealing to reduce data volume and save computation resources by subsampling. Most previous works on subsampling are weighted methods designed to help the performance of the subset model approach that of the full-set model; hence the weight...

Introduction
  • Bigger data can probably train a better model.
  • This is almost common sense nowadays among machine learning and deep learning practitioners.
  • Classical Empirical Risk Minimization (ERM) theory assumes that training samples and test samples are drawn i.i.d. from the same distribution.
  • In practice, this assumption is challenged in three ways: 1) the distribution shift from P to Q violates ERM's basic assumption; 2) unknown noise in the data and its labels is common in reality, making some examples harmful to the model's performance (Szegedy et al. 2014; Zhang et al. 2019); 3) training on large data sets imposes a significant computational burden, with some large-scale deep learning models requiring hundreds or even thousands of GPUs.
Highlights
  • Bigger data can probably train a better model
  • In practice, ERM's i.i.d. assumption is challenged in three ways: 1) the distribution shift from P to Q violates ERM's basic assumption; 2) unknown noise in the data and its labels is common in reality, making some examples harmful to the model's performance (Szegedy et al. 2014; Zhang et al. 2019); 3) training on large data sets imposes a significant computational burden, with some large-scale deep learning models requiring hundreds or even thousands of GPUs
  • First, instead of approaching the full-set model θ̂, we prove that the model θ̃ trained on a subset selected by our subsampling method can outperform θ̂; second, we propose several probabilistic sampling functions and analyze how the choice of sampling function affects the worst-case risk (Bagnell 2005) over a χ²-divergence ball
  • We further propose a surrogate metric to measure the confidence degree of the sampling methods over the observed distribution, which is useful for evaluating the model's generalization ability on a set of distributions; third, for implementation efficiency, a Hessian-free mixed Preconditioned Conjugate Gradient (PCG) method is used to compute the influence function (IF) in sparse scenarios (a minimal sketch follows this list); last, thorough experiments are conducted on diverse tasks to demonstrate our method's superiority over existing state-of-the-art subsampling methods
  • Our experiments demonstrate that the mixed PCG is effective for speeding up the calculation of the IF
  • We theoretically study unweighted subsampling with the IF, propose a novel unweighted subsampling framework, and design a family of probabilistic sampling methods
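The sketch below illustrates the Hessian-free idea behind the IF computation for ℓ2-regularized logistic regression. It is a minimal sketch, not the paper's implementation: it uses plain (unpreconditioned) conjugate gradient from SciPy rather than the mixed PCG, and the synthetic data, single stand-in validation point, and helper names are assumptions made for illustration.

```python
# Minimal sketch (not the paper's mixed PCG): Hessian-free computation of
# s = H^{-1} g for L2-regularized logistic regression via conjugate gradient,
# so the Hessian is never formed explicitly -- only Hessian-vector products.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n, d, lam = 1000, 50, 1e-2
X = rng.standard_normal((n, d))          # synthetic features
y = rng.integers(0, 2, size=n)           # synthetic binary labels
theta = np.zeros(d)                      # stand-in for the trained model

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

p = sigmoid(X @ theta)                   # predicted probabilities

def hvp(v):
    """Hessian-vector product: H v = X^T diag(p(1-p)) X v / n + lam * v."""
    return X.T @ (p * (1.0 - p) * (X @ v)) / n + lam * v

H = LinearOperator((d, d), matvec=hvp)

# Gradient of the loss at one stand-in validation point (held out in practice).
x_val, y_val = X[0], y[0]
g_val = (sigmoid(x_val @ theta) - y_val) * x_val

# Solve H s = g_val with (unpreconditioned) conjugate gradient.
s, info = cg(H, g_val, maxiter=100)

# Influence of up-weighting training point i on the validation loss:
# I(z_i) = -grad_val^T H^{-1} grad_i = -s^T grad_i  (H is symmetric).
grads_train = (p - y)[:, None] * X       # per-sample log-loss gradients
influences = -grads_train @ s
print(influences[:5])
```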
Methods
  • The key challenge is that θ̂ may not be the best risk minimizer with respect to Q, due to the distribution shift between P and Q and unknown noisy samples in the training set.
  • The essential idea is that, given a test distribution Q, some samples in the training set increase the test risk.
  • If those samples are downweighted, the test risk can be reduced, namely R_θ̃(Q) ≤ R_θ̂(Q), where θ̃ is the new model learned after some harmful samples are downweighted.
  • Given m samples {z_j}_{j=1}^m from another distribution Q, the objective is to design the subsampling scheme that minimizes the test risk.
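The selection step can then be read as a probabilistic, unweighted subsampling. The sketch below is one possible instantiation under stated assumptions: per-sample influences are mapped to keep-probabilities by a sigmoid (one member of a family of sampling functions, not necessarily the paper's exact choice), harmful samples are mostly dropped, and the retained subset is retrained with ordinary, unweighted ERM; all names and data are illustrative.

```python
# Illustrative sketch of unweighted probabilistic subsampling.
# Convention assumed here: influences[i] > 0 means up-weighting sample i
# increases the validation risk, i.e. the sample is likely harmful.
import numpy as np

def sampling_probabilities(influences, scale=1.0):
    """Map per-sample influences to keep-probabilities in (0, 1);
    harmful samples (positive influence) get probabilities below 0.5."""
    return 1.0 / (1.0 + np.exp(scale * influences))

def unweighted_subsample(X, y, influences, rng, scale=1.0):
    """Draw a subset by independent Bernoulli trials; no per-sample weights
    are kept, so the subset is later trained with ordinary ERM."""
    keep = rng.random(len(y)) < sampling_probabilities(influences, scale)
    return X[keep], y[keep]

# Tiny synthetic demo: samples with large positive influence are mostly dropped.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
y = rng.integers(0, 2, size=8)
influences = np.array([-2.0, -1.0, 0.5, 3.0, -0.2, 1.5, -4.0, 0.1])
X_sub, y_sub = unweighted_subsample(X, y, influences, rng, scale=2.0)
print(len(y_sub), "of", len(y), "samples kept")
# The subset model (theta-tilde) is then retrained on (X_sub, y_sub) without weights.
```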
Conclusion
  • Conclusion & Future Work

    In this work, the authors theoretically study unweighted subsampling with the IF, propose a novel unweighted subsampling framework, and design a family of probabilistic sampling methods.
  • Since the framework is applicable to both convex and non-convex models, the authors can further verify its performance on non-convex models, e.g., deep neural networks.
  • Another direction is to develop better approaches to deal with the overfitting issue, e.g., building a validation-set selection scheme.
  • The authors plan to deploy the method in industrial settings in the future.
Tables
  • Table 1: Main notation; data set statistics
  • Table 2: Average logloss evaluated on the out-of-sample test (Te) set with the sampling ratio set to 95%
  • Table 3: Time costs of computing the influence function on the whole training set
Related work
  • There are two main ideas for coping with the ERM challenges mentioned above: 1) pessimistic methods, which try to learn models robust to noise or bad examples, including ℓ2-norm regularization, AdaBoost (Freund and Schapire 1997), hard example mining (Malisiewicz, Gupta, and Efros 2011), and focal loss (Lin et al. 2017); and 2) optimistic methods, which modify the input distribution directly. There are several genres of optimistic methods: example reweighting is used for dealing with distribution shift by (Bagnell 2005; Hu et al. 2016) and for handling data bias by (Kumar, Packer, and Koller 2010; Ren et al. 2018); sample selection is applied to inspect and fix mislabeled data by (Zhang, Zhu, and Wright 2018). However, few of these works address the computational burden posed by big data.

    In order to reduce computation, weighted subsampling has been explored to approximate the maximum likelihood with a subset for logistic regression (Fithian and Hastie 2014; Wang, Zhu, and Ma 2018) and for generalized linear models (Ai et al. 2018). (Ting and Brochu 2018) introduce the IF into weighted subsampling to obtain asymptotically optimal sampling probabilities for several generalized linear models. However, how to handle the high variance of the weight terms in weighted subsampling remains an open problem.

    Specifically, the IF is defined via Gateaux derivatives within the scope of robust statistics (Huber 2011), and has been extended to measure example-wise influence (Koh and Liang 2017) and feature-wise influence (Sliwinski, Strobel, and Zick 2019) on the validation loss. The IF family has previously been applied mainly to designing adversarial examples and explaining the behaviour of black-box models. Recently, the IF on the validation loss has been used to target important samples: (Wang, Huan, and Li 2018) build a sample selection scheme for deep Convolutional Neural Networks (CNNs), and (Sharchilev et al. 2018) build an influential-sample selection algorithm specifically for Gradient Boosted Decision Trees (GBDT). However, so far there is no systematic theory to guide the use of the IF in subsampling. Our work tries to build theoretical guidance for IF-based subsampling, which combines reweighting and subsampling to jointly cope with ERM's challenges, e.g., distribution shift and noisy data.
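    For concreteness, the example-wise influence referenced above takes the standard form from (Koh and Liang 2017): up-weighting a training point z by an infinitesimal amount changes the loss at a validation point z_val at the rate I_up,loss(z, z_val) = −∇_θ ℓ(z_val, θ̂)ᵀ H_θ̂⁻¹ ∇_θ ℓ(z, θ̂), where H_θ̂ = (1/n) Σ_{i=1}^{n} ∇²_θ ℓ(z_i, θ̂) is the Hessian of the empirical risk; a positive value indicates that up-weighting z increases the validation loss.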
Funding
  • The research of Shao-Lun Huang was funded by the Natural Science Foundation of China (61807021), the Shenzhen Science and Technology Research and Development Funds (JCYJ20170818094022586), and the Innovation and Entrepreneurship Project for Overseas High-Level Talents of Shenzhen (KQJSCX20180327144037831).
Reference
  • [Agarwal, Bullins, and Hazan 2017] Agarwal, N.; Bullins, B.; and Hazan, E. 2017. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research 18(1):4148–4187.
  • [Ai et al. 2018] Ai, M.; Yu, J.; Zhang, H.; and Wang, H. 2018. Optimal subsampling algorithms for big data generalized linear models. arXiv preprint arXiv:1806.06761.
  • [Bagnell 2005] Bagnell, J. A. 2005. Robust supervised learning. In Proceedings of the Twentieth National Conference on Artificial Intelligence and the Seventeenth Innovative Applications of Artificial Intelligence Conference, July 9-13, 2005, Pittsburgh, Pennsylvania, USA, 714–719.
  • [Cook and Weisberg 1980] Cook, R. D., and Weisberg, S. 1980. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics 22(4):495–508.
  • [Duchi and Namkoong 2018] Duchi, J. C., and Namkoong, H. 2018. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750.
  • [Fithian and Hastie 2014] Fithian, W., and Hastie, T. 2014. Local case-control sampling: Efficient subsampling in imbalanced data sets. Annals of Statistics 42(5):1693.
  • [Freund and Schapire 1997] Freund, Y., and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1):119–139.
  • [Hsia, Chiang, and Lin 2018] Hsia, C.-Y.; Chiang, W.-L.; and Lin, C.-J. 2018. Preconditioned conjugate gradient methods in truncated Newton frameworks for large-scale linear classification. In Asian Conference on Machine Learning, 312–326.
  • [Hu et al. 2016] Hu, W.; Niu, G.; Sato, I.; and Sugiyama, M. 2016. Does distributionally robust supervised learning give robust classifiers? In ICML.
  • [Hu et al. 2018] Hu, W.; Niu, G.; Sato, I.; and Sugiyama, M. 2018. Does distributionally robust supervised learning give robust classifiers? In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, 2034–2042.
  • [Huber 2011] Huber, P. J. 2011. Robust Statistics. Springer.
  • [Koh and Liang 2017] Koh, P. W., and Liang, P. 2017. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 1885–1894. JMLR.org.
  • [Kumar, Packer, and Koller 2010] Kumar, M. P.; Packer, B.; and Koller, D. 2010. Self-paced learning for latent variable models. In 24th Annual Conference on Neural Information Processing Systems 2010 (NIPS), 1189–1197.
  • [Lin et al. 2017] Lin, T.; Goyal, P.; Girshick, R. B.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2999–3007.
  • [Malisiewicz, Gupta, and Efros 2011] Malisiewicz, T.; Gupta, A.; and Efros, A. A. 2011. Ensemble of exemplar-SVMs for object detection and beyond. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, 89–96.
  • [Martens 2010] Martens, J. 2010. Deep learning via Hessian-free optimization. In ICML, volume 27, 735–742.
  • [Nash 1985] Nash, S. G. 1985. Preconditioning of truncated-Newton methods. SIAM Journal on Scientific and Statistical Computing 6(3):599–616.
  • [Ren et al. 2018] Ren, M.; Zeng, W.; Yang, B.; and Urtasun, R. 2018. Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050.
  • [Schnabel et al. 2016] Schnabel, T.; Swaminathan, A.; Singh, A.; Chandak, N.; and Joachims, T. 2016. Recommendations as treatments: Debiasing learning and evaluation. arXiv preprint arXiv:1602.05352.
  • [Sharchilev et al. 2018] Sharchilev, B.; Ustinovsky, Y.; Serdyukov, P.; and de Rijke, M. 2018. Finding influential training samples for gradient boosted decision trees. arXiv preprint arXiv:1802.06640.
  • [Sliwinski, Strobel, and Zick 2019] Sliwinski, J.; Strobel, M.; and Zick, Y. 2019. Axiomatic characterization of data-driven influence measures for classification. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, 718–725.
  • [Szegedy et al. 2014] Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I. J.; and Fergus, R. 2014. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.
  • [Ting and Brochu 2018] Ting, D., and Brochu, E. 2018. Optimal subsampling with influence functions. In Advances in Neural Information Processing Systems, 3650–3659.
  • [Wang, Huan, and Li 2018] Wang, T.; Huan, J.; and Li, B. 2018. Data dropout: Optimizing training data for convolutional neural networks. In 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), 39–46. IEEE.
  • [Wang, Zhu, and Ma 2018] Wang, H.; Zhu, R.; and Ma, P. 2018. Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association 113(522):829–844.
  • [Zhang et al. 2019] Zhang, H.; Yu, Y.; Jiao, J.; Xing, E. P.; Ghaoui, L. E.; and Jordan, M. I. 2019. Theoretically principled trade-off between robustness and accuracy. In ICML.
  • [Zhang, Zhu, and Wright 2018] Zhang, X.; Zhu, X.; and Wright, S. J. 2018. Training set debugging using trusted items. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 4482–4489.