Semi-supervised learning of class balance under class-prior change by distribution matching

Neural Networks (the official journal of the International Neural Network Society), Volume 50, 2014, Pages 110–119.

Cited by: 98
Keywords:
class-prior change, test input data, instance re-weighting, semi-supervised learning, systematical bias correction

Abstract:

In real-world classification problems, the class balance in the training dataset does not necessarily reflect that of the test dataset, which can cause significant estimation bias. If the class ratio of the test dataset is known, instance re-weighting or resampling allows systematic bias correction. However, learning the class ratio of …

Introduction
Highlights
  • Most supervised learning algorithms assume that training and test data follow the same probability distribution (Bishop, 2006; Hastie, Tibshirani, & Friedman, 2001; Vapnik, 1998)
  • This de facto standard assumption is often violated in real-world problems, caused by intrinsic sample selection bias or inevitable non-stationarity (Heckman, 1979; Quiñonero-Candela, Sugiyama, Schwaighofer, & Lawrence, 2009; Sugiyama & Kawanabe, 2012)
  • Such a situation is called a class-prior change, and the bias caused by differing class balances can be systematically adjusted by instance re-weighting or resampling if the class balance in the test dataset is known (Elkan, 2001; Lin, Lee, & Wahba, 2002)
  • The class ratio in the test dataset is often unknown in practice
  • Class-prior change is a problem that is conceivable in many real-world datasets, and it can be systematically corrected for if the class prior of the test dataset is known
  • We showed that the class ratios estimated by the proposed method are more accurate than those of competing methods, which translates into better classification accuracy.
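The re-weighting idea in the highlights can be sketched concretely: if the test class prior p′(y) is known, weighting each training instance by p′(y)/p(y) makes the weighted training loss an unbiased estimate of the test loss. A minimal sketch (the toy dataset and helper name are illustrative, not from the paper; any weight-aware learner could consume the resulting weights):

```python
import numpy as np

def class_balance_weights(y_train, test_prior):
    """Per-instance weights p'(y) / p(y) for correcting class-prior change."""
    classes, counts = np.unique(y_train, return_counts=True)
    train_prior = dict(zip(classes, counts / len(y_train)))
    return np.array([test_prior[c] / train_prior[c] for c in y_train])

# Toy example: training set is 80% class 0, but the test set is balanced.
y_train = np.array([0] * 80 + [1] * 20)
w = class_balance_weights(y_train, test_prior={0: 0.5, 1: 0.5})
# Class-0 instances are down-weighted (0.5/0.8 = 0.625) and class-1
# instances up-weighted (0.5/0.2 = 2.5), so the weighted class balance
# matches the test prior.
```

The same ratio appears in cost-sensitive learning (Elkan, 2001) and in SVMs for non-standard situations (Lin et al., 2002); the hard part, addressed by this paper, is obtaining p′(y) when the test set is unlabeled.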
Methods
  • The authors report experimental results.

    5.1. Benchmark datasets

    The following five methods are compared:

    EM-KLR: The method of Saerens et al. (2001) (see Section 2.2). The class-posterior probability of the training dataset is estimated using ℓ2-penalized kernel logistic regression with Gaussian kernels. The L-BFGS quasi-Newton implementation included in the ‘minFunc’ package is used for logistic regression training (Schmidt, 2005).

    KL–KDE: The estimator of the KL divergence KL(p′ ∥ q′) using kernel density estimation (KDE). The class-wise input densities are estimated by KDE with Gaussian kernels, and the kernel widths are chosen by likelihood cross-validation (Silverman, 1986).
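The EM-KLR baseline rests on the fixed-point iteration of Saerens et al. (2001): given posteriors from a classifier trained under the training-set prior, the test priors are re-estimated by rescaling those posteriors with the current prior ratio and averaging. A simplified sketch, assuming the posterior matrix has already been computed (the function name and interface are ours):

```python
import numpy as np

def em_class_prior(train_posteriors, train_prior, n_iter=100):
    """
    EM-style re-estimation of test class priors (after Saerens et al., 2001).

    train_posteriors: (n_test, n_classes) posteriors p(y|x) of a classifier
                      trained under the *training* prior, evaluated on test inputs.
    train_prior:      (n_classes,) class prior of the training dataset.
    """
    prior = np.asarray(train_prior, dtype=float)
    for _ in range(n_iter):
        # E-step: rescale each posterior by the ratio of current to training prior,
        # then renormalize each row to sum to one.
        adjusted = train_posteriors * (prior / train_prior)
        adjusted /= adjusted.sum(axis=1, keepdims=True)
        # M-step: the new prior estimate is the mean adjusted posterior.
        prior = adjusted.mean(axis=0)
    return prior
```

Each iteration increases the likelihood of the unlabeled test inputs under the implied mixture model, which is the "indirect distribution matching" view developed in the paper.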
Conclusion
  • Class-prior change is a problem that is conceivable in many real-world datasets, and it can be systematically corrected for if the class prior of the test dataset is known.
  • The authors first showed that the EM-based estimator introduced in Saerens et al. (2001) can be regarded as indirectly approximating the test input distribution by a linear combination of class-wise input distributions.
  • Based on this view, the authors proposed to use an explicit and possibly more accurate divergence estimator based on density-ratio estimation (Kanamori et al., 2009) for learning test class-priors.
  • The authors showed that the class ratios estimated by the proposed method are more accurate than those of competing methods, which translates into better classification accuracy.
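The distribution-matching view above can be illustrated in the binary case: choose the mixing weight θ so that the mixture θ·p(x|+1) + (1−θ)·p(x|−1) is as close as possible to the test input distribution p′(x). The sketch below uses Gaussian KDE and maximizes the test log-likelihood (equivalent to minimizing KL(p′ ∥ q′), i.e. the KL–KDE baseline) rather than the paper's density-ratio-based divergence estimator; all names and the grid-search scheme are illustrative:

```python
import numpy as np

def gaussian_kde(data, width):
    """Return a function evaluating a 1-D Gaussian kernel density estimate."""
    def density(x):
        diffs = (x[:, None] - data[None, :]) / width
        return np.exp(-0.5 * diffs**2).mean(axis=1) / (width * np.sqrt(2 * np.pi))
    return density

def match_class_prior(x_pos, x_neg, x_test, width=0.5):
    """
    Grid-search the positive-class test prior theta by matching the mixture
    theta*p(x|+1) + (1-theta)*p(x|-1) to the unlabeled test inputs, here by
    maximizing their average log-likelihood under the mixture.
    """
    p_pos = gaussian_kde(x_pos, width)(x_test)
    p_neg = gaussian_kde(x_neg, width)(x_test)
    thetas = np.linspace(0.01, 0.99, 99)
    scores = [np.log(t * p_pos + (1 - t) * p_neg).mean() for t in thetas]
    return thetas[int(np.argmax(scores))]

# Synthetic check: well-separated classes, true test prior 0.75.
rng = np.random.default_rng(0)
x_pos = rng.normal(-2.0, 1.0, 200)
x_neg = rng.normal(2.0, 1.0, 200)
x_test = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(2.0, 1.0, 100)])
theta = match_class_prior(x_pos, x_neg, x_test)  # close to 0.75
```

The paper's proposal replaces the KDE step with direct density-ratio estimation, which avoids estimating densities in high dimensions.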
Summary
  • Objectives:

    The goal of this paper is to estimate p′(y) from labeled training samples {(x_i, y_i)}_{i=1}^{n} drawn independently from p(x, y) and unlabeled test samples {x′_i}_{i=1}^{n′} drawn independently from p′(x).
Tables
  • Table 1: Datasets used in the experiments. The SAHeart dataset was taken from Hastie et al. (2001). All other datasets were taken from the LIBSVM webpage (Chang & Lin, 2011).
Reference
  • Ali, S. M., & Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B, 28, 131–142.
  • Basu, A., Harris, I. R., Hjort, N. L., & Jones, M. C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3), 549–559.
  • Bishop, C. M. (2006). Pattern recognition and machine learning. New York, NY, USA: Springer.
  • Boyd, S., & Vandenberghe, L. (2004). Convex optimization. New York, NY, USA: Cambridge University Press.
  • Chan, Y. S., & Ng, H. T. (2006). Estimating class priors in domain adaptation for word sense disambiguation. In Proceedings of the 21st international conference on computational linguistics (pp. 89–96).
  • Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27:1–27:27. Software available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  • Chapelle, O., Schölkopf, B., & Zien, A. (Eds.) (2006). Semi-supervised learning. Cambridge, MA, USA: MIT Press.
  • Clémençon, S., Vayatis, N., & Depecker, M. (2009). AUC optimization and the two-sample problem. In Advances in neural information processing systems, Vol. 22 (pp. 360–368).
  • Cortes, C., & Mohri, M. (2004). AUC optimization vs. error rate minimization. In Advances in neural information processing systems, Vol. 16 (pp. 313–320). Cambridge, MA: MIT Press.
  • Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2, 229–318.
  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1), 1–38.
  • Duarte, M. F., & Hu, Y. H. (2004). Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing, 64(7), 826–838.
  • Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York, NY, USA: Wiley.
  • du Plessis, M. C., & Sugiyama, M. (2012). Semi-supervised learning of class balance under class-prior change by distribution matching. In J. Langford, & J. Pineau (Eds.), Proceedings of 29th international conference on machine learning, ICML2012. Edinburgh, Scotland, June 26–July 1 (pp. 823–830).
  • Elkan, C. (2001). The foundations of cost-sensitive learning. In Proceedings of the seventeenth international joint conference on artificial intelligence (pp. 973–978).
  • Hall, P. (1981). On the non-parametric estimation of mixture proportions. Journal of the Royal Statistical Society: Series B, 147–156.
  • Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: data mining, inference, and prediction. New York, NY, USA: Springer.
  • Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47(1), 153–161.
  • Hunter, J., & Nachtergaele, B. (2001). Applied analysis. River Edge, NJ, USA: World Scientific Publishing Co.
  • Kanamori, T., Hido, S., & Sugiyama, M. (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10, 1391–1445.
  • Kanamori, T., Suzuki, T., & Sugiyama, M. (2012). Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86(3), 335–367.
  • Kanamori, T., Suzuki, T., & Sugiyama, M. (2013). Computational complexity of kernel-based density-ratio estimation: a condition number analysis. Machine Learning, 90(3), 431–460.
  • Keziou, A. (2003). Dual representation of φ-divergences and applications. Comptes Rendus Mathématique, 336, 857–862.
  • Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1), 79–86.
  • Latinne, P., Saerens, M., & Decaestecker, C. (2001). Adjusting the outputs of a classifier to new a priori probabilities may significantly improve classification accuracy: evidence from a multi-class problem in remote sensing. In Proceedings of the 18th international conference on machine learning (pp. 298–305).
  • Lin, Y., Lee, Y., & Wahba, G. (2002). Support vector machines for classification in nonstandard situations. Machine Learning, 46(1), 191–202.
  • McLachlan, G. J., & Krishnan, T. (1997). The EM algorithm and extensions. New York, NY, USA: John Wiley and Sons.
  • Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11), 5847–5861.
  • Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50, 157–175.
  • Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., & Lawrence, N. (Eds.) (2009). Dataset shift in machine learning. Cambridge, MA, USA: MIT Press.
  • Rockafellar, R. T. (1970). Convex analysis. Princeton, NJ, USA: Princeton University Press.
  • Saerens, M., Latinne, P., & Decaestecker, C. (2001). Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Computation, 14, 21–41.
  • Schmidt, M. (2005). minFunc—unconstrained differentiable multivariate optimization in MATLAB.
  • Silverman, B. W. (1986). Density estimation: for statistics and data analysis. London, UK: Chapman and Hall.
  • Sugiyama, M. (2010). Superfast-trainable multi-class probabilistic classifier by least-squares posterior fitting. IEICE Transactions on Information and Systems, E93-D, 2690–2701.
  • Sugiyama, M., & Kawanabe, M. (2012). Machine learning in non-stationary environments: introduction to covariate shift adaptation. Cambridge, MA, USA: MIT Press.
  • Sugiyama, M., Krauledat, M., & Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8, 985–1005.
  • Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density ratio estimation in machine learning. Cambridge, UK: Cambridge University Press.
  • Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density ratio matching under the Bregman divergence: a unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5), 1009–1044.
  • Sugiyama, M., Suzuki, T., Kanamori, T., du Plessis, M. C., Liu, S., & Takeuchi, I. (2013). Density-difference estimation. Neural Computation, 25(10), 2734–2775.
  • Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., & Kawanabe, M. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4), 699–746.
  • Titterington, D. (1983). Minimum distance non-parametric estimation of mixture proportions. Journal of the Royal Statistical Society: Series B, 37–46.
  • Van Trees, H. (1968). Detection, estimation, and modulation theory, part I. New York, NY, USA: John Wiley and Sons.
  • Vapnik, V. N. (1998). Statistical learning theory. New York, NY, USA: Wiley.