Semi-supervised learning of class balance under class-prior change by distribution matching
Neural Networks: The Official Journal of the International Neural Network Society, Volume 50, 2014, Pages 110–119.
Keywords:
class-prior change, test input data, instance re-weighting, semi-supervised learning, systematic bias correction
Abstract:
In real-world classification problems, the class balance in the training dataset does not necessarily reflect that of the test dataset, which can cause significant estimation bias. If the class ratio of the test dataset is known, instance re-weighting or resampling allows systematic bias correction. However, the class ratio of the test dataset is often unknown in practice; this paper therefore proposes to estimate it from unlabeled test data by matching the distributions of training and test input data.
Introduction
- Most supervised learning algorithms assume that training and test data follow the same probability distribution (Bishop, 2006; Hastie, Tibshirani, & Friedman, 2001; Vapnik, 1998)
- This de facto standard assumption is often violated in real-world problems, caused by intrinsic sample selection bias or inevitable non-stationarity (Heckman, 1979; Quiñonero-Candela, Sugiyama, Schwaighofer, & Lawrence, 2009; Sugiyama & Kawanabe, 2012).
Highlights
- Most supervised learning algorithms assume that training and test data follow the same probability distribution (Bishop, 2006; Hastie, Tibshirani, & Friedman, 2001; Vapnik, 1998)
- This de facto standard assumption is often violated in real-world problems, caused by intrinsic sample selection bias or inevitable non-stationarity (Heckman, 1979; Quiñonero-Candela, Sugiyama, Schwaighofer, & Lawrence, 2009; Sugiyama & Kawanabe, 2012)
- Such a situation is called a class-prior change, and the bias caused by differing class balances can be systematically adjusted by instance re-weighting or resampling if the class balance in the test dataset is known (Elkan, 2001; Lin, Lee, & Wahba, 2002), as illustrated in the re-weighting sketch after this list
- The class ratio in the test dataset is often unknown in practice
- Class-prior change is a problem that is conceivable in many real-world datasets, and it can be systematically corrected for if the class prior of the test dataset is known
- We showed that the class ratios estimated by the proposed method are more accurate than those of competing methods, which translates into better classification accuracy
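The bias correction mentioned above can be made concrete with a small sketch: each training instance is weighted by the ratio of the test-class prior to the training-class prior for its label before a classifier is fitted. This is a minimal illustration assuming scikit-learn's LogisticRegression (any probabilistic classifier that accepts per-sample weights would do); it is not code from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def reweight_and_fit(X_train, y_train, test_priors):
    """Correct class-prior change by weighting each training instance
    with the ratio p_test(y) / p_train(y) for its label y."""
    classes, counts = np.unique(y_train, return_counts=True)
    train_priors = dict(zip(classes, counts / counts.sum()))
    # The importance weight of an instance depends only on its class label.
    weights = np.array([test_priors[y] / train_priors[y] for y in y_train])
    clf = LogisticRegression()
    clf.fit(X_train, y_train, sample_weight=weights)
    return clf
```

For example, if the positive class makes up 50% of the training set but only 10% of the test set, each positive training instance receives weight 0.1 / 0.5 = 0.2 and each negative instance 0.9 / 0.5 = 1.8.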
Methods
- The authors report experimental results.
5.1. Benchmark datasets
The following five methods are compared:
- EM-KLR: The method of Saerens et al. (2001) (see Section 2.2). The class-posterior probability of the training dataset is estimated using l2-penalized kernel logistic regression with Gaussian kernels. The L-BFGS quasi-Newton implementation included in the ‘minFunc’ package is used for logistic regression training (Schmidt, 2005).
- KL–KDE: The estimator of the KL divergence KL(p′ ∥ q′) using kernel density estimation (KDE). The class-wise input densities are estimated by KDE with Gaussian kernels, and the kernel widths are chosen by likelihood cross-validation (Silverman, 1986). A schematic reconstruction of this baseline is given after this list.
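The KL–KDE baseline can be summarized as follows: model the test input density as a class-prior-weighted mixture of the class-wise training densities, estimate those densities by Gaussian-kernel KDE, and pick the prior that makes the mixture closest to the test input distribution in KL divergence. The sketch below is an illustrative reconstruction for the binary case, with a grid search over the prior and a fixed bandwidth in place of likelihood cross-validation; it is not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def estimate_test_prior_kl_kde(X_pos, X_neg, X_test, bandwidth=1.0):
    """Estimate the test prior of the positive class by matching
    theta * p(x|y=+1) + (1 - theta) * p(x|y=-1) to the test inputs
    under the KL divergence, using KDE for the class-wise densities."""
    kde_pos = KernelDensity(bandwidth=bandwidth).fit(X_pos)
    kde_neg = KernelDensity(bandwidth=bandwidth).fit(X_neg)
    dens_pos = np.exp(kde_pos.score_samples(X_test))  # estimated p(x'|y=+1)
    dens_neg = np.exp(kde_neg.score_samples(X_test))  # estimated p(x'|y=-1)

    # Minimizing KL(p' || q'_theta) over theta is equivalent to maximizing
    # the average log-likelihood of the mixture on the test inputs, since
    # the entropy of p' does not depend on theta.
    thetas = np.linspace(0.01, 0.99, 99)
    scores = [np.mean(np.log(t * dens_pos + (1 - t) * dens_neg)) for t in thetas]
    return thetas[int(np.argmax(scores))]
```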
Conclusion
- Class-prior change is a problem that is conceivable in many real-world datasets, and it can be systematically corrected for if the class prior of the test dataset is known.
- The authors first showed that the EM-based estimator introduced in Saerens et al. (2001) can be regarded as indirectly approximating the test input distribution by a linear combination of class-wise input distributions (a sketch of this EM procedure is given after this list).
- Based on this view, the authors proposed to use an explicit and possibly more accurate divergence estimator based on density-ratio estimation (Kanamori et al., 2009) for learning test class-priors.
- The authors showed that the class ratios estimated by the proposed method are more accurate than those of competing methods, which translates into better classification accuracy.
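For reference, the EM procedure of Saerens et al. (2001) alternates between adjusting the classifier's training-domain posteriors with the current prior estimate and averaging the adjusted posteriors over the unlabeled test inputs. The sketch below is a generic reconstruction of that scheme for the binary case; the posterior values are assumed to come from any probabilistic classifier trained on the training set (kernel logistic regression in the paper).

```python
import numpy as np

def em_test_prior(posteriors_test, train_prior, n_iter=1000, tol=1e-8):
    """EM re-estimation of the positive-class prior on the test set.

    posteriors_test : p_train(y=+1 | x'_i) for each unlabeled test input
    train_prior     : p_train(y=+1), the positive-class prior in training
    """
    theta = train_prior
    for _ in range(n_iter):
        # E-step: posteriors re-weighted to the current prior estimate.
        num = (theta / train_prior) * posteriors_test
        den = num + ((1 - theta) / (1 - train_prior)) * (1 - posteriors_test)
        adjusted = num / den
        # M-step: the new prior is the average adjusted posterior.
        theta_new = adjusted.mean()
        if abs(theta_new - theta) < tol:
            break
        theta = theta_new
    return theta
```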
Summary
Objectives:
The goal of this paper is to estimate the test class-prior p′(y) from labeled training samples {(x_i, y_i)}_{i=1}^{n} drawn independently from p(x, y), together with unlabeled test samples {x′_i}_{i=1}^{n′} drawn independently from the test input distribution p′(x).
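Concretely, for the binary case this amounts to modeling the test input density as a class-prior-weighted mixture of the class-wise training input densities and matching that mixture to the test input distribution. The display below is a schematic formulation consistent with the title and conclusion, where Div stands for whichever divergence estimator is plugged in (the KL divergence in the KDE baseline, or the density-ratio-based estimator proposed in the paper):

\[
q'_\theta(x) = \theta\, p(x \mid y = +1) + (1 - \theta)\, p(x \mid y = -1),
\qquad
\widehat{\theta} = \operatorname*{arg\,min}_{\theta \in [0,1]} \mathrm{Div}\bigl(p'(x) \,\big\|\, q'_\theta(x)\bigr),
\]

and the estimated test class-prior is \(\widehat{p}'(y = +1) = \widehat{\theta}\).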
Tables
- Table 1: Datasets used in the experiments. Source: The SAHeart dataset was taken from Hastie et al. (2001). All other datasets were taken from the LIBSVM webpage (Chang & Lin, 2011).
Reference
- Ali, S. M., & Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B, 28, 131–142.
- Basu, A., Harris, I. R., Hjort, N. L., & Jones, M. C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3), 549–559.
- Bishop, C. M. (2006). Pattern recognition and machine learning. New York, NY, USA: Springer.
- Boyd, S., & Vandenberghe, L. (2004). Convex optimization. New York, NY, USA: Cambridge University Press.
- Chan, Y. S., & Ng, H. T. (2006). Estimating class priors in domain adaptation for word sense disambiguation. In Proceedings of the 21st international conference on computational linguistics (pp. 89–96).
- Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27:1–27:27. Software Available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
- Chapelle, O., Schölkopf, B., & Zien, A. (Eds.) (2006). Semi-supervised learning. Cambridge, MA, USA: MIT Press.
- Clémençon, S., Vayatis, N., & Depecker, M. (2009). AUC optimization and the two-sample problem. In Advances in neural information processing systems, Vol. 22 (pp. 360–368).
- Cortes, C., & Mohri, M. (2004). AUC optimization vs. error rate minimization. In Advances in neural information processing systems, Vol. 16 (pp. 313–320). Cambridge, MA: MIT Press.
- Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2, 229–318.
- Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
- Duarte, M. F., & Hu, Y. H. (2004). Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing, 64(7), 826–838.
- Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York, NY, USA: Wiley.
- du Plessis, M. C., & Sugiyama, M. (2012). Semi-supervised learning of class balance under class-prior change by distribution matching. In J. Langford, & J. Pineau (Eds.), Proceedings of 29th international conference on machine learning, ICML2012. Edinburgh, Scotland, June 26–July 1 (pp. 823–830).
- Elkan, C. (2001). The foundations of cost-sensitive learning. In Proceedings of the seventeenth international joint conference on artificial intelligence (pp. 973–978).
- Hall, P. (1981). On the non-parametric estimation of mixture proportions. Journal of the Royal Statistical Society. Series B. Statistical Methodology, 147–156.
- Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: data mining, inference, and prediction. New York, NY, USA: Springer.
- Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47(1), 153–161.
- Hunter, J., & Nachtergaele, B. (2001). Applied analysis. River Edge, NJ, USA: World Scientific Publishing Co.
- Kanamori, T., Hido, S., & Sugiyama, M. (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10, 1391–1445.
- Kanamori, T., Suzuki, T., & Sugiyama, M. (2013). Computational complexity of kernel-based density-ratio estimation: a condition number analysis. Machine Learning, 90(3), 431–460.
- Kanamori, T., Suzuki, T., & Sugiyama, M. (2012). Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86(3), 335–367.
- Keziou, A. (2003). Dual representation of φ-divergences and applications. Comptes Rendus Mathématique, 336, 857–862.
- Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1), 79–86.
- Latinne, P., Saerens, M., & Decaestecker, C. (2001). Adjusting the outputs of a classifier to new a priori probabilities may significantly improve classification accuracy: evidence from a multi-class problem in remote sensing. In Proceedings of the 18th international conference on machine learning (pp. 298–305).
- Lin, Y., Lee, Y., & Wahba, G. (2002). Support vector machines for classification in nonstandard situations. Machine Learning, 46(1), 191–202.
- McLachlan, G. J., & Krishnan, T. (1997). The EM algorithm and extensions. New York, NY, USA: John Wiley and Sons.
- Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11), 5847–5861.
- Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50, 157–175.
- Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., & Lawrence, N. (Eds.) (2009). Dataset shift in machine learning. Cambridge, MA, USA: MIT Press.
- Rockafellar, R. T. (1970). Convex analysis. Princeton, NJ, USA: Princeton University Press.
- Saerens, M., Latinne, P., & Decaestecker, C. (2001). Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Computation, 14, 21–41.
- Schmidt, M. (2005). minFunc—unconstrained differentiable multivariate optimization in MATLAB.
- Silverman, B. W. (1986). Density estimation: for statistics and data analysis. London, UK: Chapman and Hall.
- Sugiyama, M. (2010). Superfast-trainable multi-class probabilistic classifier by least-squares posterior fitting. IEICE Transactions on Information and Systems, E93-D, 2690–2701.
- Sugiyama, M., & Kawanabe, M. (2012). Machine learning in non-stationary environments: introduction to covariate shift adaptation. Cambridge, MA, USA: MIT Press.
- Sugiyama, M., Krauledat, M., & Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8, 985–1005.
- Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density ratio estimation in machine learning. Cambridge, UK: Cambridge University Press.
- Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density ratio matching under the Bregman divergence: a unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5), 1009–1044.
- Sugiyama, M., Suzuki, T., Kanamori, T., du Plessis, M. C., Liu, S., & Takeuchi, I. (2013). Density-difference estimation. Neural Computation, 25(10), 2734–2775.
- Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., & Kawanabe, M. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4), 699–746.
- Titterington, D. (1983). Minimum distance non-parametric estimation of mixture proportions. Journal of the Royal Statistical Society. Series B. Statistical Methodology, 37–46.
- Van Trees, H. (1968). Detection, estimation, and modulation theory, part I. New York, NY, USA: John Wiley and Sons.
- Vapnik, V. N. (1998). Statistical learning theory. New York, NY, USA: Wiley.