# Distribution-Independent PAC Learning of Halfspaces with Massart Noise

Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 4751–4762, 2019.

Keywords:

marginal distribution

Abstract:

We study the problem of distribution-independent PAC learning of halfspaces in the presence of Massart noise. Specifically, we are given a set of labeled examples (x, y) drawn from a distribution D on Rd+1 such that the marginal distribution on the unlabeled points x is arbitrary and the labels y are generated by an unknown halfspace corrupted by Massart noise at noise rate η < 1/2.

Introduction

- In the agnostic model [Hau92, KSS94] – where an adversary is allowed to arbitrarily corrupt an arbitrary η < 1/2 fraction of the labels – even weak learning is known to be computationally intractable [GR06, FGKP06, Dan16].
- In the presence of Random Classification Noise (RCN) [AL88] – where each label is flipped independently with probability exactly η < 1/2 – a polynomial time algorithm is known [BFKV96, BFKV97].
- Let C be a class of Boolean functions over X = Rd, Dx be an arbitrary distribution over X, and 0 ≤ η < 1/2.
- A noisy example oracle, EXMas(f, Dx, η), works as follows: each time EXMas(f, Dx, η) is invoked, it returns a labeled example (x, y), where x ∼ Dx, y = f(x) with probability 1 − η(x), and y = −f(x) with probability η(x), for some unknown function η(x) satisfying η(x) ≤ η.
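The Massart oracle described above is easy to simulate. The following sketch (hypothetical names, not from the paper) draws one labeled example, with the adversary's per-point flip probability η(x) ≤ η supplied as a function:

```python
import numpy as np

def sample_massart(f, sample_x, eta_fn, rng):
    """Simulate one call to the Massart noise oracle EX^Mas(f, D_x, eta).

    f        : target halfspace, mapping R^d -> {+1, -1}
    sample_x : draws x from the (arbitrary) marginal distribution D_x
    eta_fn   : adversary's flip probability eta(x); must satisfy eta(x) <= eta < 1/2
    rng      : numpy random Generator
    """
    x = sample_x(rng)
    y = f(x)
    if rng.random() < eta_fn(x):  # flip the clean label f(x) with probability eta(x)
        y = -y
    return x, y

# Example: a halfspace target with a flip probability that varies with x.
rng = np.random.default_rng(0)
f = lambda x: 1 if x[0] >= 0 else -1
sample_x = lambda rng: rng.normal(size=2)
eta_fn = lambda x: 0.3 if abs(x[0]) < 1 else 0.1  # eta(x) <= eta = 0.3
x, y = sample_massart(f, sample_x, eta_fn, rng)
```

Unlike RCN, where every label is flipped with probability exactly η, here the adversary may choose η(x) separately for each point, which is what makes the model semi-random.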

Highlights

- Halfspaces, or Linear Threshold Functions (LTFs), are Boolean functions f : Rd → {±1} of the form f(x) = sign(⟨w, x⟩ − θ), where w ∈ Rd is the weight vector and θ ∈ R is the threshold. (The function sign : R → {±1} is defined as sign(u) = 1 if u ≥ 0 and sign(u) = −1 otherwise.) The problem of learning an unknown halfspace is as old as the field of machine learning — starting with Rosenblatt’s Perceptron algorithm [Ros58] — and has arguably been the most influential problem in the development of the field.
- We focus on learning halfspaces with Massart noise [MN06]: Definition 1.1 (Massart Noise Model)
- The most obvious open problem is whether this error guarantee can be improved to f(OPT) + ε (for some function f : R → R such that limx→0 f(x) = 0) or, ideally, to OPT + ε
- It is a plausible conjecture that obtaining better error guarantees is computationally intractable. This is left as an interesting open problem for future work. Another open question is whether there is an efficient proper learner matching the error guarantees of our algorithm
- What other concept classes admit non-trivial algorithms in the Massart noise model? Can one establish non-trivial reductions between the Massart noise model and the agnostic model? And are there other natural semi-random input models that allow for efficient PAC learning algorithms in the distribution-free setting?
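The LTF definition in the highlights translates directly into code. A minimal sketch (the names and the example weight vector are illustrative, not from the paper):

```python
import numpy as np

def halfspace(w, theta):
    """Return the LTF f(x) = sign(<w, x> - theta), with the convention sign(0) = +1."""
    def f(x):
        return 1 if np.dot(w, x) - theta >= 0 else -1
    return f

# A halfspace in R^2 with weight vector w = (2, -1) and threshold theta = 0.5.
f = halfspace(np.array([2.0, -1.0]), theta=0.5)
f(np.array([1.0, 0.0]))  # <w, x> - theta = 1.5 >= 0, so the label is +1
```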

Results

- The main result of this paper is the following: Theorem 1.2 (Main Result). There is an algorithm that for all 0 < η < 1/2, on input a set of i.i.d. examples from a distribution D = EXMas(f, Dx, η) on Rd+1, where f is an unknown halfspace on Rd, runs in poly(d, b, 1/ε) time, where b is an upper bound on the bit complexity of the examples, and outputs a hypothesis h that with high probability satisfies Pr(x,y)∼D[h(x) ≠ y] ≤ η + ε.

See Theorem 2.9 for a more detailed formal statement.
- For large-margin halfspaces, the authors obtain a slightly better error guarantee; see Theorem 2.2 and Remark 2.6.
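The guarantee Pr(x,y)∼D[h(x) ≠ y] ≤ η + ε is stated with respect to the noisy labels y. Estimating this quantity from a finite sample is straightforward; a minimal sketch with illustrative names:

```python
import numpy as np

def misclassification_error(h, examples):
    """Empirical estimate of Pr_{(x,y)~D}[h(x) != y] over a list of labeled examples."""
    return float(np.mean([h(x) != y for x, y in examples]))

# A hypothesis that always predicts +1, scored on four labeled points.
h = lambda x: 1
examples = [((0.0,), 1), ((1.0,), 1), ((2.0,), -1), ((3.0,), 1)]
misclassification_error(h, examples)  # wrong on 1 of 4 points, i.e., error 0.25
```

By a standard Hoeffding bound, O(log(1/δ)/ε²) samples suffice for this empirical estimate to be within ε of the true error with probability at least 1 − δ.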

Conclusion

- The authors note that the algorithm is non-proper, i.e., the hypothesis h itself is not a halfspace.
- (See Section 1.2 for a discussion.) The main contribution of this paper is the first non-trivial learning algorithm for the class of halfspaces in the distribution-free PAC model with Massart noise.
- It is a plausible conjecture that obtaining better error guarantees is computationally intractable.
- This is left as an interesting open problem for future work.
- Another open question is whether there is an efficient proper learner matching the error guarantees of the algorithm.
- What other concept classes admit non-trivial algorithms in the Massart noise model? Can one establish non-trivial reductions between the Massart noise model and the agnostic model? And are there other natural semi-random input models that allow for efficient PAC learning algorithms in the distribution-free setting?

Summary

## Objectives:

The authors' goal is to design a poly(d, 1/ε, 1/γ) time learning algorithm in the presence of Massart noise.
- In particular, the goal is to output a hypothesis classifier h with low misclassification error.

Related work

- Bylander [Byl94] gave a polynomial time algorithm to learn large-margin halfspaces with RCN (under an additional anti-concentration assumption). The work of Blum et al. [BFKV96, BFKV97] gave the first polynomial time algorithm for distribution-independent learning of halfspaces with RCN without any margin assumptions. Soon thereafter, [Coh97] gave a polynomial-time proper learning algorithm for the problem. Subsequently, Dunagan and Vempala [DV04b] gave a rescaled perceptron algorithm for solving linear programs, which translates to a significantly simpler and faster proper learning algorithm.

The term “Massart noise” was coined after [MN06]. An equivalent version of the model was previously studied by Rivest and Sloan [Slo88, Slo92, RS94, Slo96], and a very similar asymmetric random noise model goes back to Vapnik [Vap82]. Prior to this work, essentially no efficient algorithms with non-trivial error guarantees were known in the distribution-free Massart noise model. It should be noted that polynomial time algorithms with error OPT + ε are known [ABHU15, ZLC17, YZ17] when the marginal distribution on the unlabeled data is uniform on the unit sphere. For the case that the unlabeled data comes from an isotropic log-concave distribution, [ABHZ16] give a d^(2^poly(1/(1−2η))) / poly(ε) sample and time algorithm.

Funding

- Ilias Diakonikolas is supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship.

Reference

- [ABHU15] P. Awasthi, M. F. Balcan, N. Haghtalab, and R. Urner. Efficient learning of linear separators under bounded noise. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, pages 167–190, 2015.
- [ABHZ16] P. Awasthi, M. F. Balcan, N. Haghtalab, and H. Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, pages 152–192, 2016.
- [ABL17] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6):50:1–50:27, 2017.
- [AL88] D. Angluin and P. Laird. Learning from noisy examples. Mach. Learn., 2(4):343–370, 1988.
- [Ber06] T. Bernholt. Robust estimators are hard to compute. Technical report, University of Dortmund, Germany, 2006.
- [BFKV96] A. Blum, A. M. Frieze, R. Kannan, and S. Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. In 37th Annual Symposium on Foundations of Computer Science, FOCS ’96, pages 330–338, 1996.
- [BFKV97] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial time algorithm for learning noisy linear threshold functions. Algorithmica, 22(1/2):35–52, 1997.
- [Blu03] A. Blum. Machine learning: My favorite results, directions, and open problems. In 44th Symposium on Foundations of Computer Science (FOCS 2003), pages 11–14, 2003.
- [Byl94] T. Bylander. Learning linear threshold functions in the presence of classification noise. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, COLT 1994, pages 340–347, 1994.
- [Coh97] E. Cohen. Learning noisy perceptrons by a perceptron in polynomial time. In Proceedings of the Thirty-Eighth Symposium on Foundations of Computer Science, pages 514–521, 1997.
- [Dan16] A. Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the 48th Annual Symposium on Theory of Computing, STOC 2016, pages 105–117, 2016.
- [DKK+16] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high dimensions without the computational intractability. In Proceedings of FOCS’16, pages 655–664, 2016.
- [DKK+17] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, pages 999–1008, 2017.
- [DKK+18] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robustly learning a gaussian: Getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, pages 2683–2702, 2018.
- [DKK+19] I. Diakonikolas, G. Kamath, D. Kane, J. Li, J. Steinhardt, and A. Stewart. Sever: A robust meta-algorithm for stochastic optimization. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, pages 1596–1606, 2019.
- [DKS18] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 1061–1073, 2018.
- [DKS19] I. Diakonikolas, W. Kong, and A. Stewart. Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, pages 2745–2754, 2019.
- [DKW56] A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Mathematical Statistics, 27(3):642–669, 1956.
- [Duc16] J. C. Duchi. Introductory lectures on stochastic convex optimization. Park City Mathematics Institute, Graduate Summer School Lectures, 2016.
- [DV04a] J. Dunagan and S. Vempala. Optimal outlier removal in high-dimensional spaces. J. Computer & System Sciences, 68(2):335–373, 2004.
- [DV04b] J. Dunagan and S. Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pages 315–320, 2004.
- [FGKP06] V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. New results for learning noisy parities and halfspaces. In Proc. FOCS, pages 563–576, 2006.
- [GR06] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. In Proc. 47th IEEE Symposium on Foundations of Computer Science (FOCS), pages 543–552. IEEE Computer Society, 2006.
- [Hau92] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.
- [Kea93] M. J. Kearns. Efficient noise-tolerant learning from statistical queries. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, pages 392–401, 1993.
- [Kea98] M. J. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, 1998.
- [KKM18] A. R. Klivans, P. K. Kothari, and R. Meka. Efficient algorithms for outlier-robust regression. In Conference On Learning Theory, COLT 2018, pages 1420–1430, 2018.
- [KLS09] A. Klivans, P. Long, and R. Servedio. Learning halfspaces with malicious noise. In Proc. 36th International Colloquium on Automata, Languages and Programming (ICALP), 2009.
- [KSS94] M. Kearns, R. Schapire, and L. Sellie. Toward Efficient Agnostic Learning. Machine Learning, 17(2/3):115–141, 1994.
- [LRV16] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In Proceedings of FOCS’16, 2016.
- [LS10] P. M. Long and R. A. Servedio. Random classification noise defeats all convex potential boosters. Machine Learning, 78(3):287–304, 2010.
- [MN06] P. Massart and E. Nedelec. Risk bounds for statistical learning. Ann. Statist., 34(5):2326– 2366, 10 2006.
- [MT94] W. Maass and G. Turan. How fast can a threshold gate learn? In S. Hanson, G. Drastal, and R. Rivest, editors, Computational Learning Theory and Natural Learning Systems, pages 381–414. MIT Press, 1994.
- [Ros58] F. Rosenblatt. The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
- [RS94] R. Rivest and R. Sloan. A formal model of hierarchical concept learning. Information and Computation, 114(1):88–114, 1994.
- [Slo88] R. H. Sloan. Types of noise in data for concept learning. In Proceedings of the First Annual Workshop on Computational Learning Theory, COLT ’88, pages 91–96, San Francisco, CA, USA, 1988. Morgan Kaufmann Publishers Inc.
- [Slo92] R. H. Sloan. Corrigendum to types of noise in data for concept learning. In Proceedings of the Fifth Annual ACM Conference on Computational Learning Theory, COLT 1992, page 450, 1992.
- [Slo96] R. H. Sloan. Pac Learning, Noise, and Geometry, pages 21–41. Birkhäuser Boston, Boston, MA, 1996.
- [Val84] L. G. Valiant. A theory of the learnable. In Proc. 16th Annual ACM Symposium on Theory of Computing (STOC), pages 436–445. ACM Press, 1984.
- [Vap82] V. Vapnik. Estimation of Dependences Based on Empirical Data: Springer Series in Statistics. Springer-Verlag, Berlin, Heidelberg, 1982.
- [YZ17] S. Yan and C. Zhang. Revisiting perceptron: Efficient and label-optimal learning of halfspaces. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 1056–1066, 2017.
- [ZLC17] Y. Zhang, P. Liang, and M. Charikar. A hitting time analysis of stochastic gradient langevin dynamics. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, pages 1980–2022, 2017.

Best Paper

Best Paper of NeurIPS, 2019
