# Towards a Combinatorial Characterization of Bounded-Memory Learning

NeurIPS 2020.

Abstract:

Combinatorial dimensions play an important role in the theory of machine learning. For example, VC dimension characterizes PAC learning, SQ dimension characterizes weak learning with statistical queries, and Littlestone dimension characterizes online learning. In this paper we aim to develop combinatorial dimensions that characterize bounded memory learning.

Introduction

- Characterization of different learning tasks using a combinatorial condition has been investigated in depth in machine learning.
- If the class C is PAC-learnable under P with accuracy 0.99 using b bits and m samples, then for every distribution Q ∈ P_{Θ(1)}(P) its SQ dimension is bounded by SQ_Q(C) ≤ max(poly(m), 2^{O(√b)}).
- A class C under a distribution P is learnable with bounded memory with accuracy 1 − ε if there is a learning algorithm that uses only m = (|C|/ε)^{o(1)} samples and b = o(log |C| · (log |X| + log(1/ε))) bits.
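The SQ dimension bounds above can be made concrete with a small brute-force sketch. The helper name `sq_dimension` and the exact formulation below are illustrative assumptions (one common version of the definition from [6]: the largest d such that some d concepts have pairwise correlations of magnitude at most 1/d under the distribution), not the paper's code.

```python
import itertools

def sq_dimension(concepts, p):
    """Brute-force SQ dimension of a finite class under distribution p,
    using one common formulation: the largest d such that some d concepts
    f_1, ..., f_d in the class satisfy |E_p[f_i * f_j]| <= 1/d for i != j.
    `concepts` is a list of +/-1 vectors over the domain; `p` is a
    probability vector over the same domain.  Exponential time: toy use only."""
    n = len(concepts)
    # Pairwise correlations under p: corr[i][j] = sum_x p(x) f_i(x) f_j(x).
    corr = [[sum(px * fi * fj for px, fi, fj in zip(p, ci, cj))
             for cj in concepts] for ci in concepts]
    best = 1
    for d in range(2, n + 1):
        # If some (d+1)-subset is a witness, each of its d-subsets is one
        # too (since 1/(d+1) <= 1/d), so stop at the first d with no witness.
        if any(all(abs(corr[i][j]) <= 1.0 / d
                   for i, j in itertools.combinations(subset, 2))
               for subset in itertools.combinations(range(n), d)):
            best = d
        else:
            break
    return best
```

For instance, the four parity functions over {0,1}² under the uniform distribution have all pairwise correlations equal to zero, so the sketch reports dimension 4.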

Highlights

- Characterization of different learning tasks using a combinatorial condition has been investigated in depth in machine learning
- Learning a class in an unconstrained fashion is characterized by a finite VC dimension [8,38], and weakly learning in the statistical query (SQ) framework is characterized by a small SQ dimension [6]
- Is there a simple combinatorial condition that characterizes learnability with bounded memory? In this paper we propose a candidate condition, prove upper and lower bounds that match in part of the parameter regime, and conjecture that they match in a much wider regime of parameters
- If the class C is PAC-learnable under P with accuracy 0.99 using b bits and m samples, then for every distribution Q ∈ P_{Θ(1)}(P) its SQ dimension is bounded by SQ_Q(C) ≤ max(poly(m), 2^{O(√b)})
- We can transform any improper learner into a proper learner without significantly increasing either the sample or the space complexity
- We prove similar conditions for SQ learning, implying equivalence between bounded memory learning and SQ learning for small enough ε
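The improper-to-proper transformation in the highlights above can be sketched as follows (a minimal illustration with a hypothetical `properize` helper, simplifying the idea, not the paper's exact procedure): run the improper learner to obtain a hypothesis h, then stream over C and keep the concept agreeing with h on the most of O(log|C|/ε²) fresh examples.

```python
import math
import random

def properize(improper_h, concepts, sample_x, eps):
    """Turn an improper hypothesis into a proper one: test every concept in
    `concepts` against improper_h on O(log|C| / eps^2) fresh unlabeled
    examples and return the concept that agrees most often.  Streaming over
    the class keeps the extra memory to one counter plus the current best."""
    m = max(1, int(math.log(max(len(concepts), 2)) / eps ** 2))
    xs = [sample_x() for _ in range(m)]
    hx = [improper_h(x) for x in xs]
    best, best_agree = None, -1
    for c in concepts:
        agree = sum(c(x) == y for x, y in zip(xs, hx))
        if agree > best_agree:
            best, best_agree = c, agree
    return best
```

If h has error at most ε and the best concept in C is tested on enough examples, the returned concept agrees with h on a 1 − 2ε fraction of the distribution with high probability, which keeps the overall error small.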

Results

- The authors state the main results for a combinatorial characterization of bounded memory PAC learning in terms of the SQ dimension of distributions close to the underlying distribution.
- There exists an algorithm that learns the class C with accuracy 1 − ε under the distribution P using b = O(log(d/ε) · log |C|) bits and m = poly(d/ε) · log |C| · log log |C| samples.
- Recall that the class is bounded memory learnable if there is a learning algorithm with sample complexity m = N^{o(1)} and space complexity b = o(log² N).
- For any ε, the class C is bounded memory learnable under distribution P with accuracy 1 − ε ⇐⇒ ∀Q ∈ P_{poly(1/ε)}(P), SQ_Q(C) ≤ poly(1/ε).
- There exists an SQ-learner that learns the class C with accuracy 1 − ε under the distribution P using q = poly(d/ε) statistical queries with tolerance τ ≥ poly(ε/d).
- In Section 3 the authors construct learning algorithms based on the assumption that close distributions have bounded SQ dimensions, and prove Theorem 5 and Theorem 8.
- To prove Theorem 6, the authors would like to use a recent result by [17] that establishes an upper bound on SQ_Q(C) given a memory-efficient learner.

Conclusion

- These claims establish that if a class C is learnable with bounded memory under distribution Q, then the statistical dimension SQ_Q(C) is low.
- Assume that the concept class C can be learned with accuracy 1 − 0.1ε, m samples, and b bits under distribution P.
- Lemma 17 states that any probability distribution Q that is (1/ε)-close to P can be learned with accuracy 0.9, O(m/ε²) samples, and b bits.
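One standard tool behind such closeness arguments is rejection sampling: given sample access to one distribution and a pointwise density-ratio bound p(x) ≤ c · q(x), samples from P can be simulated from Q-samples at an expected factor-c cost in sample complexity. The sketch below uses hypothetical helper names and is a generic illustration of the technique, not the paper's proof of Lemma 17.

```python
import random

def rejection_sample(sample_q, c, p_density, q_density):
    """Draw one sample distributed as P using sample access to Q, assuming
    the pointwise density-ratio bound p(x) <= c * q(x).  A Q-draw x is
    accepted with probability p(x) / (c * q(x)); accepted draws are exactly
    P-distributed, and the expected number of Q-draws per output is c."""
    while True:
        x = sample_q()
        if random.random() < p_density(x) / (c * q_density(x)):
            return x
```

With c = 1/ε-style closeness, the multiplicative sample overhead of the simulation is what produces blowups of the O(m/ε²) form.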

Related work

- Characterization of bounded memory learning. Many works have proved lower bounds under memory constraints [3, 10, 11, 16, 17, 22, 24, 25, 27, 28, 32, 33]. Some of these works even provide a necessary condition for learnability with bounded memory. As for upper bounds, few works have tried to give a general property that implies learnability under memory constraints. One work suggested such a property [26], but it did not lead to a full characterization of bounded memory learning.

- Statistical query learning. After Kearns's introduction of statistical queries [20], Blum et al. [6] characterized weak learnability using the SQ dimension. Specifically, if SQ_P(C) = d, then poly(d) queries are both needed and sufficient to learn with accuracy 1/2 + poly(1/d). Note that the advantage is very small, only poly(1/d). Subsequently, several works [2, 12, 34, 36] suggested characterizations of strong SQ learnability.

Reference

- Javed A Aslam and Scott E Decatur. General bounds on statistical query learning and PAC learning with noise via hypothesis boosting. In Proceedings of 1993 IEEE 34th Annual Foundations of Computer Science, pages 282–291. IEEE, 1993.
- Jose L Balcazar, Jorge Castro, David Guijarro, Johannes Kobler, and Wolfgang Lindner. A general dimension for query learning. Journal of Computer and System Sciences, 73(6):924–940, 2007.
- Paul Beame, Shayan Oveis Gharan, and Xin Yang. Time-space tradeoffs for learning finite functions from random evaluations, with applications to polynomials. In Conference On Learning Theory, pages 843–856, 2018.
- Shai Ben-David, Tyler Lu, and David Pal. Does unlabeled data provably help? worst-case analysis of the sample complexity of semi-supervised learning. In COLT, pages 33–44, 2008.
- Gyora M Benedek and Alon Itai. Learnability with respect to fixed distributions. Theoretical Computer Science, 86(2):377–389, 1991.
- Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, and Steven Rudich. Weakly learning DNF and characterizing statistical query learning using fourier analysis. In STOC, volume 94, pages 253–262, 1994.
- Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundations of data science. Vorabversion eines Lehrbuchs, 2016.
- Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
- Nader H Bshouty and Dmitry Gavinsky. On boosting with polynomially bounded distributions. Journal of Machine Learning Research, 3(Nov):483–506, 2002.
- Yuval Dagan, Gil Kur, and Ohad Shamir. Space lower bounds for linear prediction in the streaming model. In Conference on Learning Theory, pages 929–954, 2019.
- Yuval Dagan and Ohad Shamir. Detecting correlations with little memory and communication. In Conference On Learning Theory, pages 1145–1198, 2018.
- Vitaly Feldman. A complete characterization of statistical query learning with applications to evolvability. Journal of Computer and System Sciences, 78(5):1444–1459, 2012.
- Yoav Freund. An improved boosting algorithm and its implications on learning complexity. In Proceedings of the fifth annual workshop on Computational learning theory, pages 391–398. ACM, 1992.
- Yoav Freund. Boosting a weak learning algorithm by majority. Information and computation, 121(2):256– 285, 1995.
- Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
- Sumegha Garg, Ran Raz, and Avishay Tal. Extractor-based time-space lower bounds for learning. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 990–1002. ACM, 2018.
- Sumegha Garg, Ran Raz, and Avishay Tal. Time-space lower bounds for two-pass learning. In 34th Computational Complexity Conference (CCC 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
- Russell Impagliazzo. Hard-core distributions for somewhat hard problems. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 538–545. IEEE, 1995.
- Jeffrey C Jackson. The harmonic sieve: A novel application of fourier analysis to machine learning theory and practice. Technical report, Carnegie Mellon University Pittsburgh School Of Computer Science, 1995.
- Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.
- Adam R Klivans and Rocco A Servedio. Boosting and hard-core sets. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039), pages 624–633. IEEE, 1999.
- Gillat Kol, Ran Raz, and Avishay Tal. Time-space hardness of learning sparse parities. In Proc. 49th ACM Symp. on Theory of Computing, 2017.
- Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine learning, 2(4):285–318, 1988.
- Dana Moshkovitz and Michal Moshkovitz. Mixing implies lower bounds for space bounded learning. In Conference on Learning Theory, pages 1516–1566, 2017.
- Dana Moshkovitz and Michal Moshkovitz. Entropy samplers and strong generic lower bounds for space bounded learning. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.
- Michal Moshkovitz and Naftali Tishby. A general memory-bounded learning algorithm. arXiv preprint arXiv:1712.03524, 2017.
- Ran Raz. Fast learning requires good memory: A time-space lower bound for parity learning. In Proc. 57th IEEE Symp. on Foundations of Computer Science, 2016.
- Ran Raz. A time-space lower bound for a large class of learning problems. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 732–742. IEEE, 2017.
- Sivan Sabato, Nathan Srebro, and Naftali Tishby. Distribution-dependent sample complexity of large margin learning. The Journal of Machine Learning Research, 14(1):2119–2149, 2013.
- Robert E Schapire. The strength of weak learnability. Machine learning, 5(2):197–227, 1990.
- Robert E Schapire. The design and analysis of efficient learning algorithms. Technical report, Massachusetts Inst Of Tech Cambridge Lab For Computer Science, 1991.
- O. Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS’14, pages 163–171, 2014.
- Vatsal Sharan, Aaron Sidford, and Gregory Valiant. Memory-sample tradeoffs for linear regression with small error. arXiv preprint arXiv:1904.08544, 2019.
- Hans Ulrich Simon. A characterization of strong learnability in the statistical query model. In Annual Symposium on Theoretical Aspects of Computer Science, pages 393–404.
- Jacob Steinhardt, Gregory Valiant, and Stefan Wager. Memory, communication, and statistical queries. In Conference on Learning Theory, pages 1490–1516, 2016.
- Balazs Szorenyi. Characterizing statistical query learning: simplified notions and proofs. In International Conference on Algorithmic Learning Theory, pages 186–200.
- Leslie G Valiant. A theory of the learnable. In Proceedings of the sixteenth annual ACM symposium on Theory of computing, pages 436–445. ACM, 1984.
- Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of complexity, pages 11–30.
- Nicolas Vayatis and Robert Azencott. Distribution-dependent vapnik-chervonenkis bounds. In European Conference on Computational Learning Theory, pages 230–240.
- Ke Yang. On learning correlated boolean functions using statistical queries. In International Conference on Algorithmic Learning Theory, pages 59–76.
- Ke Yang. New lower bounds for statistical query learning. Journal of Computer and System Sciences, 70(4):485–509, 2005.
