# Competitive Distribution Estimation: Why is Good-Turing Good

Annual Conference on Neural Information Processing Systems (NIPS), 2015

Abstract

Estimating distributions over large alphabets is a fundamental machine-learning tenet. Yet no method is known to estimate all distributions well. For example, add-constant estimators are nearly min-max optimal but often perform poorly in practice, and practical estimators such as absolute discounting, Jelinek-Mercer, and Good-Turing are not well understood and lack provable performance guarantees.

Introduction

- 1.1 Background

- Many learning applications, ranging from language-processing staples such as speech recognition and machine translation to biological studies in virology and bioinformatics, call for estimating large discrete distributions from their samples.
- Min-max performance can be viewed as regret relative to an oracle that knows the underlying distribution.
- The second estimator is designed with exact knowledge of the distribution but, like all natural estimators, is forced to assign the same probabilities to symbols appearing the same number of times.

Highlights

- 1.1 Background

- Many learning applications, ranging from language-processing staples such as speech recognition and machine translation to biological studies in virology and bioinformatics, call for estimating large discrete distributions from their samples.
- Probability estimation over large alphabets has long been the subject of extensive research, both by practitioners deriving practical estimators [1, 2] and by theorists searching for optimal estimators [3].
- The add-constant estimators frequently analyzed by theoreticians are nearly min-max optimal, yet perform poorly for many practical distributions, while common practical estimators, such as absolute discounting [4], Jelinek-Mercer [5], and Good-Turing [6], are not well understood and lack provable performance guarantees.
- The most natural and important collection of distributions, and the one we study here, is the set of all discrete distributions over an alphabet of some size k, which without loss of generality we assume to be [k] = {1, 2, ..., k}.
- We show that certain variations of the Good-Turing estimators, designed without any prior knowledge, approach the performance of both prior-knowledge estimators for every underlying distribution.
- Since Laplace is the optimal estimator when the underlying distribution is generated from the uniform prior, it performs well in Figure 2(e) but poorly on other distributions.

Results

- The authors first define the performance of an oracle-aided estimator, designed with some knowledge of the underlying distribution.
- Considering the collection ∆k of all distributions over [k], it follows that as the authors start with the single-part partition {∆k} and keep refining it until the oracle knows p, the competitive regret of estimators increases from 0 to r_n(q, ∆k).
- The authors' second comparison is with an estimator designed with exact knowledge of p, but forced to be natural, namely, to assign the same probability to all symbols appearing the same number of times in the sample.
- By carefully constructing distribution classes, the authors lower bound the competitive regret relative to the oracle-aided estimators.
- Observe that while for some probability multisets the regret approaches the log(k/n) min-max upper bound, for others it is much lower; for some, such as the uniform distribution over 1 or over k symbols, where the probability multiset determines the distribution, it is even 0.
- By [13, Lemmas 10 and 11], for symbols appearing t times, if φ_{t+1} ≥ Ω(t), the Good-Turing estimate is close to the underlying total probability mass; otherwise the empirical estimate is closer.
- The Laplace estimator, β_t^L = 1 for all t, minimizes the expected loss when the underlying distribution is generated by a uniform prior over ∆k.
- The Krichevsky-Trofimov estimator, β_t^KT = 1/2 for all t, is asymptotically min-max optimal for the cumulative regret and minimizes the expected loss when the underlying distribution is generated according to a Dirichlet-1/2 prior.
- The estimator that assigns each symbol x the probability S_{N_x}/φ_{N_x}, the combined probability mass of symbols appearing N_x times divided by the number of such symbols, achieves the lowest loss of any natural estimator designed with knowledge of the underlying distribution.
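The Good-Turing idea behind the lemma cited above can be sketched in a few lines (a minimal illustration with assumed names, not the paper's modified estimator): the total probability mass of symbols appearing exactly t times is estimated as (t+1)·φ_{t+1}/n, where φ_t counts distinct symbols seen t times.

```python
from collections import Counter

def good_turing_masses(sample):
    """Basic Good-Turing estimate of M_t, the total probability mass of
    symbols appearing exactly t times: M_t ~ (t + 1) * phi_{t+1} / n,
    where phi_t is the number of distinct symbols seen exactly t times.
    """
    n = len(sample)
    counts = Counter(sample)            # symbol -> multiplicity
    phi = Counter(counts.values())      # phi[t] = # symbols seen t times
    return {t: (t + 1) * phi.get(t + 1, 0) / n
            for t in set(counts.values()) | {0}}

sample = [1, 1, 2, 2, 3, 4, 5]
m = good_turing_masses(sample)
# m[0] estimates the unseen mass: 1 * phi_1 / n = 3/7
```

Note the condition in the bullet above: when φ_{t+1} is small relative to t, this estimate becomes unreliable and the empirical mass t·φ_t/n is preferred, which is what motivates the hybrid variations studied in the paper.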

Conclusion

- Since Laplace is the optimal estimator when the underlying distribution is generated from the uniform prior, it performs well in Figure 2(e) but poorly on other distributions.
- Even though all the estimators have similar-looking regrets for distributions generated by Dirichlet priors (Figures 2(e), 2(f)), the proposed estimator performs better than estimators not designed for that prior.
- The authors relate the regret in estimating the distribution to that of estimating the combined, or total, probability mass of symbols appearing a given number of times.
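The natural-oracle benchmark described in the Results can be sketched as follows (a hypothetical illustration; names and structure are assumptions, not the paper's code): even knowing the true distribution, a natural estimator must give every symbol with the same multiplicity the same probability, so the best it can do is split the true combined mass S_t evenly among the φ_t symbols seen t times.

```python
from collections import Counter

def natural_oracle_estimate(sample, true_p):
    """Best *natural* estimator given oracle knowledge of true_p: every
    symbol seen t times receives S_t / phi_t, where S_t is the true
    combined mass of symbols seen t times and phi_t is how many such
    symbols there are -- natural estimators must treat them identically.
    """
    counts = Counter(sample)
    groups = {}                  # multiplicity t -> symbols seen t times
    for x in true_p:
        groups.setdefault(counts.get(x, 0), []).append(x)
    est = {}
    for t, symbols in groups.items():
        s_t = sum(true_p[x] for x in symbols)   # combined true mass S_t
        for x in symbols:
            est[x] = s_t / len(symbols)         # shared equally
    return est

# Symbols 1, 2, 3 each appear once, so they must share S_1 = 0.9 equally,
# even though their true probabilities 0.4, 0.3, 0.2 differ.
est = natural_oracle_estimate([1, 2, 3], {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1})
```

The gap between this oracle's loss and the true distribution's entropy-optimal loss is exactly the price of naturalness; the paper's result is that Good-Turing variants approach this oracle without knowing true_p.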

References

- William A. Gale and Geoffrey Sampson. Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2(3):217–237, 1995.
- S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In ACL, 1996.
- Liam Paninski. Variational minimax estimation of discrete distributions under KL loss. In NIPS, 2004.
- Hermann Ney, Ute Essen, and Reinhard Kneser. On structuring probabilistic dependences in stochastic language modelling. Computer Speech & Language, 8(1):1–38, 1994.
- Fredrick Jelinek and Robert L. Mercer. Probability distribution estimation from sparse data. IBM Tech. Disclosure Bull., 1984.
- Irving J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4):237–264, 1953.
- Thomas M. Cover and Joy A. Thomas. Elements of information theory (2. ed.). Wiley, 2006.
- R. Krichevsky. Universal Compression and Retrieval. Dordrecht, The Netherlands: Kluwer, 1994.
- Sudeep Kamath, Alon Orlitsky, Dheeraj Pichapati, and Ananda Theertha Suresh. On learning distributions from their samples. In COLT, 2015.
- Dietrich Braess and Thomas Sauer. Bernstein polynomials and learning theory. Journal of Approximation Theory, 128(2):187–206, 2004.
- David A. McAllester and Robert E. Schapire. On the convergence rate of Good-Turing estimators. In COLT, 2000.
- Evgeny Drukh and Yishay Mansour. Concentration bounds for unigrams language model. In COLT, 2004.
- Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Optimal probability estimation with applications to prediction and classification. In COLT, 2013.
- Alon Orlitsky, Narayana P. Santhanam, and Junan Zhang. Always Good Turing: Asymptotically optimal probability estimation. In FOCS, 2003.
- Boris Yakovlevich Ryabko. Twice-universal coding. Problemy Peredachi Informatsii, 1984.
- Boris Yakovlevich Ryabko. Fast adaptive coding algorithm. Problemy Peredachi Informatsii, 26(4):24–
- Dominique Bontemps, Stephane Boucheron, and Elisabeth Gassiat. About adaptive coding on countable alphabets. IEEE Transactions on Information Theory, 60(2):808–821, 2014.
- Stephane Boucheron, Elisabeth Gassiat, and Mesrob I. Ohannessian. About adaptive coding on countable alphabets: Max-stable envelope classes. CoRR, abs/1402.6305, 2014.
- David L Donoho and Jain M Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.
- Felix Abramovich, Yoav Benjamini, David L Donoho, and Iain M Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. The Annals of Statistics, 2006.
- Peter J. Bickel, Chris A. Klaassen, Ya'acov Ritov, and Jon A. Wellner. Efficient and adaptive estimation for semiparametric models. Johns Hopkins University Press, Baltimore, 1993.
- Andrew Barron, Lucien Birge, and Pascal Massart. Risk bounds for model selection via penalization. Probability theory and related fields, 113(3):301–413, 1999.
- Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2004.
- Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, and Shengjun Pan. Competitive closeness testing. COLT, 2011.
- Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, Shengjun Pan, and Ananda Theertha Suresh. Competitive classification and closeness testing. In COLT, 2012.
- Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. A competitive test for uniformity of monotone distributions. In AISTATS, 2013.
- Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. In FOCS, 2014.
- Gregory Valiant and Paul Valiant. Instance optimal learning. CoRR, abs/1504.05321, 2015.
- Michael Mitzenmacher and Eli Upfal. Probability and computing: Randomized algorithms and probabilistic analysis. Cambridge University Press, 2005.
