Competitive Distribution Estimation: Why is Good-Turing Good

Annual Conference on Neural Information Processing Systems (NIPS), 2015

Abstract

Estimating distributions over large alphabets is a fundamental machine-learning tenet. Yet no method is known to estimate all distributions well. For example, add-constant estimators are nearly min-max optimal but often perform poorly in practice, and practical estimators such as absolute discounting, Jelinek-Mercer, and Good-Turing are n…

Introduction
  • 1.1 Background

    Many learning applications, ranging from language-processing staples such as speech recognition and machine translation to biological studies in virology and bioinformatics, call for estimating large discrete distributions from their samples.
  • Min-max performance can be viewed as regret relative to an oracle that knows the underlying distribution.
  • The first oracle knows the multiset of probability values of the underlying distribution; the second estimator is designed with exact knowledge of the distribution but, like all natural estimators, is forced to assign the same probability to all symbols appearing the same number of times.
Highlights
  • Probability estimation over large alphabets has long been the subject of extensive research, both by practitioners deriving practical estimators [1, 2], and by theorists searching for optimal estimators [3]
  • The add-constant estimators frequently analyzed by theoreticians are nearly min-max optimal, yet perform poorly for many practical distributions, while common practical estimators, such as absolute discounting [4], Jelinek-Mercer [5], and Good-Turing [6], are not well understood and lack provable performance guarantees
  • The most natural and important collection of distributions, and the one we study here, is the set of all discrete distributions over an alphabet of some size k, which without loss of generality we assume to be [k] = {1, 2, ..., k}
  • We show that certain variations of the Good-Turing estimators, designed without any prior knowledge, approach the performance of both prior-knowledge estimators for every underlying distribution
Results
  • The authors first define the performance of an oracle-aided estimator, designed with some knowledge of the underlying distribution.
  • Considering the collection ∆_k of all distributions over [k], it follows that as the authors start with the single-part partition {∆_k} and keep refining it until the oracle knows p, the competitive regret of an estimator q increases from 0 to r_n(q, ∆_k).
  • The authors' second comparison is with an estimator designed with exact knowledge of p, but forced to be natural, namely, to assign the same probability to all symbols appearing the same number of times in the sample.
  • By carefully constructing distribution classes, the authors lower bound the competitive regret relative to the oracle-aided estimators.
  • Observe that while for some probability multisets the regret approaches the log(k/n) min-max upper bound, for other probability multisets it is much lower, and for some, such as the uniform distribution over 1 or over k symbols, where the probability multiset determines the distribution, it is even 0.
  • By [13, Lemmas 10 and 11], for symbols appearing t times, if φ_{t+1} ≥ Ω(t), then the Good-Turing estimate is close to the underlying total probability mass; otherwise the empirical estimate is closer.
  • The Laplace estimator, β_t^L = 1 for all t, minimizes the expected loss when the underlying distribution is generated by a uniform prior over ∆_k.
  • The Krichevsky-Trofimov estimator, β_t^{KT} = 1/2 for all t, is asymptotically min-max optimal for the cumulative regret, and minimizes the expected loss when the underlying distribution is generated according to a Dirichlet-1/2 prior.
  • The natural estimator that assigns probability S_{N_x}/φ_{N_x} to each symbol x achieves the lowest loss of any natural estimator designed with knowledge of the underlying distribution; a minimal sketch of these estimators follows this list.
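To make the estimators above concrete, here is a minimal Python sketch of the three families this section compares: add-constant estimators (Laplace with β_t = 1, Krichevsky-Trofimov with β_t = 1/2), a Good-Turing/empirical combination that uses the Good-Turing estimate for counts t with large φ_{t+1} and the empirical estimate otherwise, and the oracle-aided natural estimator S_{N_x}/φ_{N_x}. The switching threshold c, the handling of unseen symbols, and the final renormalization are simplifying assumptions for illustration, not the paper's exact construction.

```python
# Hedged sketch: simple "natural" estimators over an alphabet [k] = {1, ..., k},
# built from a sample of size n.  Symbols appearing the same number of times
# always receive the same probability.

from collections import Counter


def add_constant(counts, k, n, beta):
    """Add-beta estimator: q(x) = (N_x + beta) / (n + k * beta).

    beta = 1 gives Laplace, beta = 1/2 gives Krichevsky-Trofimov."""
    return {x: (counts.get(x, 0) + beta) / (n + k * beta)
            for x in range(1, k + 1)}


def good_turing_empirical(counts, k, n, c=1.0):
    """Per-count estimate that follows Good-Turing when phi_{t+1} >= c * t
    and the empirical estimate t / n otherwise, then renormalizes."""
    phi = Counter(counts.values())          # phi[t] = number of symbols seen t times
    per_count = {}
    for t in phi:
        if phi.get(t + 1, 0) >= c * t:      # Good-Turing regime
            per_count[t] = (t + 1) * phi[t + 1] / (n * phi[t])
        else:                               # empirical regime
            per_count[t] = t / n
    unseen = k - len(counts)                # symbols never observed
    per_count[0] = (phi.get(1, 0) / n) / unseen if unseen else 0.0
    q = {x: per_count[counts.get(x, 0)] for x in range(1, k + 1)}
    total = sum(q.values())
    return {x: v / total for x, v in q.items()}


def natural_oracle(counts, p):
    """Oracle-aided natural estimator: a symbol seen t times receives
    S_t / phi_t, with S_t the true total mass of symbols seen t times."""
    phi = Counter(counts.get(x, 0) for x in p)   # includes unseen symbols (t = 0)
    S = Counter()
    for x, px in p.items():
        S[counts.get(x, 0)] += px
    return {x: S[counts.get(x, 0)] / phi[counts.get(x, 0)] for x in p}


# Usage sketch:
#   counts = Counter(sample)          # sample: list of symbols drawn from [k]
#   n, k = len(sample), 500
#   q_laplace = add_constant(counts, k, n, beta=1.0)
#   q_kt      = add_constant(counts, k, n, beta=0.5)
#   q_gt      = good_turing_empirical(counts, k, n)
```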
Conclusion
  • Since Laplace is the optimal estimator when the underlying distribution is generated from the uniform prior, it performs well in Figure 2(e) but poorly on the other distributions.
  • Even though all the estimators have similar-looking regrets for distributions generated by Dirichlet priors (Figures 2(e), 2(f)), the proposed estimator performs better than the estimators not designed for that prior.
  • The authors relate the regret in estimating the distribution to the regret in estimating the combined, or total, probability mass; a sketch of the relevant definitions follows this list.
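The combined probability masses and the associated regret are not reproduced on this page, so the following is a hedged reconstruction of the standard definitions used in this setting (KL loss, per-count masses); the paper's exact notation may differ in details.

```latex
% Sample X^n drawn i.i.d. from p over [k]; N_x = number of times symbol x appears.
% Number of symbols appearing t times, and their combined (total) probability mass:
\[
\varphi_t = \bigl|\{x : N_x = t\}\bigr|,
\qquad
S_t = \sum_{x \,:\, N_x = t} p(x).
\]
% Expected KL loss of an estimator q, and its regret relative to the best
% natural estimator designed with knowledge of p:
\[
\ell_n(q, p) = \mathbb{E}_{X^n \sim p}\!\left[\, D\!\left(p \,\middle\|\, q(X^n)\right) \right],
\qquad
r_n^{\mathrm{nat}}(q, p) = \ell_n(q, p) - \min_{q'\ \mathrm{natural}} \ell_n(q', p).
\]
```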
Reference
  • [1] William A. Gale and Geoffrey Sampson. Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2(3):217–237, 1995.
  • [2] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In ACL, 1996.
  • [3] Liam Paninski. Variational minimax estimation of discrete distributions under KL loss. In NIPS, 2004.
  • [4] Hermann Ney, Ute Essen, and Reinhard Kneser. On structuring probabilistic dependences in stochastic language modelling. Computer Speech & Language, 8(1):1–38, 1994.
  • [5] Frederick Jelinek and Robert L. Mercer. Probability distribution estimation from sparse data. IBM Tech. Disclosure Bull., 1984.
  • [6] Irving J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4):237–264, 1953.
  • [7] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (2nd ed.). Wiley, 2006.
  • [8] R. Krichevsky. Universal Compression and Retrieval. Kluwer, Dordrecht, The Netherlands, 1994.
  • [9] Sudeep Kamath, Alon Orlitsky, Dheeraj Pichapati, and Ananda Theertha Suresh. On learning distributions from their samples. In COLT, 2015.
  • [10] Dietrich Braess and Thomas Sauer. Bernstein polynomials and learning theory. Journal of Approximation Theory, 128(2):187–206, 2004.
  • [11] David A. McAllester and Robert E. Schapire. On the convergence rate of Good-Turing estimators. In COLT, 2000.
  • [12] Evgeny Drukh and Yishay Mansour. Concentration bounds for unigrams language model. In COLT, 2004.
  • [13] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Optimal probability estimation with applications to prediction and classification. In COLT, 2013.
  • [14] Alon Orlitsky, Narayana P. Santhanam, and Junan Zhang. Always Good Turing: Asymptotically optimal probability estimation. In FOCS, 2003.
  • [15] Boris Yakovlevich Ryabko. Twice-universal coding. Problemy Peredachi Informatsii, 1984.
  • [16] Boris Yakovlevich Ryabko. Fast adaptive coding algorithm. Problemy Peredachi Informatsii, 26(4):24–.
  • [17] Dominique Bontemps, Stephane Boucheron, and Elisabeth Gassiat. About adaptive coding on countable alphabets. IEEE Transactions on Information Theory, 60(2):808–821, 2014.
  • [18] Stephane Boucheron, Elisabeth Gassiat, and Mesrob I. Ohannessian. About adaptive coding on countable alphabets: Max-stable envelope classes. CoRR, abs/1402.6305, 2014.
  • [19] David L. Donoho and Iain M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.
  • [20] Felix Abramovich, Yoav Benjamini, David L. Donoho, and Iain M. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. The Annals of Statistics, 2006.
  • [21] Peter J. Bickel, Chris A. Klaassen, Ya'acov Ritov, and Jon A. Wellner. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore, 1993.
  • [22] Andrew Barron, Lucien Birge, and Pascal Massart. Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113(3):301–413, 1999.
  • [23] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2004.
  • [24] Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, and Shengjun Pan. Competitive closeness testing. In COLT, 2011.
  • [25] Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, Shengjun Pan, and Ananda Theertha Suresh. Competitive classification and closeness testing. In COLT, 2012.
  • [26] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. A competitive test for uniformity of monotone distributions. In AISTATS, 2013.
  • [27] Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. In FOCS, 2014.
  • [28] Gregory Valiant and Paul Valiant. Instance optimal learning. CoRR, abs/1504.05321, 2015.
  • [29] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.