Competitive tests and estimators for properties of distributions

Competitive tests and estimators for properties of distributions(2012)

引用 23|浏览6
暂无评分
摘要
We derive competitive tests and estimators for several properties of discrete distributions, based on their i.i.d. sequences. We focus on symmetric properties that depend only on the multiset of probability values in the distributions and not on specific symbols of the alphabet that assume these values. Many applications of probability estimation, statistics and machine learning involve such properties.Our method of probability estimation, called profile maximum likelihood (PML), involves maximizing the likelihood of observing the profile of the given sequences, i.e., the multiset of symbol counts in the sequences. It has been used successfully for universal compression of large alphabet data sources, and has been shown empirically to perform well for other probability estimation problems like classification and distribution multiset estimation. We provide competitive estimation guarantees for the PML method for several such problems.For testing closeness of distributions, i.e., finding whether two given i.i.d. sequences of length n are generated by the same distribution or by two different ones, our schemes have an error probability of at most d˙e7n2/3 whenever the best possible error probability is δ ≤ e-14n2/3 . The running time of our scheme is O (n). We do not make any assumptions on the distributions, including on their support size. In terms of sample complexity, this implies that if there is a closeness test which takes sequences of length n and has error probability at most δ, our tests have the same error guarantee on sequences of length n' = O&parl0;max&cubl0;n3 &parl0;log 14d&parr0; 3,n&cubr0;&parr0; . Similar results are implied for the related problem of classification. For finding the probability multiset of a discrete distribution using a length-n i.i.d. sequence drawn from it, we show the following guarantee for the PML-based estimator. For any class of distributions and any distance metric on their probability multisets, if there is an estimator that approximates all distributions in this class to within a distance of ε 0 with error probability at most δ ≤ e-6n1/2 , then the PML estimator is within a distance of 2ε with error probability at most δ · e6n1/2 . Equivalently, the PML estimator approximates distributions to within a distance of 2ε with error probability delta using sequences of length n' = O&parl0;max&cubl0;n2 &parl0;log 14d&parr0; 2,n&cubr0;&parr0; . Thus, this estimator is competitive with other estimators, including the one by Valiant et al. that approximates distributions of superlinear support size k = O (ε2.1n log(n)) to within a relative earthmover distance of ε and whose error probability can be shown to be at most e-n0.9 . However, unlike the case of closeness testing, we do not yet have efficient schemes for computing the PML distribution. We extend the results for PML for distribution multiset estimation to two related problems of estimating the parameter multiset of multiple distributions or processes. These include the problems of estimating the multiset of success probabilities of Bernoulli processes, and the multiset of means of Poisson distributions.
更多
查看译文
关键词
probability multiset,PML estimator,error probability,possible error probability,discrete distribution,probability estimation problem,distribution multiset estimation,length n,competitive test,error probability delta,probability estimation
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要