Sample-Optimal Identity Testing with High Probability.

International Colloquium on Automata, Languages, and Programming (ICALP), 2018

Abstract
We study the problem of testing identity against a given distribution (a.k.a. goodness-of-fit) with a focus on the high confidence regime. More precisely, given samples from an unknown distribution $p$ over $n$ elements, an explicitly given distribution $q$, and parameters $0 < \epsilon, \delta < 1$, we wish to distinguish, with probability at least $1-\delta$, whether the distributions are identical versus $\epsilon$-far in total variation (or statistical) distance. Existing work has focused on the constant confidence regime, i.e., the case that $\delta = \Omega(1)$, for which the sample complexity of identity testing is known to be $\Theta(\sqrt{n}/\epsilon^2)$. Typical applications of distribution property testing require small values of the confidence parameter $\delta$ (which correspond to small $p$-values in the statistical hypothesis testing terminology). Prior work achieved arbitrarily small values of $\delta$ via black-box amplification, which multiplies the required number of samples by $\Theta(\log(1/\delta))$. We show that this upper bound is suboptimal for any $\delta = o(1)$, and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is \[ \Theta\left( \frac{1}{\epsilon^2}\left(\sqrt{n \log(1/\delta)} + \log(1/\delta) \right)\right) \] for any $n$, $\epsilon$, and $\delta$. For the special case of uniformity testing, where the given distribution is the uniform distribution $U_n$ over the domain, our new tester is surprisingly simple: to test whether $p = U_n$ versus $\mathrm{d}_{TV}(p, U_n) \geq \epsilon$, we simply threshold $\mathrm{d}_{TV}(\hat{p}, U_n)$, where $\hat{p}$ is the empirical probability distribution. We believe that our novel analysis techniques may be useful for other distribution testing problems as well.
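The uniformity tester described in the abstract is simple enough to sketch directly: form the empirical distribution $\hat{p}$ from the samples, compute $\mathrm{d}_{TV}(\hat{p}, U_n)$, and compare it to a threshold. Below is a minimal Python sketch of this plug-in statistic. The abstract does not specify the threshold value (in the paper it is determined by the analysis as a function of $n$, $\epsilon$, $\delta$, and the sample size), so the `threshold` argument, the function name, and the sample size in the usage snippet are assumptions for illustration only.

```python
import numpy as np

def uniformity_test_tv_threshold(samples, n, threshold):
    """Sketch of the empirical-TV uniformity tester from the abstract.

    Accepts "p = U_n" iff d_TV(p_hat, U_n) < threshold, where p_hat is the
    empirical distribution of the samples over the domain {0, ..., n-1}.
    The concrete threshold comes from the paper's analysis and is NOT
    reproduced here; it is an assumed input.
    """
    samples = np.asarray(samples)
    counts = np.bincount(samples, minlength=n)   # occurrences of each domain element
    p_hat = counts / counts.sum()                # empirical distribution p_hat
    tv = 0.5 * np.abs(p_hat - 1.0 / n).sum()     # d_TV(p_hat, U_n)
    return tv < threshold                        # True -> accept "p = U_n"

# Hypothetical usage: samples drawn from the uniform distribution itself.
rng = np.random.default_rng(0)
n = 1000
samples = rng.integers(0, n, size=5 * int(np.sqrt(n)))          # sample size chosen for illustration only
print(uniformity_test_tv_threshold(samples, n, threshold=0.9))  # threshold chosen arbitrarily
```

The contribution of the paper is the analysis showing that this plain plug-in statistic, with a suitable threshold, already achieves the optimal $\Theta\left(\frac{1}{\epsilon^2}\left(\sqrt{n \log(1/\delta)} + \log(1/\delta)\right)\right)$ sample complexity; the sketch above only illustrates the statistic being thresholded.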