Sample-Optimal Identity Testing with High Probability.
International Colloquium on Automata, Languages, and Programming (ICALP 2018)
Abstract
We study the problem of testing identity against a given distribution (a.k.a. goodness-of-fit) with a focus on the high confidence regime. More precisely, given samples from an unknown distribution $p$ over $n$ elements, an explicitly given distribution $q$, and parameters $0 < \epsilon, \delta < 1$, we wish to distinguish, with probability at least $1-\delta$, whether the distributions are identical versus $\epsilon$-far in total variation (or statistical) distance. Existing work has focused on the constant confidence regime, i.e., the case that $\delta = \Omega(1)$, for which the sample complexity of identity testing is known to be $\Theta(\sqrt{n}/\epsilon^2)$. Typical applications of distribution property testing require small values of the confidence parameter $\delta$ (which correspond to small $p$-values in the statistical hypothesis testing terminology). Prior work achieved arbitrarily small values of $\delta$ via black-box amplification, which multiplies the required number of samples by $\Theta(\log(1/\delta))$. We show that this upper bound is suboptimal for any $\delta = o(1)$, and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is \[ \Theta\left( \frac{1}{\epsilon^2}\left(\sqrt{n \log(1/\delta)} + \log(1/\delta) \right)\right) \] for any $n$, $\epsilon$, and $\delta$. For the special case of uniformity testing, where the given distribution is the uniform distribution $U_n$ over the domain, our new tester is surprisingly simple: to test whether $p = U_n$ versus $\mathrm{d}_{TV}(p, U_n) \geq \epsilon$, we simply threshold $\mathrm{d}_{TV}(\hat{p}, U_n)$, where $\hat{p}$ is the empirical probability distribution. We believe that our novel analysis techniques may be useful for other distribution testing problems as well.
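The uniformity tester described above is simple enough to sketch in a few lines: form the empirical distribution $\hat{p}$ from the samples, compute its total variation distance to $U_n$, and accept iff that distance falls below a threshold. The sketch below illustrates this; the concrete threshold is left as a free parameter, since the paper's analysis sets it as a function of $n$, the sample size, $\epsilon$, and $\delta$, and the function names here are illustrative, not from the paper.

```python
from collections import Counter

def empirical_tv_to_uniform(samples, n):
    """Total variation distance between the empirical distribution of
    `samples` (over the domain {0, ..., n-1}) and the uniform U_n.
    d_TV(p_hat, U_n) = (1/2) * sum_i |p_hat(i) - 1/n|."""
    counts = Counter(samples)
    m = len(samples)
    return 0.5 * sum(abs(counts.get(i, 0) / m - 1.0 / n) for i in range(n))

def uniformity_tester(samples, n, threshold):
    """Accept 'p = U_n' iff the empirical TV distance is at most the
    threshold. NOTE: the threshold here is a placeholder; the paper's
    analysis chooses it to achieve the optimal sample complexity."""
    return empirical_tv_to_uniform(samples, n) <= threshold
```

For example, four samples hitting each of four domain elements once give empirical TV distance 0 to $U_4$, while four samples all landing on one element give distance $0.75$; with enough samples and a suitably chosen threshold, the tester separates $p = U_n$ from $\mathrm{d}_{TV}(p, U_n) \geq \epsilon$ with probability $1-\delta$.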