Testing Identity of Multidimensional Histograms.

COLT(2019)

Abstract
We investigate the problem of identity testing for multidimensional histogram distributions. A distribution $p: D \rightarrow \mathbb{R}_+$, where $D \subseteq \mathbb{R}^d$, is called a $k$-histogram if there exists a partition of the domain into $k$ axis-aligned rectangles such that $p$ is constant within each such rectangle. Histograms are one of the most fundamental non-parametric families of distributions and have been extensively studied in computer science and statistics. We give the first identity tester for this problem with {\em sub-learning} sample complexity in any fixed dimension and a nearly-matching sample complexity lower bound. More specifically, let $q$ be an unknown $d$-dimensional $k$-histogram and $p$ be an explicitly given $k$-histogram. We want to correctly distinguish, with probability at least $2/3$, between the case that $p = q$ versus $\|p-q\|_1 \geq \epsilon$. We design a computationally efficient algorithm for this hypothesis testing problem with sample complexity $O((\sqrt{k}/\epsilon^2) \log^{O(d)}(k/\epsilon))$. Our algorithm is robust to model misspecification, i.e., succeeds even if $q$ is only promised to be {\em close} to a $k$-histogram. Moreover, for $k = 2^{\Omega(d)}$, we show a nearly-matching sample complexity lower bound of $\Omega((\sqrt{k}/\epsilon^2) (\log(k/\epsilon)/d)^{\Omega(d)})$ when $d \geq 2$. Prior to our work, the sample complexity of the $d=1$ case was well-understood, but no algorithm with sub-learning sample complexity was known, even for $d=2$. Our new upper and lower bounds have interesting conceptual implications regarding the relation between learning and testing in this setting.
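To make the objects in the abstract concrete, the following is a minimal Python sketch of a $d$-dimensional $k$-histogram and of the $\ell_1$ distance $\|p-q\|_1$ that the testing problem is stated in terms of. It is not the paper's tester: the names `Rect`, `total_mass`, `sample`, and `l1_distance` are hypothetical, the domain is assumed to be the unit square, and the distance is computed exactly from two explicitly given histograms, whereas in the testing problem $q$ is accessible only through samples.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rect:
    """One axis-aligned rectangle [lo, hi) carrying a constant density."""
    lo: Tuple[float, ...]
    hi: Tuple[float, ...]
    density: float

    def volume(self) -> float:
        v = 1.0
        for a, b in zip(self.lo, self.hi):
            v *= b - a
        return v


def total_mass(hist: List[Rect]) -> float:
    """Integral of the histogram; equals 1 for a probability distribution."""
    return sum(r.density * r.volume() for r in hist)


def sample(hist: List[Rect], rng: random.Random) -> Tuple[float, ...]:
    """Draw one point: pick a rectangle by its mass, then a uniform point in it."""
    weights = [r.density * r.volume() for r in hist]
    r = rng.choices(hist, weights=weights, k=1)[0]
    return tuple(a + (b - a) * rng.random() for a, b in zip(r.lo, r.hi))


def l1_distance(p: List[Rect], q: List[Rect]) -> float:
    """Exact ||p - q||_1, assuming p and q partition the same domain.

    On each pairwise intersection of rectangles both densities are constant,
    so the integral of |p - q| is a finite sum over intersections.
    """
    total = 0.0
    for rp in p:
        for rq in q:
            vol = 1.0
            for a1, b1, a2, b2 in zip(rp.lo, rp.hi, rq.lo, rq.hi):
                side = min(b1, b2) - max(a1, a2)
                if side <= 0:
                    vol = 0.0
                    break
                vol *= side
            total += vol * abs(rp.density - rq.density)
    return total


# Illustrative 2-histograms on [0,1)^2 with different partitions (d = 2, k = 2).
p = [Rect((0.0, 0.0), (0.5, 1.0), 1.5),   # left half
     Rect((0.5, 0.0), (1.0, 1.0), 0.5)]   # right half
q = [Rect((0.0, 0.0), (1.0, 0.5), 1.2),   # bottom half
     Rect((0.0, 0.5), (1.0, 1.0), 0.8)]   # top half

distance = l1_distance(p, q)
```

In this example the two partitions cross, so the common refinement has four cells of volume $1/4$ each, and the distance is the sum of the four absolute density gaps weighted by that volume. An identity tester would have to decide between $p = q$ and $\|p-q\|_1 \geq \epsilon$ using only draws such as `sample(q, rng)`, without access to the rectangles of $q$.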