Sublinear estimation of entropy and information distances

ACM Transactions on Algorithms (2009)

Cited 25 | Viewed 52
Abstract
In many data mining and machine learning problems, the data items that need to be clustered or classified are not arbitrary points in a high-dimensional space, but are distributions, that is, points on a high-dimensional simplex. For distributions, natural measures are not ℓp distances, but information-theoretic measures such as the Kullback-Leibler and Hellinger divergences. Similarly, quantities such as the entropy of a distribution are more natural than frequency moments. Efficient estimation of these quantities is a key component in algorithms for manipulating distributions. Since the datasets involved are typically massive, these algorithms need to have only sublinear complexity in order to be feasible in practice. We present a range of sublinear-time algorithms in various oracle models in which the algorithm accesses the data via an oracle that supports various queries. In particular, we answer a question posed by Batu et al. on testing whether two distributions are close in an information-theoretic sense given independent samples. We then present optimal algorithms for estimating various information-divergences and entropy with a more powerful oracle called the combined oracle that was also considered by Batu et al. Finally, we consider sublinear-space algorithms for these quantities in the data-stream model. In the course of doing so, we explore the relationship between the aforementioned oracle models and the data-stream model. This continues work initiated by Feigenbaum et al. An important additional component to the study is considering data streams that are ordered randomly rather than just those which are ordered adversarially.
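To make the quantities in the abstract concrete, the following sketch computes a naive plug-in entropy estimate from i.i.d. samples and the Hellinger distance between two explicitly given distributions. This is purely illustrative and is not the paper's sublinear algorithm: the plug-in estimator requires a number of samples roughly linear in the support size, which is exactly the cost the paper's oracle-model algorithms improve upon. All function names here are hypothetical.

```python
import math
from collections import Counter

def empirical_entropy(samples):
    # Plug-in (maximum-likelihood) entropy estimate in nats:
    # H_hat = -sum_x p_hat(x) * ln p_hat(x), with p_hat the empirical frequencies.
    # Needs many samples to be accurate; the paper studies sublinear alternatives.
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def hellinger(p, q):
    # Hellinger distance between two distributions given as dicts item -> probability:
    # H(p, q) = sqrt( (1/2) * sum_x (sqrt(p(x)) - sqrt(q(x)))^2 ), in [0, 1].
    support = set(p) | set(q)
    s = sum((math.sqrt(p.get(x, 0.0)) - math.sqrt(q.get(x, 0.0))) ** 2
            for x in support)
    return math.sqrt(s / 2.0)
```

For example, `empirical_entropy(['a', 'a', 'b', 'b'])` returns ln 2, and `hellinger` returns 0 for identical distributions and 1 for distributions with disjoint supports.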
Keywords
property testing, data mining, data streams, entropy, information divergences, information distances, data-stream model, oracle models, combined oracle, sublinear estimation, Kullback-Leibler divergence, machine learning