Empirical Lossless Compression Bound of a Data Sequence.
CoRR (2023)
Abstract
We consider the lossless compression bound of any single data sequence. If we
fit the data by a parametric model, the entropy quantity $nH({\hat \theta}_n)$
obtained by plugging in the maximum likelihood estimate is an underestimate of
the bound, where $n$ is the number of words. Shtarkov showed that the
normalized maximum likelihood (NML) distribution or code length is optimal in a
minimax sense for any parametric family. We show by local asymptotic
normality that the NML code length for exponential families is $nH(\hat
\theta_n) +\frac{d}{2}\log \, \frac{n}{2\pi} +\log \int_{\Theta}
|I(\theta)|^{1/2}\, d\theta+o(1)$, where $d$ is the model dimension or
dictionary size, and $|I(\theta)|$ is the determinant of the Fisher information
matrix. We also demonstrate that sequentially predicting the optimal code
length for the next word via a Bayesian mechanism leads to the mixture code,
whose pathwise length is given by $nH({\hat \theta}_n) +\frac{d}{2}\log \,
\frac{n}{2\pi} +\log \frac{|\, I({\hat \theta}_n)|^{1/2}}{w({\hat
\theta}_n)}+o(1)$, where $w(\theta)$ is a prior. The asymptotics apply not
only to discrete symbols but also to continuous data if the code length for the
former is replaced by the description length of the latter. The analytical
result is exemplified by calculating compression bounds of protein-encoding DNA
sequences under different parsing models. Typically, the highest compression is
achieved when the parsing is in phase with the amino acid codons. On the other
hand, the compression rates of pseudo-random sequences exceed 1 regardless of
the parsing model. These model-based results are consistent with the assertion
of Kolmogorov complexity theory that random sequences are incompressible. The
empirical lossless compression bound is especially accurate when the dictionary
size is relatively large.
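As an illustration of the stated asymptotics (not part of the paper), the following minimal Python sketch compares the exact Shtarkov/NML code length with the expansion $nH(\hat\theta_n) + \frac{d}{2}\log\frac{n}{2\pi} + \log\int_{\Theta}|I(\theta)|^{1/2}\,d\theta$ for the one-parameter Bernoulli family, where $I(\theta)=1/(\theta(1-\theta))$ and $\int_0^1 |I(\theta)|^{1/2}\,d\theta=\pi$, and also evaluates the prior-dependent constant of the mixture code. The Bernoulli example and all function names are illustrative assumptions, not the paper's DNA parsing models; SciPy is assumed for gammaln and logsumexp.

import math
from scipy.special import gammaln, logsumexp

def binary_entropy(p):
    # H(p) in nats, with the convention 0 * log 0 = 0.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def exact_nml_length(k, n):
    # Exact Shtarkov (NML) code length in nats for a binary sequence of length n
    # with k ones: n*H(theta_hat) plus the log of the normalizer, i.e. the sum of
    # maximized likelihoods over all 2^n sequences, grouped by their count of ones.
    log_terms = [gammaln(n + 1) - gammaln(j + 1) - gammaln(n - j + 1)
                 - n * binary_entropy(j / n) for j in range(n + 1)]
    return n * binary_entropy(k / n) + logsumexp(log_terms)

def asymptotic_nml_length(k, n):
    # n*H(theta_hat) + (1/2) log(n / 2 pi) + log(integral of |I(theta)|^{1/2}),
    # where the integral equals pi for the Bernoulli family (d = 1).
    return (n * binary_entropy(k / n)
            + 0.5 * math.log(n / (2.0 * math.pi))
            + math.log(math.pi))

def asymptotic_mixture_length(k, n, prior):
    # Pathwise mixture code length: n*H(theta_hat) + (1/2) log(n / 2 pi)
    # + log(|I(theta_hat)|^{1/2} / w(theta_hat)), with I(theta) = 1/(theta(1-theta)).
    t = k / n
    return (n * binary_entropy(t)
            + 0.5 * math.log(n / (2.0 * math.pi))
            + math.log(1.0 / math.sqrt(t * (1.0 - t)) / prior(t)))

if __name__ == "__main__":
    n, k = 1000, 400  # a hypothetical sequence of 1000 symbols with 400 ones
    jeffreys = lambda t: 1.0 / (math.pi * math.sqrt(t * (1.0 - t)))
    uniform = lambda t: 1.0
    print("exact NML          :", exact_nml_length(k, n))
    print("asymptotic NML     :", asymptotic_nml_length(k, n))
    print("mixture (Jeffreys) :", asymptotic_mixture_length(k, n, jeffreys))
    print("mixture (uniform)  :", asymptotic_mixture_length(k, n, uniform))

The Jeffreys-prior mixture reproduces the NML constant $\log\pi$, while other priors shift the $O(1)$ term, consistent with the two expansions quoted in the abstract.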
Keywords
compression, data