Computable Bounds and Monte Carlo Estimates of the Expected Edit Distance
arXiv (2022)
Abstract
The edit distance is a metric of dissimilarity between strings, widely
applied in computational biology, speech recognition, and machine learning. Let
e_k(n) denote the average edit distance between random, independent strings
of n characters from an alphabet of size k. For k ≥ 2, it is an open
problem how to efficiently compute the exact value of α_k(n) =
e_k(n)/n as well as of α_k = lim_{n→∞} α_k(n), a
limit known to exist.
This paper shows that α_k(n) - Q(n) ≤ α_k ≤ α_k(n), for
a specific Q(n) = Θ(√(log n / n)), a result which implies that
α_k is computable. The exact computation of α_k(n) is explored,
leading to an algorithm running in time T = 𝒪(n^2 k min(3^n, k^n)), a
complexity that makes it of limited practical use.
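To make the definition of α_k(n) concrete, the following is a minimal brute-force sketch, not the algorithm of the paper. It assumes that edit distance means unit-cost insertions, deletions, and substitutions (Levenshtein distance), and it enumerates all k^n × k^n string pairs, so it is usable only for very small n. The function names `edit_distance` and `alpha_exact_bruteforce` are illustrative and do not come from the paper.

```python
from itertools import product


def edit_distance(x, y):
    """Levenshtein distance between sequences x and y (two-row dynamic program)."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty prefix of y
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def alpha_exact_bruteforce(k, n):
    """Exact alpha_k(n) = e_k(n)/n by enumerating every pair of strings.

    Assumption: characters are uniform and independent, so every pair of
    length-n strings over an alphabet of size k is equally likely.
    """
    strings = list(product(range(k), repeat=n))
    total = sum(edit_distance(x, y) for x in strings for y in strings)
    return total / (len(strings) ** 2 * n)
```

For n = 1 the distance is 0 when the two characters match and 1 otherwise, so the sketch returns exactly 1 - 1/k; the cost grows as k^(2n) n^2, which is why it does not scale.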
An analysis of statistical estimates is proposed, based on McDiarmid's
inequality, showing how α_k(n) can be evaluated with good accuracy, high
confidence level, and reasonable computation time, for values of n up to, say,
a quarter million. Correspondingly, 99.9% confidence intervals of width
approximately 10^-2 are obtained for α_k.
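The statistical approach can be illustrated with a small sketch under stated assumptions; it is not the estimator or the constants used in the paper. The sample mean over N independent string pairs is a function of 2nN independent characters, and changing a single character changes that mean by at most 1/N, so McDiarmid's inequality gives P(|estimate - α_k(n)| ≥ ε) ≤ 2 exp(-ε² n N). Solving for ε at failure probability δ yields the half-width √(ln(2/δ) / (nN)). The names `mc_alpha`, `num_pairs`, and `delta` are hypothetical, and the quadratic-time distance routine below would be far too slow for n near a quarter million.

```python
import math
import random


def edit_distance(x, y):
    """Levenshtein distance between sequences x and y (two-row dynamic program)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = curr
    return prev[-1]


def mc_alpha(k, n, num_pairs, delta=1e-3, seed=0):
    """Monte Carlo estimate of alpha_k(n) with a McDiarmid confidence half-width.

    Returns (estimate, half_width) such that, with probability at least
    1 - delta, |estimate - alpha_k(n)| <= half_width.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(num_pairs):
        # Two independent uniform random strings of length n over k symbols.
        x = [rng.randrange(k) for _ in range(n)]
        y = [rng.randrange(k) for _ in range(n)]
        total += edit_distance(x, y)
    estimate = total / (num_pairs * n)
    # From 2 * exp(-eps^2 * n * num_pairs) = delta.
    half_width = math.sqrt(math.log(2 / delta) / (n * num_pairs))
    return estimate, half_width
```

With `delta=1e-3` the interval has confidence 99.9% for α_k(n); turning it into an interval for α_k would additionally require the Q(n) term from the bound α_k(n) - Q(n) ≤ α_k ≤ α_k(n).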
Combinatorial arguments on edit scripts are exploited to analytically
characterize an efficiently computable lower bound β_k^* to α_k,
such that lim_{k→∞} β_k^* = 1. In general, β_k^* ≤ α_k ≤ 1 - 1/k; for k greater than a few dozen, computing β_k^*
is much faster than generating good statistical estimates with confidence
intervals of width 1 - 1/k - β_k^*.
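The upper bound 1 - 1/k can be recovered by a standard argument, sketched here as an aid to the reader rather than as the paper's derivation. The symbols d_E (edit distance), d_H (Hamming distance), and X, Y (independent uniform random strings of length n) are notation assumed for this sketch. For equal-length strings, substituting every mismatched position is a valid edit script, so the edit distance never exceeds the Hamming distance:

```latex
\[
d_E(x,y) \le d_H(x,y)
\quad\Longrightarrow\quad
e_k(n) \le \mathbb{E}\left[d_H(X,Y)\right]
       = \sum_{i=1}^{n} \Pr[X_i \ne Y_i]
       = n\left(1 - \frac{1}{k}\right),
\]
\[
\alpha_k \le \alpha_k(n) = \frac{e_k(n)}{n} \le 1 - \frac{1}{k}.
\]
```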
The techniques developed in the paper yield improvements on most previously
published numerical values as well as results for alphabet sizes and string
lengths not reported before.