Some Investigations on Similarity Measures Based on Absent Words.

Fundamenta Informaticae (2020)

Abstract
In this paper we investigate similarity measures based on minimal absent words, introduced by Chairungsee and Crochemore in [1]. They make use of a length-weighted index on a sample set corresponding to the symmetric difference M(x) Δ M(y) of the sets of minimal absent words M(x) and M(y) of two sequences x and y, respectively. We first propose a variant of this measure by choosing as a sample set a proper subset D(x, y) of M(x) Δ M(y), which appears to be more appropriate for distinguishing x and y. From the algebraic point of view, we prove that D(x, y) is the base of the ideal generated by M(x) Δ M(y). We then remark that such measures are able to recognize whether the sequences x and y share a common structure, but they are not able to detect a difference in the number of occurrences of such a structure in the two sequences. In order to take such multiplicity into account, we introduce the notion of multifactor, and define a new measure that uses both absent words and multifactors. Surprisingly, we prove that this similarity measure coincides with a distance on sequences introduced by Ehrenfeucht and Haussler in [2], in the context of block-moves strategies. In this way, our result creates a nontrivial bridge between similarity measures based on absent words and those based on the block-moves approach.
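As an illustration of the objects named in the abstract, the following is a minimal brute-force sketch, not the authors' algorithm. It uses the standard definition of a minimal absent word (a word that does not occur in x while all of its proper factors do) and forms the symmetric difference M(x) Δ M(y). The function names are hypothetical, and the weight 1/|w|^2 in `length_weighted_index` is an assumption, since the abstract does not state the weighting.

```python
def factors(x):
    """Return the set of all factors (substrings) of x, including the empty word."""
    return {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}


def minimal_absent_words(x, alphabet):
    """Brute-force M(x): absent words whose proper factors all occur in x."""
    f = factors(x)
    # A single letter is a minimal absent word when it does not occur in x.
    maws = {a for a in alphabet if a not in f}
    # A longer word a+u+b is minimal absent when a+u and u+b occur but a+u+b does not.
    for u in f:
        for a in alphabet:
            if a + u not in f:
                continue
            for b in alphabet:
                if u + b in f and a + u + b not in f:
                    maws.add(a + u + b)
    return maws


def sample_set(x, y, alphabet):
    """Symmetric difference M(x) Δ M(y)."""
    return minimal_absent_words(x, alphabet) ^ minimal_absent_words(y, alphabet)


def length_weighted_index(x, y, alphabet):
    """Sum of 1/|w|^2 over the sample set (the weighting is an assumption)."""
    return sum(1.0 / len(w) ** 2 for w in sample_set(x, y, alphabet))


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab", "ab")))
    print(sorted(sample_set("ab", "ba", "ab")))
    print(length_weighted_index("ab", "ba", "ab"))
```

The enumeration is quadratic in the number of factors and is meant only for small examples; it does not compute the subset D(x, y) or the multifactor-based measure described above.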
Keywords
Minimal absent words, similarity measures, sequence comparison