LsHASHq: A string matching algorithm exploiting longer q-gram shifting

Information Processing & Management (2022)

Abstract
String matching is a classical computer science problem in which we search for all occurrences of a string of length m, typically called the pattern, in a text of length n, where both strings are drawn from the same alphabet. It is an essential task in many applications, such as data mining, web search engines, bioinformatics, and natural language processing. Fast hash-based algorithms were developed to speed up the search; they compare hash values (signatures) of substrings instead of individual letters. The hash function allows exploiting bitwise operations while taking the sizes of the alphabet and the pattern into account. However, the efficiency of these hash algorithms still leaves room for improvement. The problem with q-gram hash algorithms is that a shift skips at most m−q+1 positions, where q is the length of the hashed q-gram. For a fixed m, the number of skipped positions therefore decreases as q increases. This paper presents a new variation of the q-gram hash algorithm that lengthens the shift by skipping at most m positions over the text. Theoretically, the proposed hash algorithm, named Longer shift HASHq (LsHASHq), has a longer shift than the state-of-the-art hash algorithms. Experimentally, the new algorithm is the fastest among the following algorithms: BNDMq, BXSq, EPSM, FHASHq, FSBNDMq, HASHq, LWFRq, QLQS, SBNDMq, TWFRq, and WFRq on different natural language texts for m>10. On a human genome sequence, the new algorithm was the second fastest for short patterns of length 10.
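To make the m−q+1 bound concrete, here is a minimal Python sketch of a generic Horspool-style q-gram hash search. It is not the LsHASHq algorithm, whose details the abstract does not give, and it is not a faithful port of HASHq either. The function names, the hash function (shift-and-add modulo a table size of 256), and the default q are assumptions chosen for illustration only.

```python
def qgram_hash(s, end, q, table_size=256):
    """Hash the q characters of s that end at index `end` (inclusive).

    The shift-and-add scheme below is an illustrative assumption.
    """
    h = 0
    for k in range(end - q + 1, end + 1):
        h = ((h << 1) + ord(s[k])) % table_size
    return h


def build_shift_table(pattern, q, table_size=256):
    """Map each hash value to a safe shift for the search window.

    A q-gram that does not occur in the pattern (up to hash collisions)
    gets the maximum shift m - q + 1, which is the bound discussed above.
    """
    m = len(pattern)
    shift = [m - q + 1] * table_size
    # Every q-gram except the last one; later (rightmost) occurrences
    # overwrite earlier ones, so the smallest shift is kept.
    for i in range(q - 1, m - 1):
        shift[qgram_hash(pattern, i, q, table_size)] = m - 1 - i
    return shift


def qgram_shift_search(pattern, text, q=3, table_size=256):
    """Return the start indices of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if not 1 <= q <= m:
        raise ValueError("q must satisfy 1 <= q <= len(pattern)")
    if n < m:
        return []
    shift = build_shift_table(pattern, q, table_size)
    matches = []
    j = m - 1  # index in text of the last character of the current window
    while j < n:
        if text[j - m + 1 : j + 1] == pattern:
            matches.append(j - m + 1)
        # Shift by the table entry for the q-gram ending the window.
        j += shift[qgram_hash(text, j, q, table_size)]
    return matches
```

For example, `qgram_shift_search("abra", "abracadabra", q=2)` returns `[0, 7]`. In this sketch the largest possible jump is `m - q + 1`, so increasing q for a fixed m shrinks the best-case shift, which is the limitation the paper sets out to remove.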
Keywords
String matching algorithms, Pattern matching, q-gram hashing, Online search, Sequence analysis