A generalized sequence pattern matching algorithm using complementary dual-seeding

BIBM(2010)

Abstract
In this work, we define generalized (sequence) patterns, which are motivated by several real biological problems, including transcription factors (TFs) binding to transcription factor binding sites (TFBSs), cis-regulatory modules, protein domain analysis, and alternative splicing. Simply put, a generalized pattern is composed of several substrings with a gap between each pair of consecutive substrings. We propose a generalized pattern matching algorithm that uses a complementary dual-seeding strategy, which remains sensitive in the presence of errors (both mismatches and indels). We also develop a generalized pattern matching tool, which is, to our knowledge, the first developed specifically for generalized pattern matching. Rather than replacing existing general-purpose matching tools such as BLAST, BLAT, and PatternHunter, our tool provides an alternative and helps users solve real problems, especially those that can be modeled as generalized patterns. In experiments, we use data randomly sampled from reference sequences of the human genome (NCBI build v18) and hit 98.74% of generalized patterns on average. The tool runs on both Linux and Windows platforms, and its peak memory usage is only slightly above 1 GB.
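
To make the definition of a generalized pattern concrete, the following is a minimal sketch of a naive exact matcher. It is not the complementary dual-seeding algorithm from the paper, and it does not handle mismatches or indels. The function name `find_generalized`, the representation of a pattern as a list of substrings (`parts`), and the per-gap `(min, max)` length ranges (`gaps`) are all assumptions made for illustration; the abstract does not specify how gap lengths are constrained.

```python
from typing import List, Tuple


def find_generalized(
    text: str, parts: List[str], gaps: List[Tuple[int, int]]
) -> List[List[int]]:
    """Find exact occurrences of a generalized pattern in `text`.

    `parts` are the substrings of the pattern, in order.
    `gaps[i]` is the (min, max) number of characters allowed between
    parts[i] and parts[i + 1] (a hypothetical representation).
    Returns one list of start positions (one per part) for each occurrence.
    """
    if len(gaps) != len(parts) - 1:
        raise ValueError("need exactly one gap range between consecutive parts")

    hits: List[List[int]] = []

    def extend(idx: int, starts: List[int]) -> None:
        # All parts placed: record this occurrence.
        if idx == len(parts):
            hits.append(list(starts))
            return
        prev_end = starts[-1] + len(parts[idx - 1])
        lo, hi = gaps[idx - 1]
        # Try every allowed gap length before the next part.
        for gap in range(lo, hi + 1):
            pos = prev_end + gap
            if text.startswith(parts[idx], pos):
                extend(idx + 1, starts + [pos])

    # Anchor on every occurrence of the first part, then extend rightwards.
    start = text.find(parts[0])
    while start != -1:
        extend(1, [start])
        start = text.find(parts[0], start + 1)
    return hits


# Usage: two substrings separated by a gap of 2 to 5 characters.
matches = find_generalized("TTACGTAAAGGCTT", ["ACG", "GGC"], [(2, 5)])
```

This brute-force scan only illustrates what counts as a match. A practical tool would replace the linear scan with an index (the keywords mention a k-mers index) and a seeding strategy that tolerates errors.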
Keywords
generalized sequence pattern matching algorithm,human genome,patternhunter tool,generalized patterns,complementary dualseeding,genomics,cis regulatory module,proteins,cis-regulatory module,pattern matching,protein domain analysis,alternative splicing,linux,blast tool,k-mers index,blat tool,windows platform,bioinformatics,transcription factor binding site,complementary dual seeding,splicing,protein domains,transcription factor,indexes,indexation