CS2A: A Compressed Suffix Array-Based Method for Short Read Alignment

2016 Data Compression Conference (DCC) (2016)

Abstract
Next-generation sequencing technologies generate an enormous number of short reads, which poses a significant computational challenge for short read alignment. Furthermore, because of sequence polymorphisms in a population, repetitive sequences, and sequencing errors, it remains difficult to align all reads correctly. We propose a space-efficient compressed suffix array-based method for short read alignment (CS2A) whose space usage achieves the high-order empirical entropy of the input string. Unlike BWA, which uses two bits to represent a nucleotide and is therefore suited to constant-sized alphabets, our encoding scheme can be applied to a string over any alphabet. In addition, we present approximate pattern matching on the compressed suffix array (CSA) for short read alignment. CS2A supports both mismatch and gapped alignments for single-end and paired-end read mapping, and can efficiently align short sequencing reads to genome sequences. The experimental results show that CS2A can compete with popular aligners in memory usage and mapping accuracy. The source code is available online.
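To make the idea of suffix array-based read lookup concrete, the following is a minimal sketch of exact pattern matching on a plain, uncompressed suffix array. It is not the CS2A method: it omits the compression and the approximate (mismatch and gapped) matching described in the abstract, and the function names `build_suffix_array` and `find_occurrences`, as well as the toy reference string, are hypothetical and chosen only for illustration.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text` in sorted order.

    Naive construction (sorts full suffix strings); for illustration only,
    not suitable for genome-scale inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return sorted positions where `pattern` occurs exactly in `text`.

    Uses two binary searches over the suffix array `sa` to find the
    contiguous range of suffixes that start with `pattern`.
    """
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


# Toy reference and read (hypothetical data, not from the paper).
reference = "ACGTACGTTACG"
sa = build_suffix_array(reference)
hits = find_occurrences(reference, sa, "ACG")
print(hits)
```

A compressed suffix array answers the same kind of range query without storing the full array explicitly, which is where the space savings claimed in the abstract would come from.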
Keywords
CS2A, short-read alignment, next generation sequencing technologies, sequencing errors, repetitive sequences, space-efficient compressed suffix array-based method, high-order empirical entropy, BWA, constant-sized alphabets, pattern matching, memory usage, mapping accuracy, source code