No-Reference Compression of Genomic Data Stored in FASTQ Format

BIBM(2011)

Abstract
In this paper, we propose a system to compress Next Generation Sequencing (NGS) information stored in a FASTQ file. A FASTQ file contains text, DNA read, and quality information for millions or billions of reads. The proposed system first parses the FASTQ file into its component fields. In a partial first pass, it gathers statistics that are then used to choose, for each field, the representation that gives the best compression. Text data is further parsed into repeating and variable components, and entropy coding is used to compress the latter. Similarly, Markov encoding and repeat-finding methods are used for DNA read compression. Finally, we propose several run-length-based methods to encode quality data, choosing the method that gives the best performance for a given set of quality values. The compression system supports lossless and nearly lossless compression, as well as compressing only the read data or only the read and quality data. We compare its performance to the bzip2 text compression utility and an existing benchmark algorithm, and observe that the proposed system outperforms both.
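The abstract does not spell out the field parser or the individual run-length variants, so the following is only a minimal sketch of the general idea: split each FASTQ record into its text, read, and quality fields, then apply plain run-length encoding to the quality string. The names `FastqRecord`, `parse_fastq`, `run_length_encode`, and `run_length_decode` are hypothetical, and the sketch assumes the common four-line record layout with no line wrapping; it is not the authors' implementation.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class FastqRecord(NamedTuple):
    """One FASTQ record split into its fields (hypothetical container)."""
    header: str    # text field: line 1 without the leading '@'
    sequence: str  # DNA read: line 2
    quality: str   # per-base quality characters: line 4


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records, assuming exactly four lines per record (no wrapping)."""
    stripped = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times to consume lines in groups of 4
    for header, sequence, plus, quality in zip(*[stripped] * 4):
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"read/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def run_length_encode(quality: str) -> List[Tuple[str, int]]:
    """Plain run-length encoding: (symbol, run length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(quality)]


def run_length_decode(runs: List[Tuple[str, int]]) -> str:
    """Inverse of run_length_encode, so the scheme is lossless."""
    return "".join(symbol * count for symbol, count in runs)


# Illustrative input only; not data from the paper.
sample = ["@read1 lane=1\n", "ACGTN\n", "+\n", "IIII#\n"]
records = list(parse_fastq(sample))
runs = run_length_encode(records[0].quality)
```

Run-length coding pays off when neighbouring bases share the same quality value, which is presumably why the system selects among several variants per data set rather than fixing one.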
Keywords
fastq file, fastq format, quality data, quality information, genomic data, no-reference compression, best performance, compression system, quality value, best compression, proposed system, text data, bzip2 text compression utility, text analysis, data compression, statistical analysis, next generation sequencing, dna, bioinformatics, markov processes, genomics, entropy coding, fastq