Coco: An Application To Store High-Throughput Sequencing Data In Compact Text And Binary File Formats

2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (2015)

Abstract
The storage, manipulation, and especially internet transfer of the large amounts of data produced by High-Throughput Sequencing (HTS) instruments present major obstacles to utilizing the full potential of this promising technology. The current standard is based on storing all data, which are produced in text formats (FASTQ and FASTA) and often stored in binary formats (SRA and BAM). To date, significant effort has been devoted to efficiently compressing these cumbersome sequencing data sets in their existing formats. However, given the substantial improvements in the quality of HTS data, we believe that if one can afford to exclude low-quality data and read headers, new, much more compact data formats can be used to reduce the size of HTS data files by at least two orders of magnitude. Here we present several examples of file formats specifically designed to store only high-quality sequencing reads in space-efficient text and binary form. The basic principles used to decrease file size include storage of only one copy of a sequence when reads are present in multiple copies; alphabetical sorting of all reads and storage of only the differences (suffixes) between consecutive reads; and optimization of the number of bits/bytes required to store the information in binary formats. While file size reduction depends on properties of the sequencing data, the size of the resulting files can be as low as 0.1%-5% of the original FASTQ, SRA, or BAM files. The greatest advantage of the proposed formats, however, is their time and memory efficiency. Converting reads from FASTQ/FASTA files into the proposed formats is up to 10 times faster than with gzip and SRA.
The conversion of files in the proposed formats back to FASTA is limited only by the time required to read the file from the hard drive. We present the source code of the C++ object (class) implemented to store, sort, and perform I/O operations with equal-length subsequences, and two executable Linux command-line applications (CoCo and CoCo-Plus) able to work with all types of sequencing data, including paired-end and flexible-size reads. Source code, Linux executables, and a user manual can be downloaded from http://bgl.utmb.edu/publications/34-cocoplus.
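
The three principles named in the abstract (one copy per distinct read, alphabetical sorting with storage of only the differing suffix, and a reduced number of bits per base) can be illustrated with a minimal Python sketch. This is not the CoCo file format or its C++ class: the function names, the record layout `(shared_prefix_length, suffix, copy_count)`, and the 2-bits-per-base packing are all assumptions chosen for illustration, and the packing assumes reads contain only A, C, G, and T.

```python
from collections import Counter

# Hypothetical 2-bit code for the four bases (an assumption, not CoCo's layout).
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_reads(reads):
    """Collapse duplicate reads, sort them, and keep only the differing suffix.

    Returns a list of (shared_prefix_length, suffix, copy_count) tuples.
    """
    counts = Counter(reads)  # one entry per distinct read, with its copy count
    records = []
    previous = ""
    for read in sorted(counts):  # alphabetical order groups similar reads
        shared = 0
        limit = min(len(previous), len(read))
        while shared < limit and previous[shared] == read[shared]:
            shared += 1
        records.append((shared, read[shared:], counts[read]))
        previous = read
    return records


def decode_reads(records):
    """Rebuild the sorted (read, copy_count) pairs from the records."""
    decoded = []
    previous = ""
    for shared, suffix, count in records:
        read = previous[:shared] + suffix
        decoded.append((read, count))
        previous = read
    return decoded


def pack_2bit(sequence):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    value = 0
    for base in sequence:
        value = (value << 2) | BASE_TO_BITS[base]
    return value.to_bytes((len(sequence) + 3) // 4, "big")


def unpack_2bit(data, length):
    """Recover a sequence of the given length from pack_2bit output."""
    value = int.from_bytes(data, "big")
    return "".join(
        BITS_TO_BASE[(value >> (2 * (length - 1 - i))) & 3] for i in range(length)
    )


if __name__ == "__main__":
    reads = ["ACGTAC", "ACGTTT", "ACGTAC", "ACGAAA"]
    records = encode_reads(reads)
    print(records)
    print(decode_reads(records))
    print(unpack_2bit(pack_2bit("ACGTAC"), 6))
```

In this sketch, the four example reads collapse to three records, and only the first record carries a full-length sequence; the others store the length of the prefix shared with the preceding read plus the remaining suffix. The unpacking step needs the read length supplied separately, which is straightforward when all reads have equal length.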
Keywords
HTS Data, File Formats, HTS File Converter