Information-Theoretic Analysis of Haplotype Assembly.
IEEE Trans. Information Theory (2017)
Abstract
This paper studies the haplotype assembly problem from an information-theoretic perspective. In the human genome, a haplotype is a sequence of nucleotide bases on a chromosome that differ from the bases in the corresponding positions on the other chromosome in a homologous pair. Haplotype sequences can conveniently be represented by binary strings, which enables us to transform the bioinformatics problem of haplotype assembly into an equivalent information-theoretic problem. Information about the order of bases in a genome is readily inferred using short reads provided by high-throughput DNA sequencing technologies. Performing haplotype assembly is challenging due to the limited lengths of the reads and the presence of sequencing errors. In this paper, the recovery of the target pair of haplotype sequences using short reads is transformed into an equivalent joint source-channel coding problem. Two binary messages, representing haplotypes and chromosome memberships of reads, are encoded and transmitted over a channel with erasures and errors, where the channel model reflects salient features of high-throughput sequencing. The focus of this paper is on determining the required number of reads for reliable haplotype reconstruction. For the error-free reading case, erasure decoding is shown to be one of the optimal algorithms enabling reliable haplotype assembly. For the erroneous reading case, spectral partitioning is proved to be an efficient algorithm with order-wise optimal bounds.
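To make the error-free reading case above concrete, the following is a toy sketch in Python, not the algorithm as specified in the paper. It assumes a simplified model: the haplotype is a binary list `h`, the second haplotype is its bitwise complement, each read carries a hidden membership bit, and positions a read does not cover are erasures. The names `simulate_reads`, `erasure_decode`, and `ERASED` are hypothetical and introduced only for illustration.

```python
import random

ERASED = None  # marks a position that a read does not cover


def simulate_reads(h, num_reads, cover, rng):
    """Draw error-free reads under the assumed toy model."""
    n = len(h)
    reads = []
    for _ in range(num_reads):
        c = rng.randint(0, 1)                  # hidden chromosome membership
        row = [ERASED] * n
        for j in rng.sample(range(n), cover):  # positions covered by this read
            row[j] = h[j] ^ c                  # membership 1 sees the complement
        reads.append(row)
    return reads


def erasure_decode(reads, n):
    """Fill in the haplotype by chaining reads that overlap known positions.

    The result is determined only up to a global complement, and positions
    never reached through overlaps stay ERASED.
    """
    h_hat = [ERASED] * n
    if not reads:
        return h_hat
    # Seed with the first read, fixing its membership to 0 by convention.
    for j, v in enumerate(reads[0]):
        if v is not ERASED:
            h_hat[j] = v
    pending = list(range(1, len(reads)))
    progress = True
    while pending and progress:
        progress = False
        still_pending = []
        for i in pending:
            row = reads[i]
            # Find a covered position whose haplotype bit is already known.
            anchor = next(
                (j for j, v in enumerate(row)
                 if v is not ERASED and h_hat[j] is not ERASED),
                None,
            )
            if anchor is None:
                still_pending.append(i)
                continue
            c = row[anchor] ^ h_hat[anchor]    # membership relative to the seed
            for j, v in enumerate(row):
                if v is not ERASED:
                    h_hat[j] = v ^ c
            progress = True
        pending = still_pending
    return h_hat


# Hand-built example: h = [0, 1, 1, 0], middle read comes from the complement.
example_reads = [[0, 1, None, None], [None, 0, 0, None], [None, None, 1, 0]]
decoded = erasure_decode(example_reads, 4)  # [0, 1, 1, 0]
```

In this sketch, full recovery requires the reads to link all positions through overlaps, which gives an informal sense of why the number of reads matters for reliable reconstruction.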
Keywords
Biological cells, Sequential analysis, Decoding, Bioinformatics, Genomics, Partitioning algorithms, DNA