Rethinking Learning-Based Method for Lossless Genome Compression

Han Yang, Fei Gu, Jieping Ye

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

Abstract
Lossless genome compression plays a vital role in genomic analysis. The main challenges stem from the long range of genome sequences and the high frequency of genome variants. However, almost all existing learning-based methods rely on local DNA fragments within small windows, which prevents them from capturing deep regularities of the genome sequence and may lead to unsatisfactory performance because of the large variability across individuals. In this paper, we redesign the deep learning model and propose a simple yet effective position-driven transformer for genome data compression. Our approach, called CompressBERT, is based on two core designs. First, we introduce the global position of the complete genome sequence into our deep model, which makes the genome sequence distinguishable at the base level. Second, we pre-train our deep model by identifying single-nucleotide polymorphism (SNP) genome variants, which further facilitates the genome compression task. The proposed CompressBERT is validated on three datasets from different species. Experimental results show that our approach outperforms state-of-the-art methods.
Keywords
lossless genome compression, transformer, genome variants, DNA
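
The abstract contrasts CompressBERT with methods that predict from local DNA fragments in small windows, but it does not say how predictions are turned into compressed bits. As a rough illustration only, and not the authors' method, the sketch below assumes the common setup in which a predictive model assigns a probability to each next base and an entropy coder (such as an arithmetic coder) spends about -log2(p) bits on it. The function name `ideal_code_length_bits`, the `window` parameter, and the add-one smoothing are all hypothetical choices made for this example; a simple adaptive count table stands in for a learned model.

```python
import math
from collections import defaultdict

BASES = "ACGT"  # assumes the input contains only these four symbols


def ideal_code_length_bits(seq: str, window: int) -> float:
    """Estimate the lossless code length of `seq` under a local-window model.

    Hypothetical stand-in for a learned predictor: an adaptive table counts
    which base followed each context of up to `window` preceding bases.
    The counts are updated only after a base is "coded", so a decoder could
    rebuild the same probabilities and recover the sequence exactly.
    """
    # Every context starts with a count of 1 per base (add-one smoothing),
    # so no base ever receives probability zero.
    counts = defaultdict(lambda: [1] * len(BASES))
    total_bits = 0.0
    for i, base in enumerate(seq):
        context = seq[max(0, i - window):i]  # local window only
        ctx_counts = counts[context]
        idx = BASES.index(base)
        p = ctx_counts[idx] / sum(ctx_counts)
        total_bits += -math.log2(p)  # ideal cost of coding this base
        ctx_counts[idx] += 1         # adapt after coding
    return total_bits


if __name__ == "__main__":
    sequence = "ACGT" * 200
    for w in (0, 2):
        bits = ideal_code_length_bits(sequence, w)
        print(f"window={w}: {bits / len(sequence):.3f} bits per base")
```

A model of this kind sees only the `window` preceding bases, which is the limitation the abstract attributes to existing learning-based methods; the global position input and SNP-based pre-training it describes are not modeled here.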