Kmer-Node2Vec: a Fast and Efficient Method for Kmer Embedding from the Kmer Co-occurrence Graph, with Applications to DNA Sequences

Zhaochong Yu, Zihang Yang, Qingyang Lan, Yuchuan Wang, Feijuan Huang, Yuanzhe Cai

2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), 2023

Abstract
Learning low-dimensional continuous vector representations of the short k-mers obtained by splitting long DNA sequences is key to DNA sequence modeling, which supports many bioinformatics tasks such as DNA sequence retrieval and classification. DNA2Vec is the most widely used method for DNA sequence embedding. However, it scales poorly to large data sets because training its k-mer embeddings takes an extremely long time. In this paper, we propose a novel, efficient graph-based k-mer embedding method, named Kmer-Node2Vec, to address this problem. Our method converts a large DNA corpus into a single k-mer co-occurrence graph and uses random walks on that graph to extract k-mer relations, from which it quickly learns high-quality k-mer embeddings. Extensive experiments show that our method trains 29 times faster than DNA2Vec on a 4 GB data set and is on par with DNA2Vec in task-specific accuracy on sequence retrieval and classification.
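
The pipeline described in the abstract (corpus to k-mer co-occurrence graph, then random walks over the graph) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the stride-1 k-mer split, the use of adjacent k-mers as "co-occurrence", and the simple weight-proportional first-order walk (rather than a node2vec-style biased walk) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def kmers(seq, k):
    """Split a sequence into overlapping k-mers (stride 1, an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_cooccurrence_graph(sequences, k):
    """Build a weighted directed graph over k-mers.

    The weight of edge (u, v) counts how often k-mer v immediately
    follows k-mer u anywhere in the corpus (hypothetical definition
    of co-occurrence chosen for this sketch).
    """
    graph = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        tokens = kmers(seq, k)
        for u, v in zip(tokens, tokens[1:]):
            graph[u][v] += 1
    return graph


def random_walk(graph, start, length, rng):
    """Walk up to `length` nodes, picking each next node with probability
    proportional to the edge weight. Stops early at a node with no
    outgoing edges."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph.get(walk[-1])
        if not neighbors:
            break
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


def generate_walks(graph, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks from every node that has outgoing edges."""
    rng = random.Random(seed)
    walks = []
    for node in list(graph):
        for _ in range(walks_per_node):
            walks.append(random_walk(graph, node, length, rng))
    return walks


if __name__ == "__main__":
    corpus = ["ACGTACGT"]  # toy corpus, not data from the paper
    g = build_cooccurrence_graph(corpus, k=3)
    for walk in generate_walks(g, walks_per_node=1, length=5):
        print(" ".join(walk))
```

Each walk is a sequence of k-mers that can be treated like a sentence and passed to a skip-gram model (for example a word2vec implementation) to obtain the embeddings. One plausible reason for a speedup with this kind of design is that the graph has at most 4^k nodes regardless of corpus size, so the number of walks need not grow with the amount of raw sequence; the abstract itself does not spell out the mechanism.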