Enhancing the prediction of protein coding regions in biological sequence via a deep learning framework with hybrid encoding

bioRxiv (2022)

Abstract
Predicting protein coding regions is an important but often overlooked subtask of problems such as the prediction of complete gene structure and of coding/noncoding RNA. Many machine learning methods have been proposed for this problem; they first encode a biological sequence into numerical values and then feed those values into a classifier for the final prediction. However, the encoding scheme directly influences the classifier's ability to capture coding features, and how to choose a proper encoding scheme remains unclear. Recently, we proposed a method for predicting protein coding regions in transcript sequences based on a bidirectional recurrent neural network with non-overlapping 3-mer features. It achieved considerable improvement over existing methods, but there is still much room to improve its performance. First, the 3-mer feature, which counts the occurrence frequency of trinucleotides in a biological sequence, reflects only local sequence order information among immediately adjacent nucleotides and loses almost all global sequence order information. Second, k-mer features with k larger than three (e.g., hexamers) may also contain useful information. Based on these two points, we present a deep learning framework with hybrid encoding for predicting protein coding regions in biological sequences, which effectively exploits global sequence order information, non-overlapping gapped k-mer (gkm) features, and statistical dependencies among coding labels. 3-fold cross-validation tests on human and mouse biological sequences demonstrate that our proposed method significantly outperforms existing state-of-the-art methods. (c) 2022 Published by Elsevier Inc.
Keywords
Deep learning, Bioinformatics, Protein coding regions prediction, Hybrid encoding, Label dependency
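
The encodings named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `frame` offset parameter, and the use of `-` as a gap symbol are assumptions made here for illustration, and the abstract does not specify how the gapped k-mers are actually constructed. The sketch shows how a sequence might be split into non-overlapping 3-mers, how their occurrence frequencies might be counted, and one common way of generating gapped variants of a word.

```python
from collections import Counter
from itertools import combinations


def nonoverlapping_kmers(seq, k=3, frame=0):
    """Split seq into consecutive, non-overlapping k-mers.

    `frame` (hypothetical parameter) is the offset of the first k-mer.
    Trailing nucleotides that do not fill a whole k-mer are dropped.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(frame, len(seq) - k + 1, k)]


def kmer_frequencies(seq, k=3, frame=0):
    """Relative occurrence frequency of each non-overlapping k-mer."""
    kmers = nonoverlapping_kmers(seq, k, frame)
    counts = Counter(kmers)
    total = len(kmers)
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def gapped_kmers(word, informative):
    """All gapped variants of `word` that keep `informative` positions.

    The remaining positions are replaced by '-' (an assumed gap symbol).
    """
    variants = []
    for keep in combinations(range(len(word)), informative):
        variants.append(
            "".join(ch if i in keep else "-" for i, ch in enumerate(word))
        )
    return variants


if __name__ == "__main__":
    example = "ATGGCCTAA"  # toy sequence, not taken from the paper
    print(nonoverlapping_kmers(example))
    print(kmer_frequencies(example))
    print(gapped_kmers("ATGGCC", 4))
```

Because frequency counts discard the position of each 3-mer, a vector produced by `kmer_frequencies` is the same for any reordering of the codons, which is the loss of global sequence order information that the abstract describes.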