TGAAL: Combining Transformer-based GAN and active learning to identify the coding potential of sORFs in plant lncRNAs.

2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (2023)

Abstract
Some small open reading frames (sORFs) in plant long non-coding RNAs (lncRNAs) can encode small peptides, which play key roles in the growth and development of organisms. It is therefore important to identify the coding potential of sORFs in plant lncRNAs. However, existing methods often ignore the differences in length distribution between coding sORFs (csORFs) and non-coding sORFs (non-csORFs), which may lead to incorrect identification of csORFs. To address this issue, we propose a novel method for identifying the coding potential of sORFs in plant lncRNAs, named Transformer Generative Adversarial Active Learning (TGAAL), which combines a Transformer-based Generative Adversarial Network (TGAN) with active learning based on a KL-topk sampling strategy. TGAN can generate sORF sequences in a specific length interval that belong to the same class as the input sORFs. Meanwhile, active learning based on the KL-topk sampling strategy selects high-confidence samples for data augmentation. Five-fold cross-validation shows that the KL-topk sampling strategy significantly improves prediction performance compared with commonly adopted sampling strategies. The experimental results show that TGAAL significantly outperforms existing methods in identifying the coding potential of sORFs in Arabidopsis thaliana, reaching unweighted average recall of 0.7761, 0.7906, and 0.7529 in the three sORF length intervals, respectively.
Keywords
Transformer, Generative Adversarial Network, Active Learning, LncRNA, sORFs
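
The abstract reports results as unweighted average recall (UAR). As a rough illustration only, the sketch below computes UAR under its usual definition: the mean of per-class recalls, with every class weighted equally regardless of size. The abstract does not describe how the authors computed it, so the function name `unweighted_average_recall` and the labels are hypothetical, made-up values and not data from the paper.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, each class counted equally.

    Assumes the standard definition of UAR (macro-averaged recall);
    the abstract itself does not spell out the formula.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    # Recall per class, then a plain (unweighted) mean over classes.
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced labels: 1 = csORF, 0 = non-csORF (made-up data).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.2f}, accuracy = {accuracy:.2f}")
```

In this toy case the minority class has recall 0.5 and the majority class 1.0, so UAR is 0.75 while plain accuracy is 0.90. That gap is one reason a class-balanced metric may be preferred when coding and non-coding sORFs are unevenly represented.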