FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics

ChenRui Duan, Zelin Zang, Yongjie Xu, Hang He, Zihan Liu, Zijia Song, Ju-Sheng Zheng, Stan Z. Li

CoRR (2024)

Abstract
Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments such as oceans and soils, and significantly impact human health and ecological functions. However, current research relies on k-mer representations, which limits the capture of structurally relevant gene contexts. To address this limitation and further our understanding of the complex relationships between metagenomic sequences and their functions, we introduce a protein-based gene representation as a context-aware and structure-relevant tokenizer. Our approach includes Masked Gene Modeling (MGM) for gene group-level pre-training, which provides insights into inter-gene contextual information, and Triple Enhanced Metagenomic Contrastive Learning (TEM-CL) for gene-level pre-training, which models gene sequence-function relationships. MGM and TEM-CL constitute our novel metagenomic language model, FGBERT, pre-trained on 100 million metagenomic sequences. We demonstrate the superiority of the proposed FGBERT on eight datasets.
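For readers unfamiliar with the k-mer representation that the abstract contrasts against, the following is a minimal Python sketch of overlapping k-mer tokenization of a nucleotide sequence. It is an illustration only, not the authors' code: the function name `kmer_tokenize` and the parameters `k` and `stride` are hypothetical and do not come from the paper.

```python
def kmer_tokenize(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a nucleotide sequence into k-mers (hypothetical helper, not from the paper).

    Each token is a fixed-length window of k characters, so a token carries
    no information about which gene it belongs to or what that gene encodes.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive integers")
    sequence = sequence.upper()
    # Slide a window of width k across the sequence, advancing by `stride`.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


if __name__ == "__main__":
    # Overlapping 3-mers of a short toy sequence.
    print(kmer_tokenize("ATGCGTAC", k=3))
    # Non-overlapping 3-mers (stride equal to k).
    print(kmer_tokenize("ATGCGTAC", k=3, stride=3))
```

Because every token is a fixed-width window, gene boundaries and gene-level context are not visible to the tokenizer, which is the limitation that the protein-based gene representation described above is meant to address.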