GENA-LM: A Family of Open-Source Foundational Models for Long DNA Sequences

Veniamin Fishman, Yuri Kuratov, Maxim Petrov, Aleksei Shmelev, Denis Shepelin, Nikolay Chekanov, Olga L. Kardymon, Mikhail Burtsev

bioRxiv (Cold Spring Harbor Laboratory) (2023)

Abstract
The field of genomics has seen substantial advancements through the application of artificial intelligence (AI), with machine learning revealing the potential to interpret genomic sequences without requiring an exhaustive experimental analysis of all the intricate and interconnected molecular processes involved in DNA functioning. However, precise decoding of genomic sequences demands the comprehension of rich contextual information spread over thousands of nucleotides. Presently, only a few architectures can process such extensive inputs, and they require exceptional computational resources. To address this need, we introduce GENA-LM, a suite of transformer-based foundational DNA language models capable of handling input lengths of up to 36 thousand base pairs. We offer pre-trained versions of GENA-LM and demonstrate that they can be fine-tuned to address complex biological questions with modest computational requirements. We also illustrate diverse applications of GENA-LM to various downstream genomic tasks, showing performance that matches or exceeds that of prior models, whether task-specific or universal. All models are publicly accessible on GitHub (https://github.com/AIRI-Institute/GENA_LM) and as pre-trained models with the gena-lm- prefix on HuggingFace (https://huggingface.co/AIRI-Institute).
Contacts: minja-f@ya.ru, kardymon@airi.net, am@lims.ac.uk
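To make the input-length figure concrete, here is a minimal sketch of how a sequence longer than 36 thousand base pairs might be cut into windows that fit within that bound. This is not taken from the paper: the function name `split_into_windows`, the `overlap` parameter, and the windowing strategy itself are hypothetical, and the sketch counts raw base pairs only, so it does not model any model-specific tokenizer or token limit.

```python
from typing import Iterator, Tuple

# Upper input length quoted in the abstract, in base pairs.
MAX_BP = 36_000


def split_into_windows(
    seq: str, max_bp: int = MAX_BP, overlap: int = 0
) -> Iterator[Tuple[int, str]]:
    """Yield (start, window) pairs; each window is at most max_bp long.

    Hypothetical helper for illustration only. Consecutive windows share
    `overlap` base pairs so that context at the boundaries is not lost.
    """
    if max_bp <= 0:
        raise ValueError("max_bp must be positive")
    if not 0 <= overlap < max_bp:
        raise ValueError("overlap must satisfy 0 <= overlap < max_bp")
    step = max_bp - overlap
    for start in range(0, len(seq), step):
        yield start, seq[start:start + max_bp]
        # Stop once the current window already reaches the end of the sequence.
        if start + max_bp >= len(seq):
            break


if __name__ == "__main__":
    toy = "ACGT" * 20_000  # 80,000 bp toy sequence
    for start, window in split_into_windows(toy, overlap=6_000):
        print(start, len(window))
```

Overlapping windows are one common way to avoid losing context at window boundaries; whether any overlap is appropriate depends on the downstream task.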
Keywords
long DNA sequences, DNA sequences, open-source