genomicBERT and data-free deep-learning model evaluation

Tyrone Chen, Navya Tyagi, Sarthak Chauhan, Anton Y Peleg, Sonika Tyagi

bioRxiv (2023)

Abstract
The emerging field of Genome-NLP (Natural Language Processing) aims to analyse biological sequence data using machine learning (ML), offering significant advancements in data-driven diagnostics. Three key challenges exist in Genome-NLP. First, long biomolecular sequences require "tokenisation" into smaller subunits, which is non-trivial since many biological "words" remain unknown. Second, ML methods are highly nuanced, reducing interoperability and usability. Third, comparing models and reproducing results are difficult due to the large volume and poor quality of biological data. To tackle these challenges, we developed the first automated Genome-NLP workflow that integrates feature engineering and ML techniques. The workflow is designed to be species- and sequence-agnostic. In this workflow: (a) We introduce a new transformer-based model for genomes called genomicBERT, which empirically tokenises sequences while retaining biological context. This approach minimises manual preprocessing, reduces vocabulary sizes, and effectively handles out-of-vocabulary "words". (b) We enable the comparison of ML model performance even in the absence of raw data. To facilitate widespread adoption and collaboration, we have made genomicBERT available as part of the publicly accessible conda package called genomeNLP. We have successfully demonstrated the application of genomeNLP in multiple case studies, showcasing its effectiveness in the field of Genome-NLP.

Competing Interest Statement: The authors have declared no competing interest.
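To make the tokenisation challenge in the abstract concrete, the following is a minimal sketch of fixed-length k-mer tokenisation, a common baseline in sequence modelling. It is not the genomicBERT method and does not use the genomeNLP package; the function names `kmer_tokenise` and `kmer_vocabulary` and the example sequence are hypothetical and chosen only for illustration. The sketch shows that a fixed k-mer vocabulary over the alphabet ACGT has 4^k entries and that any token containing an ambiguous base such as N falls outside it.

```python
from itertools import product


def kmer_tokenise(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a sequence into k-mers using a sliding window of width k."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def kmer_vocabulary(k: int, alphabet: str = "ACGT") -> set[str]:
    """Enumerate every possible k-mer over the alphabet (4**k entries for DNA)."""
    return {"".join(letters) for letters in product(alphabet, repeat=k)}


# Hypothetical example sequence containing one ambiguous base (N).
tokens = kmer_tokenise("ACGTNACG", k=3)
vocab = kmer_vocabulary(3)

# Tokens absent from the fixed vocabulary are "out-of-vocabulary".
out_of_vocabulary = [t for t in tokens if t not in vocab]

print(tokens)
print(len(vocab))
print(out_of_vocabulary)
```

Because the vocabulary grows exponentially with k, a fixed-length scheme forces a trade-off between token length and vocabulary size, which is the kind of limitation the abstract's empirical tokenisation is described as addressing.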
Keywords
deep-learning, model, evaluation, data-free