Rummagene: Mining Gene Sets from Supporting Materials of PMC Publications

bioRxiv (Cold Spring Harbor Laboratory) (2023)

Abstract
Every week, thousands of biomedical research papers are published, a portion of which contain supporting tables with data about genes, transcripts, variants, and proteins. For example, supporting tables may contain differentially expressed genes and proteins from transcriptomics and proteomics assays, targets of transcription factors from ChIP-seq experiments, hits from genome-wide CRISPR screens, or genes identified as harboring mutations in GWAS. Because these gene sets are commonly buried in the supplemental tables of research publications, they are not widely available for search and reuse. Rummagene, available from https://rummagene.com, is a web server application that provides access to hundreds of thousands of human and mouse gene sets extracted from supporting materials of publications listed on PubMed Central (PMC). To create Rummagene, we first developed a softbot that extracts human and mouse gene sets from supporting tables of PMC publications. So far, the softbot has scanned 5,448,589 PMC articles to find 121,237 articles that contain 642,389 gene sets. These gene sets are served for enrichment analysis, free-text search, and table title search. Users of Rummagene can submit their own gene sets to find matching gene sets ranked by their overlap with the input gene set. In addition to providing the extracted gene sets for search, we investigated the massive corpus of these gene sets for statistical patterns. We show that the number of gene sets reported in publications is rapidly increasing, and that these include both short sets that are strongly enriched for highly studied genes and long sets from omics profiling. We also demonstrate that the gene sets in Rummagene can be used for transcription factor and kinase enrichment analyses, and for gene function predictions. By combining gene set similarity with abstract similarity, Rummagene can be used to find surprising relationships between unexpected biological processes, concepts, and named entities. Finally, by overlaying the Rummagene gene set space with the Enrichr gene set space, we can discover areas of biological and biomedical knowledge unique to each resource.

Competing Interest Statement: The authors have declared no competing interest.
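
The abstract describes ranking stored gene sets by their overlap with a user-submitted gene set. The sketch below illustrates one common way such a ranking can be done, using a hypergeometric tail probability as the score. It is only an illustration: the abstract does not state which statistic Rummagene uses, and the function names, the toy library, and the background size of 20,000 genes are all assumptions made for this example.

```python
from math import comb


def hypergeom_tail(k, K, n, N):
    """P(X >= k): chance of k or more shared genes when a query of size n
    is drawn at random from N background genes, K of which are in the set."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def rank_by_overlap(query, library, background_size=20000):
    """Return (set name, overlap size, p-value) tuples, most significant first.

    background_size is an assumed gene universe, not a value from the paper.
    """
    query = set(query)
    results = []
    for name, genes in library.items():
        genes = set(genes)
        overlap = query & genes
        if not overlap:
            continue  # sets with no shared genes are not reported
        p = hypergeom_tail(len(overlap), len(genes), len(query), background_size)
        results.append((name, len(overlap), p))
    # Smaller p-value first; ties broken by larger overlap
    return sorted(results, key=lambda r: (r[2], -r[1]))


# Hypothetical gene sets standing in for tables extracted from publications
library = {
    "table_A": {"TP53", "MDM2", "CDKN1A", "BAX"},
    "table_B": {"GAPDH", "ACTB"},
    "table_C": {"TP53", "BRCA1", "ATM", "CHEK2", "MDM2", "EGFR"},
}
query = {"TP53", "MDM2", "CDKN1A"}

hits = rank_by_overlap(query, library)
for name, size, p in hits:
    print(f"{name}\toverlap={size}\tp={p:.3g}")
```

In this toy example the set sharing all three query genes ranks first, and the set with no shared genes is omitted. A production system would typically also correct the p-values for multiple testing across the full library.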
Keywords
mining rummagene sets