Maintaining a microbial genome & metagenome data analysis system in an academic setting

SSDBM(2014)

引用 6|浏览34
暂无评分
摘要
The Integrated Microbial Genomes (IMG) system integrates microbial community aggregate genomes (metagenomes) with genomes from all domains of life. IMG provides tools for analyzing and reviewing the structural and functional annotations of metagenomes and genomes in a comparative context. At the core of the IMG system is a data warehouse that contains genome and metagenome datasets provided by scientific users, as well as public bacterial, archaeal, eukaryotic, and viral genomes from the US National Center for Biotechnology Information genomic archive and a rich set of engineered, environmental and host associated metagenomes. Genomes and metagenome datasets are processed using IMG's microbial genome and metagenome sequence data processing pipelines and then are integrated into the data warehouse using IMG's data integration toolkit. Microbial genome and metagenome application specific user interfaces provide access to different subsets of IMG's data and analysis toolkits. Genome and metagenome analysis is a gene centric iterative process that involves a sequence (composition) of data exploration and comparative analysis operations, with individual operations expected to have rapid response time. From its first release in 2005, IMG has grown from an initial content of about 300 genomes with a total of 2 million genes, to 22,578 bacterial, archaeal, eukaryotic and viral genomes, and 4,188 metagenome samples, with about 24.6 billion genes as of May 1st, 2014. IMG's database architecture is continuously revised in order to cope with the rapid increase in the number and size of the genome and metagenome datasets, maintain good query performance, and accommodate new data types. We present in this paper IMG's new database architecture developed over the past three years in the context of limited financial, engineering and data management resources customary to academic database systems. We discuss the alternative commercial and open source database management systems we considered and experimented with and describe the hybrid architecture we devised for sustaining IMG's rapid growth.
更多
查看译文
关键词
data warehouse,database applications,genome data analysis system,management,performance,reliability
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要