Shock: Active Storage for Multicloud Streaming Data Analysis

2015 IEEE/ACM 2nd International Symposium on Big Data Computing (BDC), 2015

Abstract
Access to data plays a major role in designing and performing efficient data computation and analyses in a distributed environment. Existing approaches access data via a variety of methods, each with benefits and drawbacks that depend on the use case. Our original use case was the computational analysis of environmental sequence data, or metagenomics. Other workflows often reduce the dataset size dramatically within the first few processing steps, owing to biologically motivated data compression. Metagenomic data, by contrast, compresses poorly, so metagenomic workflows add to the size of the dataset along the processing pipeline. Thus, wide-area, high-throughput access to the data is essential. To address this problem, we developed Shock, a data store for files, their associated metadata, and indexes that allow Shock to provide different views into the data. Shock comprises three major components: a web service that provides a RESTful API, backend data storage for files, and storage for object metadata. Shock has proven to be a stable data store for MG-RAST, an application that served over 40,000 users in 2014 on a server that houses more than 3 million data objects. Moreover, Shock provides both subselection and high-performance file transfer capabilities that serve most use cases.
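
The idea of a store that keeps files, metadata, and indexes together, and uses the indexes to return only part of a file, can be illustrated with a minimal in-memory sketch. This is not Shock's implementation or its REST API: the names `Node`, `build_line_index`, and `subselect` are hypothetical, and a simple line index stands in for whatever index types the real system supports.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One stored object (hypothetical model): file bytes, metadata, indexes."""
    data: bytes
    attributes: dict = field(default_factory=dict)
    # index name -> list of (offset, length) pairs into `data`
    indexes: dict = field(default_factory=dict)


def build_line_index(data: bytes) -> list:
    """Record the (offset, length) of every line, line ending included."""
    spans, offset = [], 0
    for line in data.splitlines(keepends=True):
        spans.append((offset, len(line)))
        offset += len(line)
    return spans


def subselect(node: Node, index_name: str, parts: list) -> bytes:
    """Return only the requested parts (0-based) of the file via an index."""
    spans = node.indexes[index_name]
    return b"".join(
        node.data[start:start + length]
        for start, length in (spans[i] for i in parts)
    )


# A tiny FASTA-like file with two sequence records (illustrative data only).
fasta = b">read1\nACGT\n>read2\nGGCC\n"
node = Node(data=fasta, attributes={"project": "demo"})
node.indexes["line"] = build_line_index(node.data)

# Fetch only the second record (lines 2 and 3) instead of the whole file.
selected = subselect(node, "line", [2, 3])
```

Because the index stores byte offsets, a client that needs one slice of a large file transfers only that slice, which is the kind of benefit subselection is meant to give wide-area workflows.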
Keywords
bioinformatics, metagenomics, active object store, distributed wide-area computing