Dela - Sharing Large Datasets Between Hadoop Clusters

2017 IEEE 37TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS 2017)(2017)

Abstract
Big data has, in recent years, revolutionised an ever-growing number of fields, from machine learning to climate science to genomics. The current state-of-the-art for storing large datasets is either object stores or distributed filesystems, with Hadoop being the dominant open-source platform for managing 'Big Data'. Existing large-scale storage platforms, however, lack support for the efficient sharing of large datasets over the Internet. Those systems that are widely used for the dissemination of large files, like BitTorrent, need to be adapted to handle challenges such as network links with both high latency and high bandwidth, and scalable storage backends that are optimised for streaming rather than random access.

In this paper, we introduce Dela, a peer-to-peer data-sharing service integrated into the Hops Hadoop platform that provides an end-to-end solution for dataset sharing. Dela is designed for large-scale storage backends and for data transfers that are non-intrusive to existing TCP network traffic while providing higher network throughput than TCP on high-latency, high-bandwidth network links, such as transatlantic links. Dela provides a pluggable storage layer, implementing two alternative ways for clients to access shared data: stream processing of data as it arrives with Kafka, and traditional offline access to data using the Hadoop Distributed Filesystem. Dela is the first step for the Hadoop platform towards creating an open dataset ecosystem that supports user-friendly publishing, searching, and downloading of large datasets.
Keywords
dataset sharing,Hadoop,peer-to-peer,BitTorrent,Big Data