Distribution-Aware Stream Partitioning for Distributed Stream Processing Systems

BeyondMR@SIGMOD(2018)

引用 14|浏览86
暂无评分
摘要
The performance of modern distributed stream processing systems is largely dependent on balanced distribution of the workload across cluster. Input streams with large, skewed domains pose challenges to these systems, especially for stateful applications. Key splitting, where state of a single key is partially maintained across multiple workers, is a simple yet effective technique to reduce load imbalance in such systems. However it comes with the cost of increased memory overhead which has been neglected by existing techniques so far. In this paper we present a novel stream partitioning algorithm for intra-operator parallelism which adapts to the underlying stream distribution in an online manner and provides near-optimal load imbalance with minimal memory overhead. Our technique relies on explicitly routing frequent items using a greedy heuristic which considers both load imbalance and space requirements. It uses hashing for in frequent items to keep the size of routing table small. Through extensive experimentation with real and synthetic datasets, we show that our proposed solution consistently provides near-optimal load imbalance and memory footprint over variety of distributions. Our experiments on Apache Storm show up to an order of magnitude increase in overall throughput and up to 80% space savings over state-of-the-art stream partitioning techniques.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要