DistStream: An Order Aware Distributed Framework for Online Offline Stream Clustering Algorithms

40th IEEE International Conference on Distributed Computing Systems (ICDCS 2020)(2020)

引用 2|浏览26
暂无评分
摘要
Stream clustering is an important data mining technique to capture the evolving patterns in real-time data streams. Today’s data streams, eg, IoT events and Web clicks, are usually high-speed and contain dynamically-changing patterns. Existing stream clustering algorithms usually follow an onlineoffline paradigm with a one-record-at-a-time update model, which was designed for running in a single machine. These stream clustering algorithms, with this sequential update model, cannot be efficiently parallelized and fail to deliver the required high throughput for stream clustering. In this paper, we present DistStream, a distributed framework that can effectively scale out online-offline stream clustering algorithms. To parallelize these algorithms for high throughput, we develop a mini-batch update model with efficient parallelization approaches. To maintain high clustering quality, DistStream’s mini-batch update model preserves the update order in all the computation steps during parallel execution, which can reflect the recent changes for dynamically-changing streaming data. We implement DistStream atop Spark Streaming, as well as four representative stream clustering algorithms based on DistStream. Our evaluation on three real-world datasets shows that DistStream-based stream clustering algorithms can achieve sublinear throughput gain and comparable (99%) clustering quality with their single-machine counterparts.
更多
查看译文
关键词
Stream clustering,data stream,scalability
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要