Keyed Watermarks: A Fine-grained Tracking of Event-time in Apache Flink

Tawfik Yasser, Tamer Arafa, Mohamed El-Helw, Ahmed Awad

2023 5th Novel Intelligent and Leading Emerging Sciences Conference (NILES) (2023)

Abstract
Big Data stream processing engines such as Apache Flink use windowing techniques to handle unbounded streams of events. For event-time windowing, gathering all pertinent input within a window is crucial because it determines how accurate the results are. Watermarks, which are special timestamps that indicate the progress of event time, play a significant part in this process. However, the current watermark generation method in Apache Flink works at the level of the input stream and tends to favor faster sub-streams, so events from slower sub-streams are dropped. In our analysis, we found that Apache Flink's vanilla watermark generation approach caused around 33% loss of data if 50% of the keys around the median are delayed. Furthermore, the loss surpassed 37% when 50% of random keys are delayed. In this paper, we present a novel strategy called keyed watermarks to overcome data loss and increase the accuracy of data processing to at least 99% in most cases. We enable separate progress tracking by creating a unique watermark for each logical sub-stream (key). We outline the architectural and API changes necessary to implement keyed watermarks and discuss our experience in extending Apache Flink's enormous code base. Additionally, we compare the effectiveness of our strategy against the conventional watermark generation method in terms of the accuracy of event-time tracking.
Keywords
Keyed Watermarks, Big Data Stream Processing, Event-Time Tracking, Apache Flink
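
The difference between a stream-level watermark and a per-key watermark can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions, not the paper's implementation and not Apache Flink's API: the function names, the window size, the zero out-of-orderness bound, the per-event watermark update, and the sample events are all hypothetical choices made for illustration.

```python
from collections import defaultdict

WINDOW_SIZE = 10  # tumbling window length in event-time units (assumed value)


def window_end(ts):
    """Exclusive end of the tumbling window that contains timestamp ts."""
    return (ts // WINDOW_SIZE + 1) * WINDOW_SIZE


def count_dropped(events, keyed):
    """Count events that arrive after the watermark has passed their window.

    events: list of (key, event_timestamp) pairs in arrival order.
    keyed:  if True, track one watermark per key; otherwise a single
            watermark is shared by the whole stream.
    Simplification: the watermark is the largest timestamp seen so far
    (no out-of-orderness bound, no allowed lateness).
    """
    watermark = defaultdict(lambda: float("-inf"))
    dropped = 0
    for key, ts in events:
        slot = key if keyed else "global"
        if window_end(ts) <= watermark[slot]:
            dropped += 1  # the window already closed: the event is late
        watermark[slot] = max(watermark[slot], ts)
    return dropped


# Hypothetical arrival order: the "slow" key lags behind the "fast" key.
events = [
    ("fast", 1), ("slow", 2), ("fast", 5), ("fast", 9),
    ("fast", 12),  # a shared watermark now passes the end of window [0, 10)
    ("slow", 4),   # late under the shared watermark
    ("slow", 8),   # late under the shared watermark
    ("fast", 15),
    ("slow", 11),  # belongs to window [10, 20), which is still open
]

print("dropped with stream-level watermark:", count_dropped(events, keyed=False))
print("dropped with per-key watermarks:", count_dropped(events, keyed=True))
```

On this made-up input, the stream-level variant drops the two delayed events of the slow key, while the per-key variant drops none, because each key's progress is tracked independently.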