On combining system and machine learning performance tuning for distributed data stream applications

Distributed and Parallel Databases (2023)

Abstract
The growing need to identify patterns in data and automate decisions based on them in near real time has stimulated the development of new machine learning (ML) applications that process continuous data streams. However, deploying ML applications over distributed stream processing engines (DSPEs) such as Apache Spark Streaming is a complex procedure that requires extensive tuning along two dimensions. First, DSPEs have a plethora of system configuration parameters, such as degree of parallelism and memory buffer sizes, that directly affect application throughput and/or latency and need to be optimized. Second, ML models have their own set of hyperparameters that require tuning, as they can significantly affect the prediction accuracy of the trained model. These two forms of tuning have been studied extensively in the literature, but only in isolation from each other. This manuscript presents a comprehensive experimental study that combines system configuration and hyperparameter tuning of ML applications over DSPEs. The experimental results reveal unexpected and complex interactions between the choices of system configurations and hyperparameters, and their impact on both application and model performance. These insights motivate the need for new combined system and ML model tuning approaches, and open up new research directions in the field of self-managing distributed stream processing systems.
Keywords
data stream applications, tuning, machine learning, performance
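
To make the two tuning dimensions described in the abstract concrete, the following is a minimal sketch of what a combined search might look like. It is not the method used in the study: it assumes a plain grid search over the cross product of system parameters and hyperparameters, followed by a Pareto filter on two higher-is-better metrics (throughput and accuracy). The parameter names (`parallelism`, `batch_interval_ms`, `learning_rate`) and the `evaluate` callable are hypothetical placeholders and do not correspond to actual Spark configuration keys.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Config = Dict[str, object]
Metrics = Tuple[float, float]  # (throughput, accuracy), both higher-is-better


def joint_grid(system_space: Dict[str, Sequence],
               model_space: Dict[str, Sequence]) -> List[Config]:
    """Enumerate the cross product of system parameters and hyperparameters.

    Assumes the two spaces use distinct parameter names.
    """
    merged = {**system_space, **model_space}
    names = list(merged)
    return [dict(zip(names, values))
            for values in product(*(merged[name] for name in names))]


def pareto_front(results: List[Tuple[Config, Metrics]]
                 ) -> List[Tuple[Config, Metrics]]:
    """Keep the configurations whose metrics are not dominated by another's."""
    def dominates(a: Metrics, b: Metrics) -> bool:
        # a is at least as good on both metrics and differs on at least one
        return a[0] >= b[0] and a[1] >= b[1] and a != b

    return [(cfg, m) for cfg, m in results
            if not any(dominates(other, m) for _, other in results)]


def tune_jointly(system_space: Dict[str, Sequence],
                 model_space: Dict[str, Sequence],
                 evaluate: Callable[[Config], Metrics]
                 ) -> List[Tuple[Config, Metrics]]:
    """Evaluate every joint configuration and return the Pareto-optimal ones.

    `evaluate` is a hypothetical, user-supplied function that deploys the
    application with the given configuration and measures both metrics.
    """
    results = [(cfg, evaluate(cfg))
               for cfg in joint_grid(system_space, model_space)]
    return pareto_front(results)
```

Tuning the two spaces in isolation would fix one dictionary and sweep the other; enumerating them jointly is what allows interactions between system configurations and hyperparameters to surface, at the cost of a search space that grows multiplicatively.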