Can Machine Learning Pipelines Be Better Configured?

Proceedings of the 31st ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023 (2023)

Abstract
A Machine Learning (ML) pipeline configures the workflow of a learning task using the APIs provided by ML libraries. However, a pipeline's performance can vary significantly across different configurations of ML library versions. Misconfigured pipelines can result in inferior performance, such as inefficient executions, numeric errors and even crashes. A pipeline is subject to misconfiguration if it exhibits significantly inconsistent performance when the versions of its configured libraries, or the combination of these libraries, change. We refer to such performance inconsistency as a pipeline configuration (PLC) issue. A systematic understanding of PLC issues helps configure effective ML pipelines and identify misconfigured ones. To this end, we conduct the first empirical study of the pervasiveness, impact and root causes of PLC issues. To facilitate scalable in-depth analysis, we develop Piecer, an infrastructure that automatically generates a set of pipeline variants by varying the version combinations of ML libraries and detects their performance inconsistencies. We apply Piecer to the 3,380 pipelines that can be deployed out of the 11,363 ML pipelines collected from multiple ML competitions on the Kaggle platform. The empirical study results show that 1,092 (32.3%) of the 3,380 pipelines manifest significant performance inconsistencies on at least one variant. We find that 399, 243 and 440 pipelines can achieve better competition scores, execution time and memory usage, respectively, by adopting a different configuration. Based on our findings, we construct a repository containing 164 defective APIs and 106 API combinations from 418 library versions. The defective API repository facilitates future studies of automated detection techniques for PLC issues. Leveraging the repository, we capture PLC issues in 309 real-world ML pipelines.
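
The variant-generation and inconsistency-detection idea described in the abstract can be illustrated with a minimal sketch. This is not Piecer's implementation: the function names, the library names `libA` and `libB`, the 20% relative threshold and the metric values are all assumptions made for illustration only.

```python
from itertools import product


def generate_variants(candidate_versions):
    """Enumerate every combination of library versions.

    candidate_versions maps a (hypothetical) library name to the list of
    versions to try; each returned dict describes one pipeline variant.
    """
    libs = sorted(candidate_versions)
    version_lists = [candidate_versions[lib] for lib in libs]
    return [dict(zip(libs, combo)) for combo in product(*version_lists)]


def find_inconsistent(baseline, results, rel_threshold=0.2):
    """Flag variants whose metrics deviate from the baseline.

    A metric is flagged when it is missing (for example, the variant
    crashed) or when its relative deviation from the baseline exceeds
    rel_threshold. The threshold is an assumed value, not one taken
    from the paper.
    """
    flagged = []
    for variant_id, metrics in results.items():
        for name, base_value in baseline.items():
            value = metrics.get(name)
            if value is None:
                flagged.append((variant_id, name, None))
                continue
            if base_value == 0:
                deviates = value != 0
            else:
                deviates = abs(value - base_value) / abs(base_value) > rel_threshold
            if deviates:
                flagged.append((variant_id, name, value))
    return flagged


# Illustrative inputs only: two hypothetical libraries and made-up metrics.
variants = generate_variants({"libA": ["1.0", "1.1"], "libB": ["2.0", "2.1", "2.2"]})
baseline = {"score": 0.80, "seconds": 100.0}
results = {
    0: {"score": 0.81, "seconds": 105.0},  # within the threshold on both metrics
    1: {"score": 0.60, "seconds": 98.0},   # score drops by 25%
    2: {"score": 0.80},                    # execution time was never recorded
}
issues = find_inconsistent(baseline, results)
```

In this sketch each variant would be executed in its own environment and its metrics recorded in `results`; the exhaustive product grows quickly with the number of libraries, so a real infrastructure would likely need to prune the combinations it tries.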
Keywords
Machine Learning Libraries,Empirical Study