Studying the Practices of Testing Machine Learning Software in the Wild
CoRR (2023)
Abstract
Background: We are witnessing an increasing adoption of machine learning
(ML), especially deep learning (DL) algorithms in many software systems,
including safety-critical systems such as health care systems or autonomous
driving vehicles. Ensuring the software quality of these systems remains an open
challenge for the research community, mainly due to the inductive nature of ML
software systems. Traditionally, software systems were constructed deductively,
by writing down the rules that govern the behavior of the system as program
code. However, for ML software, these rules are inferred from training data.
A few recent research advances in the quality assurance of ML systems have
adapted concepts from traditional software testing, such as mutation
testing, to help improve the reliability of ML software systems. However, it is
unclear if any of these proposed testing techniques from research are adopted
in practice. There is little empirical evidence about the testing strategies of
ML engineers. Aims: To fill this gap, we perform the first fine-grained
empirical study on ML testing practices in the wild, to identify the ML
properties being tested, the testing strategies followed, and their
implementation throughout the ML workflow. Method: First, we systematically
summarized the different testing strategies (e.g., Oracle Approximation), the
tested ML properties (e.g., Correctness, Bias, and Fairness), and the testing
methods (e.g., unit tests) from the literature. Then, we conducted a study to
understand the practices of testing ML software. Results: 1) We identified four
major categories of testing strategies used by ML engineers to find software
bugs: Grey-box, White-box, Black-box, and Heuristic-based techniques. 2) We
identified 16 ML properties that are tested throughout the ML workflow.
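
To make the "Oracle Approximation" strategy mentioned above concrete, the following is a minimal illustrative sketch (not taken from the paper) of what such a unit test might look like in Python: because the exact expected predictions of an ML model are unknown, the test compares the model under test against a cheaper pseudo-oracle within a tolerance. The dataset, models, and tolerance value are assumptions chosen only for illustration.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression


def test_model_agrees_with_approximate_oracle():
    # Synthetic, mostly linear data (assumption made for this sketch).
    X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

    # Cheap pseudo-oracle: a simple model whose behavior we trust on this data.
    oracle = LinearRegression().fit(X, y)

    # Model under test: no exact expected outputs exist for it.
    model = RandomForestRegressor(random_state=0).fit(X, y)

    # Oracle approximation: instead of asserting exact predictions, assert that
    # the model stays within a tolerance band around the pseudo-oracle.
    diff = np.abs(model.predict(X) - oracle.predict(X))
    assert np.mean(diff) < 10.0, "model deviates too far from the pseudo-oracle"

Such a test could be run with a standard test runner (e.g., pytest); the key design choice is that correctness is checked relative to an approximate oracle rather than to fixed expected values.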