How do LLMs Support Deep Learning Testing? A Comprehensive Study Through the Lens of Image Mutation
arXiv (2024)
Abstract
Visual deep learning (VDL) systems have shown significant success in
real-world applications like image recognition, object detection, and
autonomous driving. To evaluate the reliability of VDL, a mainstream approach
is software testing, which requires diverse and controllable mutations over
image semantics. The rapid development of multi-modal large language models
(MLLMs) has introduced revolutionary image mutation potential through
instruction-driven methods. Users can now freely describe desired mutations and
let MLLMs generate the mutated images.
However, the quality of MLLM-produced test inputs in VDL testing remains
largely unexplored. We present the first study, aiming to assess MLLMs'
adequacy from 1) the semantic validity of MLLM mutated images, 2) the alignment
of MLLM mutated images with their text instructions (prompts), 3) the
faithfulness of how different mutations preserve semantics that ought to
remain unchanged, and 4) the effectiveness of detecting VDL faults. With
large-scale human studies and quantitative evaluations, we identify MLLMs'
promising potential for expanding the semantics covered by image mutations.
Notably, while SoTA MLLMs (e.g., GPT-4V) fail to support or perform worse in
editing existing semantics in images (as in traditional mutations like
rotation), they generate high-quality test inputs using "semantic-additive"
mutations (e.g., "dress a dog with clothes"), which bring extra semantics to
images; these were infeasible for past approaches. Hence, we view MLLM-based
mutations as a vital complement to traditional mutations, and advocate that future
VDL testing combine MLLM-based methods and traditional image mutations
for comprehensive and reliable testing.