Cascade & allocate: A cross-structure adversarial attack against models fusing vision and language

Information Fusion (2024)

Abstract
Data fusion systems that combine multiple modalities have raised widespread concern about their adversarial robustness. Recent image captioning attacks lack cross-structure transferability against models with diverse architectures, so attackers perform well only when they have full or partial knowledge of the target captioning model. The cross-structure transferability of multi-modal adversarial examples has not yet been thoroughly investigated. In this paper, we propose a theorem that analyzes the upper bound of captioning adversarial transferability, and we design a new transfer-based adversarial attack (Cascade & Allocate) against image captioning models. The method consists of two steps. The first step randomly selects a set of candidate models with diverse structures and uses a momentum-like strategy to generate perturbations. This 'model cascade' step narrows the gap between the gradient directions of different image captioning models, thereby enhancing cross-model transferability and producing model-agnostic adversarial perturbations. The second step applies the upper-bound transferability theorem to allocate a perturbation weight to each model. This 'weight allocate' step increases the weight of the candidate model whose gradient is most consistent with the others, yielding more transferable adversarial examples. In experiments, we compare our approach with other captioning and ensemble-based attacks against five black-box models with different structures on the MS COCO and Flickr-30k datasets. The resulting adversarial examples achieve state-of-the-art transferability against the five black-box models, demonstrating that our approach is practical and generalizes well to a wide range of captioning models.
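To make the two-step idea concrete, the following is a minimal sketch of an ensemble, momentum-style attack with per-model weight allocation driven by gradient consistency. It is not the authors' released implementation; the surrogate captioning losses (`loss_fns`), the L-inf budget, and the cosine-similarity weighting rule (standing in for the paper's upper-bound-based allocation) are illustrative assumptions.

```python
# Sketch of a 'cascade & allocate'-style transfer attack (assumptions noted above).
import torch
import torch.nn.functional as F

def cascade_allocate_attack(image, loss_fns, eps=8/255, steps=10, mu=1.0):
    """image: (1, 3, H, W) tensor in [0, 1].
    loss_fns: list of callables; each maps a perturbed image to a captioning
    loss of one surrogate model w.r.t. a fixed reference caption (assumed)."""
    alpha = eps / steps                              # per-step step size
    delta = torch.zeros_like(image, requires_grad=True)
    momentum = torch.zeros_like(image)

    for _ in range(steps):
        grads = []
        for loss_fn in loss_fns:                     # 'model cascade': query every surrogate
            loss = loss_fn(image + delta)
            g, = torch.autograd.grad(loss, delta)
            grads.append(g / (g.abs().mean() + 1e-12))   # L1-normalised gradient

        # 'weight allocate': up-weight the surrogate whose gradient agrees most
        # with the ensemble mean (cosine similarity as a stand-in criterion).
        mean_g = torch.stack(grads).mean(dim=0)
        sims = torch.stack([F.cosine_similarity(g.flatten(), mean_g.flatten(), dim=0)
                            for g in grads])
        weights = torch.softmax(sims, dim=0)

        fused = sum(w * g for w, g in zip(weights, grads))
        momentum = mu * momentum + fused             # momentum-like accumulation
        delta = (delta + alpha * momentum.sign()).clamp(-eps, eps)
        delta = (image + delta).clamp(0, 1) - image  # keep the image valid
        delta = delta.detach().requires_grad_(True)

    return (image + delta).clamp(0, 1).detach()
```

In this sketch the allocation step simply favours the surrogate whose gradient direction is most consistent with the ensemble average at each iteration; the paper instead derives the weights from its transferability upper bound.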
Keywords
Adversarial attacks, Image captioning, Cross-structure transferability, Multi-modal, Deep neural networks, Vision-to-language