Exploring Transferability of Multimodal Adversarial Samples for Vision-Language Pre-training Models with Contrastive Learning
arXiv (2023)
Abstract
The integration of visual and textual data in Vision-Language Pre-training
(VLP) models is crucial for enhancing vision-language understanding. However,
the adversarial robustness of these models, especially in the alignment of
image-text features, has not yet been sufficiently explored. In this paper, we
introduce a novel gradient-based multimodal adversarial attack method,
underpinned by contrastive learning, to improve the transferability of
multimodal adversarial samples in VLP models. This method concurrently
generates adversarial texts and images under imperceptible perturbation
constraints, employing both image-text and intra-modal contrastive losses. We
evaluate the effectiveness of our approach on image-text retrieval and visual entailment
effectiveness of our approach on image-text retrieval and visual entailment
tasks, using publicly available datasets in a black-box setting. Extensive
experiments indicate a significant advancement over existing single-modal
transfer-based adversarial attack methods and current multimodal adversarial
attack approaches.
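
To make the core idea concrete, below is a minimal PyTorch sketch of the image-side step such an attack might take: a PGD-style loop that ascends an InfoNCE-style image-text contrastive loss while projecting the perturbation back into an imperceptible L-infinity budget. The `image_encoder` and `text_embeds` interfaces, the fixed 0.07 temperature, and all hyperparameters are illustrative assumptions, not the paper's actual implementation (which also perturbs the text and adds an intra-modal contrastive term).

```python
import torch
import torch.nn.functional as F

def pgd_contrastive_attack(image_encoder, text_embeds, images,
                           eps=8 / 255, alpha=2 / 255, steps=10):
    """Hypothetical sketch: perturb `images` to break image-text alignment.

    image_encoder: maps a batch of images to embedding vectors (assumed).
    text_embeds:   L2-normalized embeddings of the paired captions (assumed).
    eps/alpha/steps: illustrative L-inf budget, step size, and iteration count.
    """
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        img_embeds = F.normalize(image_encoder(adv), dim=-1)
        # Temperature-scaled cosine-similarity logits between all pairs.
        logits = img_embeds @ text_embeds.t() / 0.07
        labels = torch.arange(len(images), device=images.device)
        # InfoNCE-style loss; maximizing it pushes each image away from
        # its matched caption and toward mismatched ones.
        loss = F.cross_entropy(logits, labels)
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                 # gradient ascent
            adv = images + (adv - images).clamp(-eps, eps)  # project to eps-ball
            adv = adv.clamp(0, 1).detach()                  # keep valid pixels
    return adv
```

Ascending a contrastive loss, rather than a per-image classification loss, targets the shared embedding geometry that VLP models rely on, which is one plausible reason such perturbations transfer across models in a black-box setting.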