Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
arXiv (2024)
Abstract
Multimodal large language models (MLLMs) have shown impressive reasoning
abilities, yet they are also more vulnerable to jailbreak attacks than
their LLM predecessors. We observe that although MLLMs remain capable of
detecting unsafe responses, the safety mechanisms of their pre-aligned
LLMs are easily bypassed once image features are introduced. To construct
robust MLLMs, we propose ECSO (Eyes Closed, Safety On), a novel
training-free protection approach that exploits the inherent safety
awareness of MLLMs and generates safer responses by adaptively
transforming unsafe images into text, thereby activating the intrinsic
safety mechanism of the pre-aligned LLMs in MLLMs.
Experiments on five state-of-the-art (SoTA) MLLMs demonstrate that ECSO
significantly enhances model safety (e.g., a 37.6% improvement on
MM-SafetyBench (SD+OCR) and 71.3% on VLSafe with LLaVA-1.5-7B), while
consistently maintaining utility on common MLLM benchmarks.
Furthermore, we show that ECSO can be used as a data engine to generate
supervised-finetuning (SFT) data for MLLM alignment without extra human
intervention.
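The pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `MockMLLM` class and its `generate`/`judge_harmful` methods are hypothetical stand-ins for a real MLLM's answering, self-evaluation, and captioning calls, with hard-coded behavior that mimics the phenomenon the paper reports (image features bypassing safety, while the text-only LLM still refuses).

```python
from typing import Optional


class MockMLLM:
    """Hypothetical MLLM interface; a real system would wrap e.g. LLaVA."""

    def generate(self, image: Optional[str], text: str) -> str:
        if image is None:
            # Text-only path: the pre-aligned LLM's safety mechanism fires.
            if "bomb" in text.lower():
                return "I cannot help with that request."
            return f"(text-only answer to: {text})"
        if text == "Describe the image.":
            # Captioning call used by the "eyes closed" fallback.
            return "A diagram showing how to build a bomb."
        # With image features present, safety is bypassed and an
        # unsafe answer leaks out (the vulnerability ECSO targets).
        return "Step 1: acquire materials..."

    def judge_harmful(self, answer: str) -> bool:
        # Self-check: MLLMs can still *detect* unsafe responses.
        return "step 1: acquire" in answer.lower()


def ecso_respond(mllm: MockMLLM, image: Optional[str], prompt: str) -> str:
    """ECSO-style answering: answer, self-check, and if harmful,
    transform the image into text and re-answer with eyes closed."""
    answer = mllm.generate(image=image, text=prompt)
    if not mllm.judge_harmful(answer):
        return answer  # initial answer already safe
    # Query-aware transformation is simplified here to a plain caption.
    caption = mllm.generate(image=image, text="Describe the image.")
    safe_prompt = f"Image description: {caption}\n{prompt}"
    return mllm.generate(image=None, text=safe_prompt)  # text-only pass
```

In this toy run, an unsafe image query falls through to the text-only pass and is refused, while benign queries return their original answers unchanged; a real deployment would replace the keyword checks with the MLLM's own self-evaluation prompt.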