Evaluating the Efficacy of Prompt-Engineered Large Multimodal Models Versus Fine-Tuned Vision Transformers in Image-Based Security Applications
arxiv(2024)
摘要
The success of Large Language Models (LLMs) has led to a parallel rise in the
development of Large Multimodal Models (LMMs), such as Gemini-pro, which have
begun to transform a variety of applications. These sophisticated multimodal
models are designed to interpret and analyze complex data, integrating both
textual and visual information on a scale previously unattainable, opening new
avenues for a range of applications. This paper investigates the applicability
and effectiveness of prompt-engineered Gemini-pro LMMs versus fine-tuned Vision
Transformer (ViT) models in addressing critical security challenges. We focus
on two distinct tasks: a visually evident task of detecting simple triggers,
such as small squares in images, indicative of potential backdoors, and a
non-visually evident task of malware classification through visual
representations. Our results highlight a significant divergence in performance,
with Gemini-pro falling short in accuracy and reliability when compared to
fine-tuned ViT models. The ViT models, on the other hand, demonstrate
exceptional accuracy, achieving near-perfect performance on both tasks. This
study not only showcases the strengths and limitations of prompt-engineered
LMMs in cybersecurity applications but also emphasizes the unmatched efficacy
of fine-tuned ViT models for precise and dependable tasks.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要