Do Neutral Prompts Produce Insecure Code? FormAI-v2 Dataset: Labelling Vulnerabilities in Code Generated by Large Language Models
arXiv (2024)
Abstract
This study provides a comparative analysis of state-of-the-art large language
models (LLMs), analyzing how likely they are to generate vulnerabilities when
writing simple C programs from a neutral zero-shot prompt. We address a significant
gap in the literature concerning the security properties of code produced by
these models without specific directives. N. Tihanyi et al. introduced the
FormAI dataset at PROMISE '23, containing 112,000 GPT-3.5-generated C programs,
with over 51.24% of them identified as vulnerable. We expand this work by
introducing the FormAI-v2 dataset, comprising 265,000 compilable C programs
generated using various LLMs, ranging from robust models such as Google's
GEMINI-pro, OpenAI's GPT-4, and TII's 180-billion-parameter Falcon, to Meta's
specialized 13-billion-parameter CodeLLama2 and various other compact models.
Each program in
the dataset is labelled based on the vulnerabilities detected in its source
code through formal verification using the Efficient SMT-based Context-Bounded
Model Checker (ESBMC). This technique eliminates false positives by producing
a concrete counterexample for each reported vulnerability, and it rules out
false negatives whenever the verification process runs to completion. Our
study reveals that at least 63.47% of the generated programs are vulnerable.
The differences between the models are minor, as they
all display similar coding errors with slight variations. Our research
highlights that while LLMs offer promising capabilities for code generation,
deploying their output in a production environment requires risk assessment and
validation.
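
To make the vulnerability labelling concrete, below is a hypothetical illustration (not a program from the dataset, and not the authors' example) of the kind of short C program a neutral zero-shot prompt tends to produce: it compiles cleanly, yet contains an unchecked bound of the sort a bounded model checker such as ESBMC flags, together with a counterexample input.

/* Hypothetical illustration (not from FormAI-v2) of a typical
 * LLM-generated C program: it compiles without warnings, but the
 * missing bounds check below is the kind of defect formal
 * verification reports as an array-bounds violation. */
#include <stdio.h>

#define MAX_SCORES 5

int main(void) {
    int scores[MAX_SCORES];
    int n;

    printf("How many scores? ");
    if (scanf("%d", &n) != 1)
        return 1;

    /* No check that 0 <= n <= MAX_SCORES: any n > MAX_SCORES makes
     * the loop write past the end of `scores`. A bounded model
     * checker explores the input space symbolically and returns a
     * concrete value of n that triggers the out-of-bounds write. */
    for (int i = 0; i < n; i++)
        scores[i] = i * 10;

    printf("Stored %d scores.\n", n);
    return 0;
}

Verifying such a file with ESBMC (for instance, an invocation along the lines of "esbmc scores.c --unwind 16"; the exact flags depend on the tool version and the property checks enabled, so treat this command as an assumption) either proves the bounded program safe or emits a violated property together with an input trace, which is the counterexample mechanism the abstract refers to.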