Vulnerability Detection with Code Language Models: How Far Are We?
arxiv(2024)
摘要
In the context of the rising interest in code language models (code LMs) and
vulnerability detection, we study the effectiveness of code LMs for detecting
vulnerabilities. Our analysis reveals significant shortcomings in existing
vulnerability datasets, including poor data quality, low label accuracy, and
high duplication rates, leading to unreliable model performance in realistic
vulnerability detection scenarios. Additionally, the evaluation methods used
with these datasets are not representative of real-world vulnerability
detection.
To address these challenges, we introduce PrimeVul, a new dataset for
training and evaluating code LMs for vulnerability detection. PrimeVul
incorporates a novel set of data labeling techniques that achieve comparable
label accuracy to human-verified benchmarks while significantly expanding the
dataset. It also implements a rigorous data de-duplication and chronological
data splitting strategy to mitigate data leakage issues, alongside introducing
more realistic evaluation metrics and settings. This comprehensive approach
aims to provide a more accurate assessment of code LMs' performance in
real-world conditions.
Evaluating code LMs on PrimeVul reveals that existing benchmarks
significantly overestimate the performance of these models. For instance, a
state-of-the-art 7B model scored 68.26
PrimeVul. Attempts to improve performance through advanced training techniques
and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin
to random guessing in the most stringent settings. These findings underscore
the considerable gap between current capabilities and the practical
requirements for deploying code LMs in security roles, highlighting the need
for more innovative research in this domain.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要