A Repository-Level Dataset For Detecting, Classifying and Repairing Software Vulnerabilities
CoRR (2024)
Abstract
Open-Source Software (OSS) vulnerabilities pose great challenges to
software security and potential risks to our society. Enormous efforts
have been devoted to automated vulnerability detection, among which deep
learning (DL)-based approaches have proven to be the most effective. However,
the current labeled data present the following limitations: (1) Tangled
Patches: Developers may submit code changes unrelated to vulnerability fixes
within patches, leading to tangled patches. (2) Lacking
Inter-procedural Vulnerabilities: The existing vulnerability datasets
typically contain function-level and file-level vulnerabilities, ignoring the
relations between functions, which leaves approaches trained on them unable to
detect inter-procedural vulnerabilities. (3) Outdated Patches: The existing
datasets usually contain outdated patches, which may bias the model during
training.
To address the above limitations, in this paper, we propose an automated data
collection framework and construct the first repository-level high-quality
vulnerability dataset named ReposVul. The proposed framework mainly
contains three modules: (1) A vulnerability untangling module, aiming at
distinguishing vulnerability-fixing related code changes from tangled patches,
in which Large Language Models (LLMs) and static analysis tools are jointly
employed. (2) A multi-granularity dependency extraction module, aiming at
capturing the inter-procedural call relationships of vulnerabilities, in which
we construct multi-granularity information for each vulnerability patch,
including repository-level, file-level, function-level, and line-level. (3) A
trace-based filtering module, aiming at filtering out outdated patches, which
leverages a file path trace-based filter and a commit time trace-based filter
to construct an up-to-date dataset.
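
The abstract does not specify how the trace-based filters decide that a patch is outdated. As a minimal sketch of the general idea only, the Python below assumes one simple rule: among patches that touch the same file path, keep the one with the latest commit time. The `Patch` record, its fields, and `filter_outdated` are hypothetical names introduced for illustration and are not taken from ReposVul.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Patch:
    """Hypothetical patch record; the fields are assumptions for illustration."""
    commit_id: str
    file_path: str
    committed_at: datetime


def filter_outdated(patches):
    """Keep only the most recent patch per file path.

    This rule is an assumed stand-in for a commit-time filter, not the
    procedure used to build ReposVul.
    """
    latest = {}
    for patch in patches:
        current = latest.get(patch.file_path)
        # Replace the stored patch when a later commit touches the same path.
        if current is None or patch.committed_at > current.committed_at:
            latest[patch.file_path] = patch
    # Return the survivors in chronological order.
    return sorted(latest.values(), key=lambda p: p.committed_at)


patches = [
    Patch("a1", "src/parse.c", datetime(2019, 3, 1)),
    Patch("b2", "src/parse.c", datetime(2022, 7, 15)),
    Patch("c3", "src/net.c", datetime(2021, 1, 10)),
]
kept = filter_outdated(patches)
```

Here the 2019 patch to `src/parse.c` is dropped because a later patch touches the same path. A real filter would likely also need to follow file renames, which is presumably what tracing file paths addresses.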