Does data sampling improve deep learning-based vulnerability detection? Yeas! and Nays!

ICSE(2023)

引用 0|浏览4
暂无评分
摘要
Recent progress in Deep Learning (DL) has sparked interest in using DL to detect software vulnerabilities automatically and it has been demonstrated promising results at detecting vulnerabilities. However, one prominent and practical issue for vulnerability detection is data imbalance. Prior study observed that the performance of state-of-the-art (SOTA) DL-based vulnerability detection (DLVD) approaches drops precipitously in real world imbalanced data and a 73% drop of F1-score on average across studied approaches. Such a significant performance drop can disable the practical usage of any DLVD approaches. Data sampling is effective in alleviating data imbalance for machine learning models and has been demonstrated in various software engineering tasks. Therefore, in this study, we conducted a systematical and extensive study to assess the impact of data sampling for data imbalance problem in DLVD from two aspects: i) the effectiveness of DLVD, and ii) the ability of DLVD to reason correctly (making a decision based on real vulnerable statements). We found that in general, oversampling outperforms undersampling, and sampling on raw data outperforms sampling on latent space, typically random oversampling on raw data performs the best among all studied ones (including advanced one SMOTE and OSS). Surprisingly, OSS does not help alleviate the data imbalance issue in DLVD. If the recall is pursued, random undersampling is the best choice. Random oversampling on raw data also improves the ability of DLVD approaches for learning real vulnerable patterns. However, for a significant portion of cases (at least 33% in our datasets), DVLD approach cannot reason their prediction based on real vulnerable statements. We provide actionable suggestions and a roadmap to practitioners and researchers.
更多
查看译文
关键词
Vulnerability detection, deep learning, data sampling, interpretable AI
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要