PatchDB: A Large-Scale Security Patch Dataset

2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) (2021)

Abstract
Security patches, which embed both the vulnerable code and the corresponding fix, are of great significance to vulnerability detection and software maintenance. However, existing patch datasets suffer from insufficient samples and limited variety. In this paper, we construct a large-scale patch dataset called PatchDB that consists of three components: an NVD-based dataset, a wild-based dataset, and a synthetic dataset. The NVD-based dataset is extracted from the patch hyperlinks indexed by the National Vulnerability Database (NVD). The wild-based dataset contains security patches that we collect from commits on GitHub. To make data collection more efficient and reduce the manual verification effort, we develop a new nearest link search method that helps find the most promising security patch candidates. Moreover, we provide a synthetic dataset built with a new oversampling method that synthesizes patches at the source code level by enriching the control flow variants of the original patches. We conduct a set of studies to investigate the effectiveness of the proposed algorithms and to evaluate the properties of the collected dataset. The experimental results show that PatchDB can help improve the performance of security patch identification.
Keywords
security patch, open source software, dataset
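
The abstract names a nearest link search for picking promising security patch candidates but does not spell out the procedure. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes each patch is already represented as a numeric feature vector, that similarity is measured by Euclidean distance, and that every verified patch is linked to a distinct unlabeled commit in greedy order of distance. The function name `nearest_link_search` and its parameters are hypothetical.

```python
import math


def nearest_link_search(seeds, candidates):
    """Greedily link each verified patch to a distinct unlabeled commit.

    seeds:      feature vectors of verified security patches (assumed input)
    candidates: feature vectors of unlabeled commits (assumed input)
    Returns a dict mapping seed index -> linked candidate index.
    """
    # Enumerate every (distance, seed, candidate) pair, closest first.
    pairs = sorted(
        (math.dist(s, c), i, j)
        for i, s in enumerate(seeds)
        for j, c in enumerate(candidates)
    )

    links = {}    # seed index -> candidate index
    used = set()  # candidates that are already linked

    # Accept a pair only if both of its endpoints are still unlinked,
    # so no candidate is selected twice.
    for _, i, j in pairs:
        if i not in links and j not in used:
            links[i] = j
            used.add(j)
    return links


if __name__ == "__main__":
    verified = [(0.0, 0.0), (10.0, 10.0)]
    wild = [(9.0, 9.0), (1.0, 0.0), (50.0, 50.0)]
    print(nearest_link_search(verified, wild))
```

Under these assumptions, the linked candidates would be the commits sent for manual verification, while distant commits such as the third one in the example are never selected. The one-to-one constraint is a design choice of this sketch: it keeps several verified patches from all pointing at the same commit.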