Learning Defect Prediction from Unrealistic Data

2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2024

Abstract
Pretrained models of code, such as CodeBERT and CodeT5, have become popular choices for code understanding and generation tasks. Such models tend to be large and require commensurate volumes of training data, which are rarely available for downstream tasks. Instead, it has become popular to train models on far larger but less realistic datasets, such as functions with artificially injected bugs. Models trained on such data, however, tend to perform well only on similar data, while underperforming on real-world programs. In this paper, we conjecture that this discrepancy stems from the presence of distracting samples that steer the model away from the real-world task distribution. To investigate this conjecture, we propose an approach for identifying the subsets of these large yet unrealistic datasets that are most similar to examples in real-world datasets, based on their learned representations. Our approach extracts high-dimensional embeddings of both real-world and artificial programs using a neural model and scores artificial samples by their distance to the nearest real-world sample. We show that training on only the nearest, representationally most similar samples, while discarding samples that are not at all similar in representation, yields consistent improvements across two popular pretrained models of code on two code understanding tasks. Our results are promising, in that they show that training models on a representative subset of an unrealistic dataset can help us harness the power of large-scale synthetic data generation while preserving downstream task performance. Finally, we highlight the limitations of applying AI models for predicting vulnerabilities and bugs in real-world applications.
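To make the filtering step described in the abstract concrete, below is a minimal sketch of embedding-based nearest-neighbor selection. It assumes a CodeBERT encoder with mean pooling, Euclidean distance, and a keep_fraction cutoff; these are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch: embed real and synthetic functions with a pretrained
# code model, then keep only the synthetic samples whose embedding lies
# closest to some real-world sample. Model, pooling, and keep_fraction are
# assumptions for illustration, not the authors' exact setup.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

def embed(functions: list[str]) -> np.ndarray:
    """Mean-pooled last-layer embeddings, one row per function."""
    vectors = []
    with torch.no_grad():
        for source in functions:
            inputs = tokenizer(source, truncation=True, max_length=512,
                               return_tensors="pt")
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
            vectors.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.stack(vectors)

def filter_synthetic(real_fns, synthetic_fns, keep_fraction=0.5):
    """Keep the synthetic samples nearest (in embedding space) to any real sample."""
    real_emb = embed(real_fns)
    synth_emb = embed(synthetic_fns)
    # Distance from each synthetic sample to its nearest real-world neighbor.
    dists = np.linalg.norm(
        synth_emb[:, None, :] - real_emb[None, :, :], axis=-1).min(axis=1)
    # Rank by that distance and keep the closest fraction for training.
    keep = np.argsort(dists)[: int(len(synthetic_fns) * keep_fraction)]
    return [synthetic_fns[i] for i in keep]
```

The pairwise-distance broadcast is adequate for a sketch; at dataset scale, an approximate nearest-neighbor index would be the natural substitute.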
Keywords
n/a