Improving Fine-tuning Pre-trained Models on Small Source Code Datasets via Variational Information Bottleneck

2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2023

Abstract
Small datasets are common in software engineering tasks such as linguistic smell detection and code runtime complexity prediction, since crafting these datasets often requires expert knowledge. Prior work usually tackles them with machine learning algorithms (e.g., logistic regression and SVM) over hand-crafted features, which can outperform neural models such as CNNs. Recently, researchers have fine-tuned large pre-trained code models on various code-related tasks, exploiting their transferability. However, fine-tuning on small datasets may still be unstable and prone to overfitting. In this paper, we first conduct an empirical study of fine-tuning CodeBERT on four small code-related datasets and observe this instability. It could be induced by the over-capacity of these large pre-trained code models and the features they encode that are irrelevant to such small datasets. To address this issue, we leverage the variational information bottleneck to filter out irrelevant features during fine-tuning. Experiments demonstrate that our method outperforms standard fine-tuning and regularization methods such as dropout and weight decay. We also experimentally study the stability of our method by varying the dataset size. Our code and data are available at https://github.com/little-pikachu-hash/VIBCodeBERT.
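To illustrate the idea at a high level, the following is a minimal sketch (not the authors' implementation; see their repository for the actual code) of a variational information bottleneck head placed on top of a CodeBERT encoder, assuming PyTorch and HuggingFace Transformers. The hypothetical `VIBClassifier` compresses the encoder output into a stochastic latent vector and adds a KL penalty toward a standard normal prior, which pressures the model to discard task-irrelevant features; the `latent_dim` and `beta` values are illustrative.

```python
# Sketch of VIB-regularized fine-tuning on CodeBERT (assumed setup, not the paper's exact code).
import torch
import torch.nn as nn
from transformers import AutoModel


class VIBClassifier(nn.Module):
    def __init__(self, model_name="microsoft/codebert-base",
                 latent_dim=128, num_labels=2, beta=1e-3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.mu = nn.Linear(hidden, latent_dim)      # posterior mean
        self.logvar = nn.Linear(hidden, latent_dim)  # posterior log-variance
        self.classifier = nn.Linear(latent_dim, num_labels)
        self.beta = beta                             # weight of the KL term

    def forward(self, input_ids, attention_mask, labels=None):
        # Use the [CLS] token representation as the sequence-level feature.
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, so gradients flow through sampling.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        logits = self.classifier(z)
        if labels is None:
            return logits
        # KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior, averaged over the batch.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()
        ce = nn.functional.cross_entropy(logits, labels)
        return ce + self.beta * kl, logits
```

During fine-tuning, the combined loss (cross-entropy plus the beta-weighted KL term) replaces the plain classification loss; at inference, one can use the posterior mean instead of sampling.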
Keywords
pre-trained model, fine-tuning, small software engineering dataset, information bottleneck