Boosting Source Code Learning with Text-Oriented Data Augmentation: An Empirical Study.

International Conference on Software Quality, Reliability and Security (2023)

Abstract
Recent studies have shown surprising results in source code learning, which applies deep neural networks (DNNs) to various software engineering tasks. As in other DNN-based domains, source code learning requires massive amounts of high-quality training data to succeed. Data augmentation is a technique that produces additional training data to boost model training, and it has been widely adopted in other domains (e.g., computer vision). However, existing data augmentation practice in source code learning is limited to simple syntax-preserving methods, such as code refactoring. In this paper, based on the insight that source code can be represented sequentially as text data, we take an early step to investigate whether data augmentation methods originally designed for text are effective for source code learning. To that end, we focus on code classification tasks and conduct a comprehensive empirical study on four critical code problems and four DNN architectures to assess the effectiveness of eight data augmentation methods. Our results identify the data augmentation methods that produce more accurate models for source code learning and show that these methods remain useful even if they slightly break the syntax of the source code.
Keywords
Data Augmentation, Source Code Analysis, Program Transformation