On the effectiveness of graph data augmentation for source code learning

Zeming Dong, Qiang Hu, Zhenya Zhang, Jianjun Zhao

Knowledge-Based Systems (2024)

Abstract

Source code learning refers to the use of deep learning for software engineering tasks such as bug detection. Because source code is inherently graph-structured, graph learning, supported by graph neural networks (GNNs), has been increasingly adopted in source code learning. As in other areas of deep learning, source code learning relies on large amounts of high-quality training data, and the scarcity of such data has become a primary impediment that leads to performance bottlenecks. In practice, data augmentation is often used to mitigate this issue by synthesizing additional training data from existing data. However, most existing data augmentation practice in source code learning is limited to simple program transformation methods, such as code refactoring, and is therefore not sufficiently effective. In this work, in light of the graph nature of source code, we propose applying the data augmentation methods used for graph-structured data in graph learning to source code learning tasks, and we conduct a comprehensive empirical study to evaluate whether these augmentation approaches are more effective at producing accurate and robust models. Specifically, we evaluate five data augmentation methods on four critical software engineering tasks and seven neural network architectures. Experimental results show that, compared to training without data augmentation, the Manifold-Mixup method can significantly improve both the accuracy and the robustness of the trained source code learning models, by up to 1.60% and 4.09%, respectively.
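
As an illustration of the Manifold-Mixup idea named in the abstract, the following minimal Python sketch interpolates two hidden representations and their one-hot labels with a coefficient drawn from a Beta distribution. It is not the authors' implementation: the function name `manifold_mixup`, the parameter `alpha`, and the use of plain lists as stand-ins for graph-level embeddings produced by a GNN are assumptions made for this example.

```python
import random
from typing import List, Optional, Tuple


def manifold_mixup(
    h_a: List[float],
    h_b: List[float],
    y_a: List[float],
    y_b: List[float],
    alpha: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], List[float], float]:
    """Mix two hidden representations and their labels (hypothetical helper).

    h_a, h_b: hidden vectors of two training samples, assumed here to be
              graph-level embeddings taken from some intermediate layer.
    y_a, y_b: one-hot (or soft) label vectors of the same two samples.
    alpha:    shape parameter of the Beta(alpha, alpha) distribution.
    """
    if len(h_a) != len(h_b) or len(y_a) != len(y_b):
        raise ValueError("paired inputs must have matching lengths")
    rng = rng or random.Random()
    # Mixing coefficient lambda ~ Beta(alpha, alpha), always in [0, 1].
    lam = rng.betavariate(alpha, alpha)
    # Linear interpolation in the hidden space, not in the raw input space.
    mixed_h = [lam * a + (1.0 - lam) * b for a, b in zip(h_a, h_b)]
    # The labels are interpolated with the same coefficient.
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return mixed_h, mixed_y, lam


if __name__ == "__main__":
    h, y, lam = manifold_mixup(
        [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0],
        alpha=1.0, rng=random.Random(0),
    )
    print(lam, h, y)
```

In a training loop, the mixed representation would be passed through the remaining layers of the model and the loss computed against the mixed label, so that the synthesized sample supplements the original training data.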

Keywords

Graph neural networks, Data augmentation, Source code analysis
