You Augment Me: Exploring ChatGPT-based Data Augmentation for Semantic Code Search

2023 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE AND EVOLUTION, ICSME(2023)

引用 0|浏览8
暂无评分
摘要
Code search plays a crucial role in software development, enabling developers to retrieve and reuse code using natural language queries. While the performance of code search models improves with an increase in high-quality data, obtaining such data can be challenging and expensive. Recently, large language models (LLMs) such as ChatGPT have made remarkable progress in both natural and programming language understanding and generation, offering user-friendly interaction via simple prompts. Inspired by these advancements, we propose a novel approach ChatDANCE, which utilizes high-quality and diverse augmented data generated by a large language model and leverages a filtering mechanism to eliminate low-quality augmentations. Specifically, we first propose a set of ChatGPT prompting rules that are specifically designed for source code and queries. Then, we leverage ChatGPT to rewrite code and queries based on the according prompts and then propose a filtering mechanism which trains a cross-encoder from the backbone model UniXcoder to filter out code and query pairs with low matching scores. Finally, we re-train the backbone model using the obtained high-quality augmented data. Experimental results show that ChatDANCE achieves state-of-the-art performance, improving the best baseline by 13.2% (R@1) and 7% (MRR). Surprisingly, we find that this augment-filter-retrain strategy enables the backbone model (UniXcoder) to self-grow. Moreover, extensive experiments show the effectiveness of each component and ChatDANCE has stable performance under different hyperparameter settings. In addition, we conduct qualitative and quantitative analyses to investigate why ChatDANCE works well and find that it learns a more uniform distribution of representations and effectively aligns the code and query spaces. We have made the code and data anonymously available at https://anonymous.4open.science/r/ChatDANCE.
更多
查看译文
关键词
Code Search,Data Augmentation,ChatGPT
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要