A study on identifying code author from real development.

ACM SIGSOFT Conference on the Foundations of Software Engineering (FSE)(2022)

引用 1|浏览4
暂无评分
摘要
Identifying code authors is important in many research topics, and various approaches have been proposed. Although these approaches achieve promising results on their datasets, their true effectiveness is still in question. To the best of our knowledge, only one large-scale study was conducted to explore the impacts of related factors ( e.g. , the temporal effect and the distribution of files per author). This study selected Google Code Jam programs as their subjects, but such programs are quite different from the source files that programmers write in daily development. To understand their effectiveness and challenges, we replicate their study and use their approach to analyze source files that are retrieved from real projects. The prior study claims that the temporal effect and the distribution of files per author have only minor impacts on their trained models. In the contrast, we find that in 85.48% pairs of training and testing sets, the accuracy of a trained model is less effective when the temporal effect is considered, and in total, the average accuracy decreases by 0.4298. In addition, when we use the real distribution of files as inputs, their approach can accurately identify only one or two core code authors, although a project can have more than ten authors. By revealing the limitations of the prior approach, our study sheds lights on where to make future improvements.
更多
查看译文
关键词
code author,development
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要