Changeset-Based Topic Modeling of Software Repositories

IEEE Transactions on Software Engineering(2020)

引用 18|浏览42
暂无评分
摘要
The standard approach to applying text retrieval models to code repositories is to train models on documents representing program elements. However, code changes lead to model obsolescence and to the need to retrain the model from the latest snapshot. To address this, we previously introduced an approach that trains a model on documents representing changesets from a repository and demonstrated its feasibility for feature location. In this paper, we expand our work by investigating: a second task (developer identification), the effects of including different changeset parts in the model, the repository characteristics that affect the accuracy of our approach, and the effects of the time invariance assumption on evaluation results. Our results demonstrate that our approach is as accurate as the standard approach for projects with most changes localized to a subset of the code, but less accurate when changes are highly distributed throughout the code. Moreover, our results demonstrate that context and messages are key to the accuracy of changeset-based models and that the time invariance assumption has a statistically significant effect on evaluation results, providing overly-optimistic results. Our findings indicate that our approach is a suitable alternative to the standard approach, providing comparable accuracy while eliminating retraining costs.
更多
查看译文
关键词
Task analysis,Standards,Feature extraction,Resource management,Software maintenance,Maintenance engineering
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要