Characterization and Prediction of Popular Projects on GitHub

2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), 2019

Abstract
GitHub is a large and popular platform that hosts a wide variety of open source projects. Despite the platform's prevalence, not every project gains high popularity. Identifying popular projects on GitHub can help developers choose suitable projects to follow or contribute to, and can provide guidance on building a popular project. In this paper, we propose an approach to predict the popularity of GitHub projects. We first conduct online surveys with GitHub users to determine the threshold (in number of stars) that separates popular from unpopular projects. Next, we extract 35 features from both GitHub and Stack Overflow, which fall into three dimensions: project, evolutionary, and project owner. We build a random forest classifier on these features to identify popular GitHub projects. To evaluate our approach, we collect a large-scale dataset containing 409,784 GitHub projects and 174,784 GitHub users. Our model achieves an average AUC of 0.76, a statistically significant and substantial improvement over the state of the art. We also study which features matter most in distinguishing popular projects from unpopular ones. Experimental results show that the number of branches, the number of open issues, and the number of contributors play the most important roles in identifying popular projects, and all three have large effect sizes.
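The labeling and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the star threshold of 100, the `>=` comparison, the helper names `label_by_stars` and `auc`, and all data values are hypothetical, and the scores stand in for the output of any classifier (such as a random forest). AUC is computed here directly from its pairwise definition rather than with a library call.

```python
from typing import List, Sequence


def label_by_stars(stars: Sequence[int], threshold: int) -> List[int]:
    """Label a project 1 (popular) if its star count reaches the threshold, else 0.

    The threshold value and the >= comparison are assumptions for illustration.
    """
    return [1 if s >= threshold else 0 for s in stars]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via its pairwise definition.

    AUC equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one,
    with ties counted as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical star counts and classifier scores for six projects.
stars = [5, 120, 40, 900, 2, 300]
labels = label_by_stars(stars, threshold=100)
scores = [0.10, 0.80, 0.65, 0.90, 0.20, 0.55]
result = auc(labels, scores)
print(labels, round(result, 3))
```

The pairwise form is quadratic in the number of examples, so it suits small illustrations; on a dataset of the size reported in the abstract, a rank-based computation would be the more practical choice.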
Keywords
Feature Engineering, GitHub Project, Popularity, Prediction Model