Towards a change taxonomy for machine learning pipelines

Empirical Software Engineering (2023)

Abstract
Machine Learning (ML) academic publications commonly provide open-source implementations on GitHub, allowing their audience to replicate, validate, or even extend the ML algorithms, data sets, and metadata. However, little is known so far about the degree of collaboration activity on such ML research repositories, in particular regarding (1) the degree to which such repositories receive contributions from forks, (2) the nature of such contributions (i.e., the types of changes), and (3) the nature of changes that are not contributed back by forks, which might represent missed opportunities. In this paper, we empirically study contributions to 1,346 ML research repositories and their 67,369 forks, both quantitatively and qualitatively, by building on Hindle et al.’s seminal taxonomy of code changes. We found that while ML research repositories are heavily forked, only 9% of the forks made modifications to the forked repository. Of these modifying forks, 42% sent changes to the parent repositories, about half of which (52%) were accepted by the parent repositories. Our qualitative analysis of 539 contributed and 378 local (fork-only) changes extends Hindle et al.’s taxonomy with two new top-level change categories related to ML (Data and Dependency Management) and 16 new sub-categories, including nine ML-specific ones (input data, parameter tuning, pre-processing, training infrastructure, model structure, pipeline performance, sharing, validation infrastructure, and output data). While the changes that are not contributed back by the forks mostly concern domain-specific features and local experimentation (e.g., parameter tuning), the origin repositories do miss out on a non-trivial 15.4% of Documentation changes, 13.6% of Feature changes, and 11.4% of Bug fix changes.
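To make the fork funnel reported above easier to follow, the percentages can be chained into rough absolute counts. The sketch below is only an illustration: the function name `fork_funnel` and the constant names are hypothetical, the inputs are the rounded figures from the abstract, and it assumes the 52% acceptance rate applies to the forks that sent changes. The resulting counts are therefore approximations, not figures reported by the study.

```python
# Rough funnel implied by the abstract's rounded percentages.
# All counts derived here are approximations, not values from the paper.
TOTAL_FORKS = 67_369      # number of forks studied (from the abstract)
MODIFIED_RATE = 0.09      # share of forks that modified the forked repository
CONTRIBUTED_RATE = 0.42   # share of modifying forks that sent changes upstream
ACCEPTED_RATE = 0.52      # assumption: applied to the contributing forks


def fork_funnel(total, modified_rate, contributed_rate, accepted_rate):
    """Return approximate counts for each stage of the fork funnel."""
    modified = total * modified_rate
    contributed = modified * contributed_rate
    accepted = contributed * accepted_rate
    return {
        "modified": round(modified),
        "contributed": round(contributed),
        "accepted": round(accepted),
    }


if __name__ == "__main__":
    stages = fork_funnel(TOTAL_FORKS, MODIFIED_RATE, CONTRIBUTED_RATE, ACCEPTED_RATE)
    for stage, count in stages.items():
        # Express each stage as a share of all forks for comparison.
        print(f"{stage}: ~{count} forks ({count / TOTAL_FORKS:.1%} of all forks)")
```

Under these assumptions, only a few percent of all forks end up with changes accepted by the parent repository, which is the gap the study's third research question examines.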
Keywords
Machine learning, Change taxonomy, GitHub collaborations, Contribution management