CodeLL: A Lifelong Learning Dataset to Support the Co-Evolution of Data and Language Models of Code
CoRR (2023)
Abstract
Motivated by recent work on lifelong learning applications for language
models (LMs) of code, we introduce CodeLL, a lifelong learning dataset focused
on code changes. Our contribution addresses a notable research gap marked by
the absence of a long-term temporal dimension in existing code change datasets,
which limits their suitability for lifelong learning scenarios. In contrast, our
dataset aims to comprehensively capture code changes across the entire release
history of open-source software repositories. In this work, we introduce an
initial version of CodeLL, comprising 71 machine-learning-based projects mined
from Software Heritage. This dataset enables the extraction and in-depth
analysis of code changes spanning 2,483 releases at both the method and API
levels. CodeLL enables researchers to study the behaviour of LMs in lifelong
fine-tuning settings for learning code changes. Additionally, the dataset can
help study data distribution shifts within software repositories and the
evolution of API usage over time.