Unveiling Memorization in Code Models
Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (2023)
Abstract
The availability of large-scale datasets, advanced architectures, and
powerful computational resources has led to effective code models that
automate diverse software engineering activities. The datasets usually consist
of billions of lines of code from both open-source and private repositories. A
code model can memorize such code and reproduce it verbatim, and the reproduced
code may contain vulnerabilities, sensitive information, or strictly licensed
code, leading to potential security and privacy issues. This paper investigates an
important problem: to what extent do code models memorize their training data?
We conduct an empirical study to explore memorization in large pre-trained code
models. Our study highlights that simply extracting 20,000 outputs (each having
512 tokens) from a code model can produce over 40,125 code snippets that are
memorized from the training data. To provide a better understanding, we build a
taxonomy of memorized contents with 3 categories and 14 subcategories. The
results show that the prompts sent to the code models affect the distribution
of memorized contents. We identify several key factors of memorization.
Specifically, given the same architecture, larger models suffer more from
memorization problems. A code model produces more memorization when it is
allowed to generate longer outputs. We also find a strong positive correlation
between the number of an output's occurrences in the training data and that in
the generated outputs, which indicates that a potential way to reduce
memorization is to remove duplicates in the training data. We then identify
effective metrics that accurately infer whether an output contains memorized
content. We also make suggestions for dealing with memorization.
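
The two measurements the abstract relies on, verbatim overlap between generated outputs and training data, and the correlation between occurrence counts on each side, can be illustrated with a minimal sketch. This is not the paper's pipeline: the function names, the line-window matching, and the window size of 3 lines are all assumptions chosen for illustration.

```python
from collections import Counter


def line_windows(text, size=3):
    """Yield every run of `size` consecutive non-blank lines, stripped of indentation."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i in range(len(lines) - size + 1):
        yield "\n".join(lines[i:i + size])


def count_windows(documents, size=3):
    """Count how often each line window occurs across a collection of documents."""
    counts = Counter()
    for doc in documents:
        counts.update(line_windows(doc, size))
    return counts


def memorized_snippets(training_docs, generated_outputs, size=3):
    """Map each window found on both sides to (training count, generated count).

    Exact matching after stripping whitespace is an assumption made here;
    it is a simple stand-in for whatever matching a real study would use.
    """
    train = count_windows(training_docs, size)
    gen = count_windows(generated_outputs, size)
    return {w: (train[w], gen[w]) for w in gen if w in train}


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5
```

Given the pairs returned by `memorized_snippets`, passing the training counts and generated counts to `pearson` gives a rough indication of whether snippets duplicated more often in the training data also appear more often in the outputs, which is the relationship that motivates deduplicating the training data.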