DebugBench: Evaluating Debugging Capability of Large Language Models
CoRR(2024)
摘要
Large Language Models (LLMs) have demonstrated exceptional coding capability.
However, as another critical component of programming proficiency, the
debugging capability of LLMs remains relatively unexplored. Previous
evaluations of LLMs' debugging ability are significantly limited by the risk of
data leakage, the scale of the dataset, and the variety of tested bugs. To
overcome these deficiencies, we introduce `DebugBench', an LLM debugging
benchmark consisting of 4,253 instances. It covers four major bug categories
and 18 minor types in C++, Java, and Python. To construct DebugBench, we
collect code snippets from the LeetCode community, implant bugs into source
data with GPT-4, and assure rigorous quality checks. We evaluate two commercial
and three open-source models in a zero-shot scenario. We find that (1) while
closed-source models like GPT-4 exhibit inferior debugging performance compared
to humans, open-source models such as Code Llama fail to attain any pass rate
scores; (2) the complexity of debugging notably fluctuates depending on the bug
category; (3) incorporating runtime feedback has a clear impact on debugging
performance which is not always helpful. As an extension, we also compare LLM
debugging and code generation, revealing a strong correlation between them for
closed-source models. These findings will benefit the development of LLMs in
debugging.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要