CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation
CoRR (2023)

Abstract
Large Language Models (LLMs) have demonstrated remarkable performance on
coding related tasks, particularly on assisting humans in programming and
facilitating programming automation. However, existing benchmarks for
evaluating the code understanding and generation capacities of LLMs suffer from
severe limitations. First, most benchmarks focus on a narrow range of popular
programming languages and specific tasks, whereas real-world software
development often requires multilingual programming environments to satisfy
diverse requirements. Practical development likewise demands multi-task
settings that test the coding capabilities of LLMs comprehensively and
robustly. Second, most
benchmarks also fail to consider the actual executability and the consistency
of execution results of the generated code. To bridge these gaps between
existing benchmarks and expectations from practical applications, we introduce
CodeScope, an execution-based, multilingual, multi-task, multi-dimensional
evaluation benchmark for comprehensively gauging LLM capabilities on coding
tasks. CodeScope covers 43 programming languages and 8 coding tasks. It
evaluates the coding performance of LLMs from three dimensions (perspectives):
difficulty, efficiency, and length. To facilitate execution-based evaluations
of code generation, we develop MultiCodeEngine, an automated code execution
engine that supports 14 programming languages. Finally, we systematically
evaluate and analyze 8 mainstream LLMs on CodeScope tasks and demonstrate the
superior breadth and challenges of CodeScope for evaluating LLMs on code
understanding and generation tasks compared to other benchmarks. The CodeScope
benchmark and datasets are publicly available at
https://github.com/WeixiangYAN/CodeScope.
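The abstract's core idea of execution-based evaluation can be illustrated with a minimal sketch: run each generated program in a sandboxed subprocess and mark it correct only if its output matches the expected result on every test case. The function names and the exact-match criterion below are illustrative assumptions, not the actual MultiCodeEngine API.

```python
import subprocess
import sys
import tempfile
import os

def execute_candidate(code: str, stdin_input: str, timeout: float = 5.0) -> str:
    """Run a candidate Python solution in a subprocess and capture its stdout.

    Hypothetical helper: real engines like MultiCodeEngine support many
    languages and stronger sandboxing; this sketch handles Python only.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            input=stdin_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return result.stdout.strip()
    finally:
        os.unlink(path)

def passes_all_tests(code: str, tests: list[tuple[str, str]]) -> bool:
    """Execution-based check: pass only if every (input, expected_output)
    pair matches exactly; crashes and timeouts count as failures."""
    for stdin_input, expected in tests:
        try:
            if execute_candidate(code, stdin_input) != expected.strip():
                return False
        except subprocess.TimeoutExpired:
            return False
    return True

# Example: an LLM-generated candidate checked against two test cases.
candidate = "print(sum(int(x) for x in input().split()))"
tests = [("1 2 3", "6"), ("10 20", "30")]
print(passes_all_tests(candidate, tests))  # → True
```

This exact-output criterion is what distinguishes execution-based benchmarks from similarity-based metrics such as BLEU: the generated code must actually run and behave correctly, not merely resemble a reference solution.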