CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences
arXiv (2024)
Abstract
Evaluating the alignment of large language models (LLMs) with user-defined
coding preferences is a challenging endeavour that requires assessing the
intricate textual outputs of LLMs. Because they rely on automated metrics and
static analysis tools, existing benchmarks fail to assess nuances in user
instructions and LLM
outputs, highlighting the need for large-scale datasets and benchmarks for LLM
preference alignment. In this paper, we introduce CodeUltraFeedback, a
preference dataset of 10,000 complex instructions to tune and align LLMs to
coding preferences through AI feedback. We generate responses to the
instructions using a pool of 14 diverse LLMs, which we then annotate according
to their alignment with five coding preferences using the LLM-as-a-Judge
approach with GPT-3.5, producing both numerical and textual feedback. We also
present CODAL-Bench, a benchmark for assessing LLM alignment with these coding
preferences. Our results show that CodeLlama-7B-Instruct, aligned through
reinforcement learning from AI feedback (RLAIF) with direct preference
optimization (DPO) using CodeUltraFeedback's AI feedback data, outperforms 34B
LLMs on CODAL-Bench, validating the utility of CodeUltraFeedback for preference
tuning. Furthermore, we show our DPO-aligned CodeLlama model improves
functional correctness on HumanEval+ compared to the unaligned base model.
Therefore, our contributions bridge the gap in preference tuning of LLMs for
code and set the stage for further advancements in model alignment and RLAIF
for code intelligence. Our code and data are available at
https://github.com/martin-wey/CodeUltraFeedback.
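The abstract describes annotating candidate responses with an LLM-as-a-Judge (GPT-3.5), producing both a numerical score and a textual rationale per coding preference. Below is a minimal sketch of one such annotation call, assuming the openai>=1.0 Python client; the rubric wording, the preference label, and the 1-10 scale are illustrative assumptions, not the paper's actual judge prompt.

```python
# Sketch of an LLM-as-a-Judge annotation call (assumes openai>=1.0).
# The prompt template below is an illustrative stand-in for the paper's rubric.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_TEMPLATE = """\
Rate the following response to a coding instruction on the preference
dimension "{preference}" (e.g., coding style or readability).
Give a score from 1 to 10, then a short textual rationale.

Instruction:
{instruction}

Response:
{response}
"""

def judge(instruction: str, response: str, preference: str) -> str:
    """Return the judge's raw feedback: a numerical score plus rationale."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep scoring as deterministic as the API allows
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(
                preference=preference,
                instruction=instruction,
                response=response,
            ),
        }],
    )
    return completion.choices[0].message.content
```

Running this over all 14 LLMs' responses to an instruction yields the per-preference scores from which chosen/rejected pairs can be derived.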
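The abstract also reports aligning CodeLlama-7B-Instruct through RLAIF with direct preference optimization (DPO) on the AI feedback data. The sketch below shows that tuning step with Hugging Face trl's DPOTrainer, assuming the feedback has already been binarized into prompt/chosen/rejected pairs; the dataset path and hyperparameters are placeholders, and trl keyword names vary across library versions.

```python
# Sketch of DPO preference tuning with Hugging Face trl.
# Assumes a preference dataset with "prompt"/"chosen"/"rejected" columns
# derived from CodeUltraFeedback's AI feedback; the path is hypothetical.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "codellama/CodeLlama-7b-Instruct-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Hypothetical binarized split: pairs where "chosen" outscored "rejected"
# under the LLM judge.
train_dataset = load_dataset("path/to/CodeUltraFeedback-binarized", split="train")

config = DPOConfig(
    output_dir="codellama-7b-dpo",
    beta=0.1,                      # strength of the KL penalty to the reference model
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                # trl snapshots the initial policy as the reference
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,    # named `tokenizer` in older trl releases
)
trainer.train()
```

DPO sidesteps training an explicit reward model: the frozen reference copy of the policy supplies the baseline log-probabilities against which preferred responses are pushed up and rejected ones pushed down.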