Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning
arXiv (2024)
Abstract
Large language models (LLMs) have shown great potential in complex reasoning
tasks, yet their performance is often hampered by the scarcity of high-quality,
reasoning-focused training datasets. Addressing this challenge, we propose
Key-Point-Driven Data Synthesis (KPDDS), a novel data synthesis framework that
synthesizes question-answer pairs by leveraging key points and exemplar pairs
from authentic data sources. KPDDS ensures the generation of novel questions
with rigorous quality control and substantial scalability. As a result, we
present KPMath, the most extensive synthetic dataset tailored for mathematical
reasoning to date, comprising over one million question-answer pairs. Utilizing
KPMath and augmenting it with additional reasoning-intensive corpora, we create
the comprehensive KPMath-Plus dataset. Fine-tuning the Mistral-7B model on
KPMath-Plus yields a zero-shot PASS@1 accuracy of 39.3%, a
performance that not only outpaces other finetuned 7B models but also exceeds
that of certain 34B models. Our ablation studies further confirm the
substantial enhancement in mathematical reasoning across various subtopics,
marking a significant stride in LLMs' reasoning capabilities.