DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models
CoRR(2024)
摘要
Evaluating the quality and variability of text generated by Large Language
Models (LLMs) poses a significant, yet unresolved research challenge.
Traditional evaluation methods, such as ROUGE and BERTScore, which measure
token similarity, often fail to capture the holistic semantic equivalence. This
results in a low correlation with human judgments and intuition, which is
especially problematic in high-stakes applications like healthcare and finance
where reliability, safety, and robust decision-making are highly critical. This
work proposes DCR, an automated framework for evaluating and improving the
consistency of LLM-generated texts using a divide-conquer-reasoning approach.
Unlike existing LLM-based evaluators that operate at the paragraph level, our
method employs a divide-and-conquer evaluator (DCE) that breaks down the
paragraph-to-paragraph comparison between two generated responses into
individual sentence-to-paragraph comparisons, each evaluated based on
predefined criteria. To facilitate this approach, we introduce an automatic
metric converter (AMC) that translates the output from DCE into an
interpretable numeric score. Beyond the consistency evaluation, we further
present a reason-assisted improver (RAI) that leverages the analytical reasons
with explanations identified by DCE to generate new responses aimed at reducing
these inconsistencies. Through comprehensive and systematic empirical analysis,
we show that our approach outperforms state-of-the-art methods by a large
margin (e.g., +19.3
consistency of LLM generation across multiple benchmarks in semantic, factual,
and summarization consistency tasks. Our approach also substantially reduces
nearly 90
hallucination mitigation.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要