Calibrating Long-form Generations from Large Language Models
CoRR (2024)
Abstract
To enhance Large Language Models' (LLMs) reliability, calibration is
essential – the model's assessed confidence scores should align with the
actual likelihood of its responses being correct. However, current confidence
elicitation methods and calibration metrics typically rely on a binary
true/false assessment of response correctness. This approach does not apply to
long-form generation, where an answer can be partially correct. Addressing this
gap, we introduce a unified calibration framework, in which both the
correctness of the LLMs' responses and their associated confidence levels are
treated as distributions across a range of scores. Within this framework, we
develop three metrics to precisely evaluate LLM calibration and further propose
two confidence elicitation methods based on self-consistency and
self-evaluation. Our experiments, which include long-form QA and summarization
tasks, demonstrate that larger models do not necessarily guarantee better
calibration, that calibration performance is metric-dependent, and
that self-consistency methods excel on factoid datasets. We also find that
calibration can be enhanced through techniques such as fine-tuning, integrating
relevant source documents, scaling the temperature, and combining
self-consistency with self-evaluation. Lastly, we showcase a practical
application of our system: selecting and cascading open-source models and
ChatGPT to optimize correctness given a limited API budget. This research not
only challenges existing notions of LLM calibration but also offers practical
methodologies for improving trustworthiness in long-form generation.
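
The cascading application mentioned above can be pictured with a minimal sketch. It is not the paper's procedure, which the abstract does not spell out. It assumes each query already has a scalar confidence for the open-source model's answer and a known per-query API cost, and it uses a simple greedy rule (escalate the least-confident queries first). The names `Query`, `local_confidence`, `api_cost`, and `select_for_escalation` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Query:
    qid: str
    local_confidence: float  # assumed: confidence in [0, 1] for the open-source answer
    api_cost: float          # assumed: cost of sending this query to the API model


def select_for_escalation(queries: List[Query], budget: float) -> List[str]:
    """Greedily pick the least-confident queries to send to the API model.

    Illustrative heuristic only: queries are visited in order of increasing
    confidence, and each one is escalated if its cost still fits the budget.
    """
    escalated = []
    remaining = budget
    for q in sorted(queries, key=lambda q: q.local_confidence):
        if q.api_cost <= remaining:
            escalated.append(q.qid)
            remaining -= q.api_cost
    return escalated
```

Under these assumptions, a well-calibrated confidence score matters because the ordering decides which answers are replaced: if confidence tracks correctness poorly, the budget is spent on queries the open-source model already answered well.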