Self-Evaluation Improves Selective Generation in Large Language Models
CoRR (2023)
Abstract
Safe deployment of large language models (LLMs) may benefit from a reliable
method for assessing their generated content to determine when to abstain or to
selectively generate. While likelihood-based metrics such as perplexity are
widely employed, recent research has demonstrated the limitations of using
sequence-level probability estimates given by LLMs as reliable indicators of
generation quality. Conversely, LLMs have demonstrated strong calibration at
the token level, particularly when it comes to choosing correct answers in
multiple-choice questions or evaluating true/false statements. In this work, we
reformulate open-ended generation tasks into token-level prediction tasks, and
leverage LLMs' superior calibration at the token level. We instruct an LLM to
self-evaluate its answers, employing either a multi-way comparison or a
point-wise evaluation approach, optionally adding a ``None of the
above'' choice to express the model's uncertainty explicitly. We benchmark a
range of scoring methods based on self-evaluation and evaluate their
performance in selective generation using TruthfulQA and TL;DR. Through
experiments with PaLM-2 and GPT-3, we demonstrate that self-evaluation based
scores not only improve accuracy, but also correlate better with the overall
quality of generated content.
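
To make the selective-generation idea concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes that the logits an LLM assigns to the option tokens of a multiple-choice self-evaluation prompt are already available as a plain dictionary. The names `softmax` and `select_or_abstain`, the `"NOTA"` label, and the threshold of 0.5 are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Normalize a {label: logit} mapping into probabilities."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def select_or_abstain(candidates, option_logits, threshold=0.5):
    """Pick the best-scoring candidate, or abstain (None) if the score is low.

    candidates: {label: answer text}, e.g. {"A": "...", "B": "..."}.
    option_logits: {label: logit} for the option tokens (assumed input); it
        may hold an extra label such as "NOTA" that is not in `candidates`.
    """
    probs = softmax(option_logits)
    # Only real candidates can be selected; any extra label (e.g. "NOTA")
    # still takes probability mass away from them through the softmax.
    best = max(candidates, key=lambda label: probs[label])
    score = probs[best]
    answer = candidates[best] if score >= threshold else None
    return answer, score


if __name__ == "__main__":
    candidates = {"A": "Paris", "B": "Lyon"}
    # Confident case: most of the mass is on option A.
    print(select_or_abstain(candidates, {"A": 2.0, "B": 0.0}))
    # Uncertain case: a "None of the above" option absorbs most of the mass.
    print(select_or_abstain(candidates, {"A": 0.0, "B": 0.0, "NOTA": 2.0}))
```

In this sketch the score is a token-level probability over a small set of option tokens, not a sequence-level likelihood. A point-wise variant could be built the same way by normalizing over two labels such as "Yes" and "No" for a single candidate answer.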