SemEval-2024 Shared Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes
arXiv (2024)
Abstract
This paper presents the results of SHROOM, a shared task focused on
detecting hallucinations: outputs from natural language generation (NLG)
systems that are fluent, yet inaccurate. Such cases of overgeneration
jeopardize many NLG applications, where correctness is often mission-critical.
The shared task was conducted with a newly constructed dataset of 4000 model
outputs labeled by 5 annotators each, spanning 3 NLP tasks: machine
translation, paraphrase generation and definition modeling.
The shared task was tackled by a total of 58 different users grouped in 42
teams, out of which 27 elected to write a system description paper;
collectively, they submitted over 300 prediction sets on both tracks of the
shared task. We observe a number of key trends in how this task was tackled:
many participants rely on a handful of models, and often rely either on
synthetic data for fine-tuning or zero-shot prompting strategies. While a
majority of the teams did outperform our proposed baseline system, the
performance of top-scoring systems is still consistent with random handling
of the more challenging items.