Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification
CoRR (2024)
Abstract
Making inferences during text comprehension is essential to understanding
meaning in language processing. This work studies the entailment verification
(EV) problem of multi-sentence premises that requires a system to make multiple
inferences implicitly. Studying EV for such complex premises is important
because modern NLP problems, such as detecting inconsistent model-generated
rationales, require complex multi-hop reasoning. However, current textual
inference datasets mostly contain short premises that only partially focus on
these challenges. To address this, we compile an EV benchmark that includes
datasets from three NLP domains (NLI, contextual QA, and rationales) containing
multi-sentence premises. On benchmarking humans and LLMs, we find that LLMs are
better than humans in multi-hop reasoning across extended contexts, while
humans perform better in simple deductive reasoning tasks. We also finetune a
Flan-T5 model for EV using two training objectives to obtain a strong
open-source model that outperforms GPT-3.5 and rivals GPT-4. Finally, we use
this model to filter out inconsistent model-generated rationales in
self-consistency decoding, resulting in a 6% accuracy improvement on average
across three MCQ datasets.
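
The rationale-filtering step in self-consistency decoding can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each sampled output is a (rationale, answer) pair, and that a verifier returns a score for how well the rationale entails the answer. The names `filtered_majority_vote`, `ev_score`, `toy_score`, and the 0.5 threshold are hypothetical and chosen only for illustration. In practice the scorer would be backed by a finetuned EV model.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# One sampled model output: (rationale text, final answer).
Sample = Tuple[str, str]


def filtered_majority_vote(
    samples: List[Sample],
    ev_score: Callable[[str, str], float],  # hypothetical verifier: rationale -> answer
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Optional[str]:
    """Majority vote over answers whose rationale passes the verifier."""
    kept = [ans for rat, ans in samples if ev_score(rat, ans) >= threshold]
    # If the filter rejects every sample, fall back to plain self-consistency.
    pool = kept if kept else [ans for _, ans in samples]
    if not pool:
        return None
    return Counter(pool).most_common(1)[0][0]


def toy_score(rationale: str, answer: str) -> float:
    """Stand-in scorer: 1.0 if the rationale explicitly concludes the answer."""
    return 1.0 if f" {answer}." in rationale else 0.0


samples = [
    ("The metal expands when heated, so B.", "B"),
    ("Heat makes it shrink, so the answer is A.", "C"),  # rationale contradicts answer
    ("Unrelated text.", "C"),
    ("Expansion implies B.", "B"),
    ("No clear link.", "C"),
]

plain = Counter(ans for _, ans in samples).most_common(1)[0][0]
filtered = filtered_majority_vote(samples, toy_score)
print(plain, filtered)
```

In this toy run the unfiltered vote picks the answer backed by inconsistent rationales, while the filtered vote only counts answers whose rationale supports them.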