LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K
CoRR(2024)
摘要
State-of-the-art large language models (LLMs) are now claiming remarkable
supported context lengths of 256k or even more. In contrast, the average
context lengths of mainstream benchmarks are insufficient (5k-21k), and they
suffer from potential knowledge leakage and inaccurate metrics, resulting in
biased evaluation. This paper introduces LV-Eval, a challenging long-context
benchmark with five length levels (16k, 32k, 64k, 128k, and 256k) reaching up
to 256k words. LV-Eval features two main tasks, single-hop QA and multi-hop QA,
comprising 11 bilingual datasets. The design of LV-Eval has incorporated three
key techniques, namely confusing facts insertion, keyword and phrase
replacement, and keyword-recall-based metric design. The advantages of LV-Eval
include controllable evaluation across different context lengths, challenging
test instances with confusing facts, mitigated knowledge leakage, and more
objective evaluations. We evaluate 10 LLMs on LV-Eval and conduct ablation
studies on the techniques used in LV-Eval construction. The results reveal
that: (i) Commercial LLMs generally outperform open-source LLMs when evaluated
within length levels shorter than their claimed context length. However, their
overall performance is surpassed by open-source LLMs with longer context
lengths. (ii) Extremely long-context LLMs, such as Yi-6B-200k, exhibit a
relatively gentle degradation of performance, but their absolute performances
may not necessarily be higher than those of LLMs with shorter context lengths.
(iii) LLMs' performances can significantly degrade in the presence of confusing
information, especially in the pressure test of "needle in a haystack". (iv)
Issues related to knowledge leakage and inaccurate metrics introduce bias in
evaluation, and these concerns are alleviated in LV-Eval. All datasets and
evaluation codes are released at: https://github.com/infinigence/LVEval.
更多查看译文
AI 理解论文
溯源树
样例
![](https://originalfileserver.aminer.cn/sys/aminer/pubs/mrt_preview.jpeg)
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要