∞Bench: Extending Long Context Evaluation Beyond 100K Tokens
CoRR (2024)
Abstract
Processing and reasoning over long contexts is crucial for many practical
applications of Large Language Models (LLMs), such as document comprehension
and agent construction. Despite recent strides in making LLMs process contexts
with more than 100K tokens, there is currently a lack of a standardized
benchmark to evaluate this long-context capability. Existing public benchmarks
typically focus on contexts around 10K tokens, limiting the assessment and
comparison of LLMs in processing longer contexts. In this paper, we propose
∞Bench, the first LLM benchmark featuring an average data length
surpassing 100K tokens. ∞Bench comprises synthetic and realistic tasks
spanning diverse domains, presented in both English and Chinese. The tasks in
∞Bench are designed to require a thorough understanding of long-range
dependencies in the context, so that simply retrieving a limited number of
passages is not sufficient to solve them. In our experiments, we use
∞Bench to evaluate state-of-the-art proprietary and open-source
LLMs tailored for processing long contexts. The results indicate that existing
long-context LLMs still require significant advancements to effectively process
contexts of 100K+ tokens. We further present three intriguing analyses regarding the
behavior of LLMs processing long contexts.