S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models
arXiv (2023)
Abstract
The rapid development of Large Language Models (LLMs) has led to great
strides in model capabilities like long-context understanding and reasoning.
However, as LLMs are able to process longer contexts, it becomes more
challenging to evaluate whether they have acquired certain capabilities, since
the length of text (e.g., 200K tokens) they can process far exceeds what humans
can reliably assess in a reasonable duration. In this paper, we propose using
complex synthetic tasks as a proxy evaluation method, and present S3Eval, a
Synthetic, Scalable, Systematic evaluation suite for LLMs. The
synthetic nature of S3Eval gives users full control over the dataset,
allowing them to systematically probe LLM capabilities by scaling text length
and varying task difficulty across diverse scenarios. The strong correlation
between S3Eval and real-world benchmarks demonstrates the soundness of using
S3Eval for evaluation of LLMs. S3Eval provides a flexible and infinite
long-context data generation method. We have generated a comprehensive dataset
called S3Eval-Standard, and experimental results have shown that it poses
significant challenges for all existing LLMs.
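
The abstract describes generating synthetic tasks whose context length and difficulty are controlled by generation parameters, so that gold answers are known by construction. Below is a minimal sketch of that idea, assuming a simple table-lookup task; the names and parameters (generate_example, num_rows, num_cols) are hypothetical illustrations, not S3Eval's actual interface or task format.

```python
import random
import string

def random_token(rng, length=6):
    """Return a random lowercase string to serve as a cell value."""
    return "".join(rng.choices(string.ascii_lowercase, k=length))

def generate_example(num_rows=50, num_cols=4, seed=None):
    """Generate one synthetic (context, question, answer) triple.

    num_rows / num_cols control context length; harder variants could
    chain multiple lookups or require aggregation over the same table.
    (Hypothetical sketch, not S3Eval's actual generator.)
    """
    rng = random.Random(seed)
    header = [f"col{i}" for i in range(num_cols)]
    rows = [[random_token(rng) for _ in range(num_cols)]
            for _ in range(num_rows)]

    # Serialize the table as plain text; this is the long context.
    context = " | ".join(header) + "\n"
    context += "\n".join(" | ".join(row) for row in rows)

    # Pose a single-cell lookup with a known gold answer. With 26^6
    # possible tokens, key collisions across rows are negligible.
    r = rng.randrange(num_rows)
    target_col = rng.randrange(1, num_cols)
    question = (f"In the row where {header[0]} is '{rows[r][0]}', "
                f"what is the value of {header[target_col]}?")
    return {"context": context,
            "question": question,
            "answer": rows[r][target_col]}

if __name__ == "__main__":
    ex = generate_example(num_rows=8, num_cols=3, seed=0)
    print(ex["question"])
    print("gold:", ex["answer"])
```

Because the answer is produced by construction, model outputs can be checked exactly at any context length, which is what makes this style of evaluation scalable to contexts far beyond what humans can reliably assess.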