How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments
arXiv (2024)
Abstract
Decision-making, a complicated task requiring various types of abilities,
presents an excellent framework for assessing Large Language Models (LLMs). Our
research investigates LLMs' decision-making capabilities through the lens of a
well-established field, Game Theory. We focus specifically on games that
support the participation of more than two agents simultaneously. We then
introduce GAMA-Bench, a framework comprising eight classical multi-agent
games. We design a scoring scheme to assess a model's performance in these
games quantitatively. Through GAMA-Bench, we investigate LLMs' robustness,
generalizability, and enhancement strategies. Results reveal that while GPT-3.5
demonstrates satisfactory robustness, its generalizability is relatively limited.
However, its performance can be improved through approaches such as
Chain-of-Thought. Additionally, we conduct evaluations across various LLMs and
find that GPT-4 outperforms other models on GAMA-Bench, achieving a score of
72.5. Moreover, the increasingly higher scores across the three iterations of
GPT-3.5 (0613, 1106, 0125) demonstrate marked advancements in the model's
intelligence with each update. The code and experimental results are made
publicly available via https://github.com/CUHK-ARISE/GAMABench.