StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
arXiv (2024)
Abstract
Large Language Models (LLMs) have witnessed remarkable advancements in recent
years, prompting the exploration of tool learning, which integrates LLMs with
external tools to address diverse real-world challenges. Assessing the
capability of LLMs to utilise tools necessitates large-scale and stable
benchmarks. However, previous works relied on either hand-crafted online tools
with limited scale, or large-scale real online APIs suffering from instability
of API status. To address this problem, we introduce StableToolBench, a
benchmark that evolves from ToolBench and proposes a virtual API server and a
stable evaluation system. The virtual API server contains a caching system and
API simulators, which complement each other to mitigate changes in API status.
Meanwhile, the stable evaluation system defines solvable pass and win rates,
using GPT-4 as the automatic evaluator, to eliminate randomness during
evaluation. Experimental results demonstrate the stability of StableToolBench,
and we further discuss the effectiveness of the API simulators, the caching
system, and the evaluator system.
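
To illustrate how a caching system and an API simulator might complement each other, here is a minimal Python sketch. It assumes a lookup order of cache first, then the real API, then the simulator as a fallback; the abstract does not specify this order. All names here (`VirtualAPIServer`, `call_real_api`, `simulate`, `flaky_api`, `fake_simulator`) are hypothetical and are not taken from the StableToolBench codebase.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical type: a backend takes an API name and its arguments.
Backend = Callable[[str, Dict[str, Any]], Any]


class VirtualAPIServer:
    """Illustrative sketch only: cache, then real API, then simulator."""

    def __init__(self, call_real_api: Backend, simulate: Backend) -> None:
        self.cache: Dict[str, Any] = {}
        self.call_real_api = call_real_api  # assumed to raise on failure
        self.simulate = simulate            # assumed stand-in for an LLM simulator

    @staticmethod
    def _key(api_name: str, args: Dict[str, Any]) -> str:
        # Sort keys so that argument order does not change the cache key.
        return api_name + ":" + json.dumps(args, sort_keys=True)

    def call(self, api_name: str, args: Dict[str, Any]) -> Any:
        key = self._key(api_name, args)
        if key in self.cache:
            # Cache hit: the same request always yields the same response.
            return self.cache[key]
        try:
            response = self.call_real_api(api_name, args)
        except Exception:
            # Real API is unavailable: fall back to the simulator.
            response = self.simulate(api_name, args)
        self.cache[key] = response
        return response


def flaky_api(api_name: str, args: Dict[str, Any]) -> Any:
    # Hypothetical real API that is currently down.
    raise ConnectionError("API unavailable")


def fake_simulator(api_name: str, args: Dict[str, Any]) -> Any:
    # Hypothetical simulator returning a canned response.
    return f"simulated response for {api_name}"


server = VirtualAPIServer(flaky_api, fake_simulator)
first = server.call("weather", {"city": "Paris"})
second = server.call("weather", {"city": "Paris"})  # served from the cache
```

In this sketch the second call never reaches the API or the simulator, which is one way repeated evaluations could see identical responses even when the underlying API status changes.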