ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use.
Computing Research Repository (CoRR), 2025
Abstract
Effective evaluation of multi-hop tool use is critical for analyzing the understanding, reasoning, and function-calling capabilities of large language models (LLMs). However, progress has been hindered by a lack of reliable evaluation datasets. To address this, we present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools, specifically designed for rigorous evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers through a novel query-driven data construction approach that includes tool creation, document refinement, and code generation. We evaluate 14 LLMs across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT), uncovering significant challenges in handling multi-hop tool-use scenarios. The leading model, GPT-4o, achieves an accuracy of 49.04%, underscoring substantial room for improvement. Further analysis reveals variations in tool-use strategies across model families, offering actionable insights to guide the development of more effective approaches. Code and data are available at https://huggingface.co/datasets/bytedance-research/ToolHop.
About the Authors
The authors of this paper are Junjie Ye, Zhengyin Du, Xuesong Yao, Weijian Lin, Yufei Xu, Zehui Chen, Zaiyuan Wang, Sining Zhu, Zhiheng Xi, Siyu Yuan, Tao Gui, Qi Zhang, Xuanjing Huang, and Jiecao Chen. They are affiliated with institutions including the School of Computer Science, the School of Automation, the Institute of Modern Languages, the School of Data Science, and the Natural Language Processing Laboratory, with research interests spanning topic models, language models, sentiment analysis, object detection, domain adaptation, transfer learning, and dependency parsing.
Paper Outline
1. Introduction
- Multi-hop tool use poses significant challenges to the understanding, reasoning, and function calling abilities of large language models (LLMs).
- Existing evaluations are limited by the lack of reliable datasets and of meaningful dependencies among tools.
- This paper introduces ToolHop, a benchmark dataset specifically designed for evaluating the multi-hop tool use capabilities of LLMs.
2. ToolHop
- Task Definition: Given a multi-hop query and a set of tools, the LLM needs to select and call the appropriate tools and ultimately provide an answer.
- Query-Driven Data Construction:
- Tool Creation: Decompose the query into atomic sub-queries and create tool documents for each sub-query.
- Document Refinement: Extend the functionality of the tool documents by increasing the number and types of parameters.
- Code Generation: Generate locally executable code based on the refined tool documents (a minimal sketch of such a tool follows this section).
- Dataset Analysis:
- Diverse Queries: Covering 47 different domains.
- Meaningful Dependencies: Each query requires 3-7 tools, emphasizing the importance of multi-hop reasoning.
- Locally Executable Tools: 3,912 locally deployable and directly executable tools.
- Detailed Feedback: Tool calls return detailed outputs, including error messages for failed invocations.
- Verifiable Answers: Each query has a predefined gold answer, making model outputs easy to validate.
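A minimal sketch of what a ToolHop-style tool pair might look like: a tool document in a function-calling schema plus a locally executable Python implementation whose output feeds the next call. The tool names, parameters, and lookup values below are illustrative assumptions, not actual entries from the dataset.

```python
# Hypothetical ToolHop-style tool pair: schema-style tool documents plus
# locally executable Python implementations. Names and values are invented
# for illustration only.

# Tool documents: what the LLM sees (name, purpose, parameters).
author_lookup_doc = {
    "name": "get_book_author",
    "description": "Return the author of a given book title.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Exact book title."}
        },
        "required": ["title"],
    },
}

birth_year_doc = {
    "name": "get_birth_year",
    "description": "Return the birth year of a person.",
    "parameters": {
        "type": "object",
        "properties": {
            "person": {"type": "string", "description": "Full name of the person."}
        },
        "required": ["person"],
    },
}

# Local implementations backed by small lookup tables, so every call runs
# offline and failed calls raise informative errors (detailed feedback).
_BOOK_AUTHORS = {"Norwegian Wood": "Haruki Murakami"}
_BIRTH_YEARS = {"Haruki Murakami": 1949}

def get_book_author(title: str) -> str:
    if title not in _BOOK_AUTHORS:
        raise ValueError(f"Unknown book title: {title!r}")
    return _BOOK_AUTHORS[title]

def get_birth_year(person: str) -> int:
    if person not in _BIRTH_YEARS:
        raise ValueError(f"Unknown person: {person!r}")
    return _BIRTH_YEARS[person]

# Multi-hop query: "In what year was the author of 'Norwegian Wood' born?"
# The second call depends on the first call's output, which is what makes
# the tool dependency meaningful.
if __name__ == "__main__":
    author = get_book_author("Norwegian Wood")
    print(get_birth_year(author))  # -> 1949
```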
3. Experimental Setup
- Models: Evaluate 14 LLMs from five families, including LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT.
- Implementation Details: GPT-4o assists in data construction, and tool use is exercised through each LLM's function-calling interface (a sketch of such an evaluation loop follows this section).
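As a rough illustration of this setup, the sketch below wires locally executable tools into an OpenAI-style function-calling loop. It assumes the OpenAI Python SDK and reuses the hypothetical tool pair sketched above; it is not the paper's actual evaluation harness.

```python
# Illustrative tool-use loop over locally executable tools, assuming an
# OpenAI-style function-calling interface. Not the paper's actual harness.
import json
from openai import OpenAI

client = OpenAI()

# Local registry of executable tools (the hypothetical pair sketched earlier).
LOCAL_TOOLS = {"get_book_author": get_book_author, "get_birth_year": get_birth_year}
TOOL_DOCS = [
    {"type": "function", "function": author_lookup_doc},
    {"type": "function", "function": birth_year_doc},
]

def run_query(query: str, max_turns: int = 8) -> str:
    """Let the model call tools until it emits a final answer."""
    messages = [{"role": "user", "content": query}]
    for _ in range(max_turns):
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOL_DOCS
        )
        message = response.choices[0].message
        messages.append(message)
        if not message.tool_calls:
            # Final answer; compare it against the query's verifiable gold answer.
            return message.content or ""
        for call in message.tool_calls:
            args = json.loads(call.function.arguments)
            try:
                result = LOCAL_TOOLS[call.function.name](**args)
            except Exception as exc:  # detailed feedback on invocation errors
                result = f"Error: {exc}"
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": str(result)}
            )
    return ""  # no final answer within the turn budget
```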
4. Main Results
- Evaluation Dimensions: Answer correctness and tool-invocation errors (a sketch of these metrics follows this section).
- Evaluation Observations:
- Tool use significantly improves the correctness of LLM answers, but substantial room for improvement remains.
- Different LLM families show varying characteristics in tool use.
- Larger models typically exhibit better tool use capabilities.
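The two reported dimensions reduce to simple ratios; the sketch below shows one way to compute them over per-query evaluation records. The record fields (`predicted_answer`, `gold_answer`, `tool_calls`, `error`) are assumptions chosen for illustration, not the dataset's actual schema.

```python
# Hypothetical scoring helpers for the two evaluation dimensions.
from typing import Iterable

def answer_correctness(records: Iterable[dict]) -> float:
    """Fraction of queries whose final answer matches the verifiable gold answer."""
    records = list(records)
    correct = sum(1 for r in records if r["predicted_answer"] == r["gold_answer"])
    return correct / len(records)

def invocation_error_rate(records: Iterable[dict]) -> float:
    """Fraction of tool calls that failed (wrong name, bad arguments, runtime error)."""
    calls = [c for r in records for c in r["tool_calls"]]
    return sum(1 for c in calls if c["error"]) / len(calls)
```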
5. Further Analysis
- Analyze the differences among different LLM families in multi-hop tool use.
- Propose suggestions for improving the tool use capabilities of LLMs.
6. Related Work
- LLM Tool Use: Reviews the current state of research on LLM tool use and emphasizes the importance of dataset construction.
- Tool Use Evaluation: Discusses existing evaluation methods and points out their limitations.
7. Conclusion
- ToolHop is a benchmark dataset for evaluating the multi-hop tool use capabilities of LLMs.
- The evaluation results of ToolHop show that LLMs still have room for improvement in multi-hop tool use.
- ToolHop provides a foundation for research and improvement of LLM tool use capabilities.
Key Questions
### Q: What specific research methods were used in the paper?
- **Dataset Construction**: The paper proposes ToolHop, a dataset for evaluating large language models (LLMs) in multi-hop tool-use scenarios. It consists of 995 multi-hop queries and 3,912 locally executable tools, generated through a query-driven data construction approach.
- **Experimental Evaluation**: Fourteen LLMs from five model families (LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT) are evaluated on ToolHop, using answer correctness and tool-invocation error rate as metrics.
- **Case Analysis**: The paper analyzes how the different LLM families perform in multi-hop tool-use scenarios and explores the underlying reasons.

### Q: What are the main research findings and achievements?
- **ToolHop Dataset**: The proposed dataset effectively evaluates LLMs in multi-hop tool-use scenarios and has the following characteristics:
  - **Diverse Queries**: Coverage of 47 domains allows tool use to be assessed across varied scenarios.
  - **Meaningful Tool Dependencies**: Dependencies among tools simulate genuine multi-hop reasoning.
  - **Locally Executable Tools**: All tools run locally, providing a realistic evaluation environment.
  - **Detailed Feedback**: Tool calls return detailed feedback that helps LLMs correct errors.
  - **Verifiable Answers**: Predefined answers make it straightforward to score answer correctness.
- **LLM Performance Evaluation**: Even the most advanced LLMs leave considerable room for improvement; for instance, GPT-4o reaches an answer correctness of only 49.04% in the mandatory tool-use setting.
- **LLM Tool-Use Strategies**: Different model families adopt different strategies; for example, Qwen2.5 models tend to invoke tools in parallel, while GPT-family models are better at leveraging tool feedback to improve performance.

### Q: What are the current limitations of this research?
- **No Strategies for Improving Tool Use**: The paper evaluates multi-hop tool use but does not propose concrete methods for enhancing it.
- **Limited Dataset Scale**: ToolHop is relatively small and may not fully represent real-world tool-use scenarios.
- **Narrow Evaluation Metrics**: The evaluation focuses mainly on answer correctness and tool-invocation error rate, which may not comprehensively capture tool-use capability.