TrustSQL: A Reliability Benchmark for Text-to-SQL Models with Diverse Unanswerable Questions
CoRR (2024)
Abstract
Recent advances in large language models (LLMs) have led to significant
improvements in translating natural language questions into SQL queries. While
achieving high accuracy in SQL generation is crucial, little is known about the
extent to which these text-to-SQL models can reliably handle diverse types of
questions encountered during real-world deployment, including unanswerable
ones. To explore this aspect, we present TrustSQL, a new benchmark designed to
assess the reliability of text-to-SQL models in both single-database and
cross-database settings. The benchmark tasks models with providing one of two
outcomes: 1) SQL prediction; or 2) abstention from making a prediction, either
when there is a potential error in the generated SQL or when faced with
unanswerable questions. For model evaluation, we explore various modeling
approaches specifically designed for this task. These include: 1) optimizing
separate models for answerability detection, SQL generation, and error
detection, which are then integrated into a single pipeline; and 2) developing
a unified approach that optimizes a single model to address the proposed task.
Experimental results using our new reliability score show that addressing this
challenge involves many different areas of research and opens new avenues for
model development. Nonetheless, none of the methods surpass the reliability
performance of the naive baseline, which abstains from answering all questions.
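To make the predict-or-abstain setup concrete, below is a minimal sketch in Python of the pipeline approach and a reliability-style score as described in the abstract. The component models (`is_answerable`, `generate_sql`, `looks_wrong`), the exact-match correctness check, the penalty weight, and the scoring rules are illustrative assumptions, not the benchmark's official definitions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

ABSTAIN = None  # sentinel value: the model declines to emit SQL


@dataclass
class Example:
    question: str
    gold_sql: Optional[str]  # None marks an unanswerable question


def sql_matches(pred_sql: str, gold_sql: str) -> bool:
    # Stand-in for a real correctness check (e.g., execution accuracy
    # against the database); exact string matching is a crude proxy.
    return pred_sql.strip().lower() == gold_sql.strip().lower()


def reliability_score(examples: List[Example],
                      predictions: List[Optional[str]],
                      penalty: float = 1.0) -> float:
    """Hypothetical reliability score: +1 for correct SQL on an
    answerable question and for abstaining on an unanswerable one;
    0 for abstaining on an answerable question; -penalty otherwise.
    The weights TrustSQL actually uses may differ -- this only
    encodes the general reward/penalty structure."""
    total = 0.0
    for ex, pred in zip(examples, predictions):
        answerable = ex.gold_sql is not None
        if pred is ABSTAIN:
            total += 1.0 if not answerable else 0.0
        elif answerable and sql_matches(pred, ex.gold_sql):
            total += 1.0
        else:
            total -= penalty  # wrong SQL, or SQL for an unanswerable question
    return total / len(examples)


def pipeline_predict(question: str, schema: str,
                     is_answerable: Callable[[str, str], bool],
                     generate_sql: Callable[[str, str], str],
                     looks_wrong: Callable[[str, str, str], bool]) -> Optional[str]:
    """Sketch of the pipeline approach: abstain when the answerability
    detector rejects the question or the error detector flags the
    generated SQL; all three components are hypothetical callables."""
    if not is_answerable(question, schema):
        return ABSTAIN
    sql = generate_sql(question, schema)
    return ABSTAIN if looks_wrong(question, schema, sql) else sql


if __name__ == "__main__":
    data = [
        Example("How many users signed up in 2023?",
                "SELECT COUNT(*) FROM users WHERE year = 2023"),
        Example("What is the CEO's favorite color?", None),  # unanswerable
    ]
    # The naive baseline from the abstract: abstain on every question.
    abstain_all = [ABSTAIN] * len(data)
    print(reliability_score(data, abstain_all))  # 0.5
```

Under this illustrative scheme, the abstain-all baseline scores the fraction of unanswerable questions in the evaluation set, since each abstention earns +1 on unanswerable questions and 0 on answerable ones; a model only beats it if the reward from correct SQL outweighs the penalties it incurs from wrong or over-confident predictions.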