Are large language models superhuman chemists?
arxiv(2024)
摘要
Large language models (LLMs) have gained widespread interest due to their
ability to process human language and perform tasks on which they have not been
explicitly trained. This is relevant for the chemical sciences, which face the
problem of small and diverse datasets that are frequently in the form of text.
LLMs have shown promise in addressing these issues and are increasingly being
harnessed to predict chemical properties, optimize reactions, and even design
and conduct experiments autonomously. However, we still have only a very
limited systematic understanding of the chemical reasoning capabilities of
LLMs, which would be required to improve models and mitigate potential harms.
Here, we introduce "ChemBench," an automated framework designed to rigorously
evaluate the chemical knowledge and reasoning abilities of state-of-the-art
LLMs against the expertise of human chemists. We curated more than 7,000
question-answer pairs for a wide array of subfields of the chemical sciences,
evaluated leading open and closed-source LLMs, and found that the best models
outperformed the best human chemists in our study on average. The models,
however, struggle with some chemical reasoning tasks that are easy for human
experts and provide overconfident, misleading predictions, such as about
chemicals' safety profiles. These findings underscore the dual reality that,
although LLMs demonstrate remarkable proficiency in chemical tasks, further
research is critical to enhancing their safety and utility in chemical
sciences. Our findings also indicate a need for adaptations to chemistry
curricula and highlight the importance of continuing to develop evaluation
frameworks to improve safe and useful LLMs.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要