Chain-of-Look Prompting for Verb-centric Surgical Triplet Recognition in Endoscopic Videos

MM '23: Proceedings of the 31st ACM International Conference on Multimedia(2023)

引用 0|浏览7
暂无评分
摘要
Surgical triplet recognition aims to recognize surgical activities as triplets (i.e.,), which provides fine-grained information essential for surgical scene understanding. Existing methods for surgical triplet recognition rely on compositional methods that recognize the instrument, verb, and target simultaneously. In contrast, our method, called chain-of-look prompting, casts the problem of surgical triplet recognition as visual prompt generation from large-scale vision-language (VL) models, and explicitly decomposes the task into a series of video reasoning processes. Chain-of-Look prompting is inspired by: (1) the chain-of-thought prompting in natural language processing, which divides a problem into a sequence of intermediate reasoning steps; (2) the inter-dependency between motion and visual appearance in the human vision system. Since surgical activities are conveyed by the actions of physicians, we regard the verbs as the carrier of semantics in surgical endoscopic videos. Additionally, we utilize the BioMed large language model to calibrate the generated visual prompt features for surgical scenarios. Our approach captures the visual reasoning processes underlying surgical activities and achieves better performance compared to the state-of-the-art methods on the largest surgical triplet recognition dataset, CholecT50. The code is available at https://github.com/southnx/CoLSurgical.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要