ChatterBox: Multi-round Multimodal Referring and Grounding
CoRR (2024)
Abstract
In this study, we establish a baseline for a new task named multimodal
multi-round referring and grounding (MRG), opening up a promising direction for
instance-level multimodal dialogues. We present a new benchmark and an
efficient vision-language model for this purpose. The new benchmark, named
CB-300K, spans challenges including multi-round dialogue, complex spatial
relationships among multiple instances, and consistent reasoning, which are
beyond those shown in existing benchmarks. The proposed model, named
ChatterBox, utilizes a two-branch architecture to collaboratively handle vision
and language tasks. By tokenizing instance regions, the language branch
acquires the ability to perceive referential information. Meanwhile, ChatterBox
feeds a query embedding in the vision branch to a token receiver for visual
grounding. A two-stage optimization strategy is devised, making use of both
CB-300K and auxiliary external data to improve the model's stability and
capacity for instance-level understanding. Experiments show that ChatterBox
outperforms existing models in MRG both quantitatively and qualitatively,
paving a new path towards multimodal dialogue scenarios with complicated and
precise interactions. Code, data, and model are available at:
https://github.com/sunsmarterjie/ChatterBox.
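
To make the idea of "tokenizing instance regions" concrete, the following is a minimal sketch of one common way a bounding box can be serialized into discrete tokens and appended to a text query so that a language model can read referential information. It is an illustration under stated assumptions, not ChatterBox's actual scheme: the `<box>` and `<loc_k>` token format, the 100-bin quantization, and the function names `box_to_tokens` and `build_referring_prompt` are all hypothetical and do not come from the abstract or the repository.

```python
from typing import Tuple

# Assumed quantization resolution; the abstract does not specify one.
NUM_BINS = 100

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def box_to_tokens(box: Box, width: int, height: int, num_bins: int = NUM_BINS) -> str:
    """Serialize a pixel-space box as hypothetical location tokens."""

    def quantize(value: float, size: int) -> int:
        # Clamp the coordinate to the image, then map it to a bin index
        # in the range [0, num_bins - 1].
        value = min(max(value, 0.0), float(size))
        return min(int(value / size * num_bins), num_bins - 1)

    x1, y1, x2, y2 = box
    bins = (
        quantize(x1, width),
        quantize(y1, height),
        quantize(x2, width),
        quantize(y2, height),
    )
    return "<box>" + "".join(f"<loc_{b}>" for b in bins) + "</box>"


def build_referring_prompt(question: str, box: Box, width: int, height: int) -> str:
    """Append the tokenized region to a question (hypothetical prompt format)."""
    return f"{question} {box_to_tokens(box, width, height)}"


if __name__ == "__main__":
    print(build_referring_prompt("What is this object?", (320, 240, 640, 480), 640, 480))
```

In a multi-round setting, each turn's prompt could carry region tokens like these, so that later questions can refer back to instances introduced earlier. How ChatterBox actually encodes regions and grounds answers is documented in the linked repository.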