SIMMC-VR: A Task-oriented Multimodal Dialog Dataset with Situated and Immersive VR Streams


引用 0|浏览42
Building an AI assistant that can seamlessly converse and instruct humans, in a user-centric situated scenario, requires several essential abilities:(1) spatial and temporal understanding of the situated and real-time user scenes,(2) capability of grounding the actively perceived visuals of users to conversation contexts,and (3) conversational reasoning over past utterances to perform just-in-time assistance.However, we currently lack a large-scale benchmark that captures user–assistant interactions with all of the aforementioned features.To this end, we propose SIMMC-VR, an extension of the SIMMC-2.0 dataset, to a video-grounded task-oriented dialog dataset that captures real-world AI-assisted user scenarios in VR.We propose a novel data collection paradigm that involves(1) generating object-centric multimodal dialog flows with egocentric visual streams and visually-grounded templates,and (2) manually paraphrasing the simulated dialogs for naturalness and diversity while preserving multimodal dependencies. To measure meaningful progress in the field, we propose four tasks to address the new challenges in SIMMC-VR, which require complex spatial-temporal dialog reasoning in active egocentric scenes.We benchmark the proposed tasks with strong multimodal models, and highlight the key capabilities that current models lack for future research directions.
AI 理解论文