One-Stage Visual Grounding via Semantic-Aware Feature Filter

International Multimedia Conference (2021)

ABSTRACT
Visual grounding has attracted much attention with the growing popularity of vision-language research. Existing one-stage methods are far faster than two-stage methods, but they fuse the textual feature and the visual feature map by simple concatenation, which ignores textual semantics and limits the models' cross-modal understanding. To overcome this weakness, we propose a semantic-aware framework that exploits both the query's structured knowledge and its context-sensitive representations to filter the visual feature maps and localize referents more accurately. The framework contains an entity filter, an attribute filter, and a location filter, which filter the input visual feature map step by step, each according to the corresponding aspect of the query. A grounding module then regresses the bounding box to localize the referential object. Experiments on commonly used datasets show that our framework achieves real-time inference speed and outperforms state-of-the-art methods.
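The abstract describes a sequential filtering pipeline but not its exact formulation. The minimal NumPy sketch below illustrates one plausible reading, under stated assumptions: each filter projects a textual aspect vector (entity, attribute, location) to per-channel gates that reweight the visual feature map, and a stand-in grounding head pools the filtered map and regresses box parameters. All dimensions, weight matrices, and the gating design are hypothetical, not the paper's published architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def semantic_filter(vis_map, text_vec, W):
    """One filtering step: project a textual aspect vector to per-channel
    gates and reweight the visual feature map (hypothetical gating design)."""
    g = sigmoid(text_vec @ W)           # gates, shape (C,)
    return vis_map * g[:, None, None]   # broadcast over (C, H, W)

# Hypothetical sizes: 64-d text aspect vectors, 32-channel 16x16 visual map.
C, H, Wd, T = 32, 16, 16, 64
vis = rng.standard_normal((C, H, Wd))
W_ent, W_att, W_loc = (rng.standard_normal((T, C)) for _ in range(3))
ent, att, loc = (rng.standard_normal(T) for _ in range(3))

# Filter the feature map step by step: entity -> attribute -> location.
x = semantic_filter(vis, ent, W_ent)
x = semantic_filter(x, att, W_att)
x = semantic_filter(x, loc, W_loc)

# Grounding-head stand-in: global average pool, then linear box regression.
W_box = rng.standard_normal((C, 4))
box = x.mean(axis=(1, 2)) @ W_box       # 4 box parameters, e.g. (cx, cy, w, h)
print(box.shape)  # (4,)
```

In a real model the gates would come from learned query encodings rather than random vectors, and the head would predict boxes densely over the map; the point here is only the step-by-step, aspect-conditioned filtering of a single shared visual feature map.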