Recognize Any Regions.
CoRR(2023)
摘要
Understanding the semantics of individual regions or patches within
unconstrained images, such as in open-world object detection, represents a
critical yet challenging task in computer vision. Building on the success of
powerful image-level vision-language (ViL) foundation models like CLIP, recent
efforts have sought to harness their capabilities by either training a
contrastive model from scratch with an extensive collection of region-label
pairs or aligning the outputs of a detection model with image-level
representations of region proposals. Despite notable progress, these approaches
are plagued by computationally intensive training requirements, susceptibility
to data noise, and deficiency in contextual information. To address these
limitations, we explore the synergistic potential of off-the-shelf foundation
models, leveraging their respective strengths in localization and semantics. We
introduce a novel, generic, and efficient region recognition architecture,
named RegionSpot, designed to integrate position-aware localization knowledge
from a localization foundation model (e.g., SAM) with semantic information
extracted from a ViL model (e.g., CLIP). To fully exploit pretrained knowledge
while minimizing training overhead, we keep both foundation models frozen,
focusing optimization efforts solely on a lightweight attention-based knowledge
integration module. Through extensive experiments in the context of open-world
object recognition, our RegionSpot demonstrates significant performance
improvements over prior alternatives, while also providing substantial
computational savings. For instance, training our model with 3 million data in
a single day using 8 V100 GPUs. Our model outperforms GLIP by 6.5 % in mean
average precision (mAP), with an even larger margin by 14.8 % for more
challenging and rare categories.
更多查看译文
关键词
regions
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要