CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification
arXiv (2024)
Abstract
The CLIP (Contrastive Language-Image Pretraining) model has exhibited
outstanding performance in recognition problems, such as zero-shot image
classification and object detection. However, its ability to count remains
understudied due to the inherent challenges of transforming counting, a
regression task, into a recognition task. In this paper, we investigate CLIP's
potential in counting, focusing specifically on estimating crowd sizes.
Existing classification-based crowd-counting methods have encountered issues,
including inappropriate discretization strategies, which impede the application
of CLIP and result in suboptimal performance. To address these challenges, we
propose the Enhanced Blockwise Classification (EBC) framework. In contrast to
previous methods, EBC relies on integer-valued bins that facilitate the
learning of robust decision boundaries. Within our model-agnostic EBC
framework, we introduce CLIP-EBC, the first fully CLIP-based crowd-counting
model capable of generating density maps. Comprehensive evaluations across
diverse crowd-counting datasets demonstrate the state-of-the-art performance of
our methods. Particularly, EBC can improve existing models by up to 76.9%.
Moreover, our CLIP-EBC model surpasses current crowd-counting methods,
achieving mean absolute errors of 55.0 and 6.3 on ShanghaiTech part A and part
B datasets, respectively. The code will be made publicly available.
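
To make the idea of blockwise classification over integer-valued bins concrete, here is a minimal sketch. It is not the authors' implementation. The bin ranges, the midpoint rule for each bin's representative value, and the helper names (`BINS`, `count_to_bin`, `block_counts`, `expected_density`) are all assumptions chosen for illustration. The sketch counts annotated head positions per image block, maps each integer count to a class index, and converts per-block class probabilities back into a blockwise density map.

```python
import numpy as np

# Hypothetical bins for per-block counts (NOT taken from the paper):
# small counts get their own integer bin, larger counts share a bin.
BINS = [(0, 0), (1, 1), (2, 2), (3, 4), (5, 8)]  # inclusive integer ranges
# Assumed representative value per bin: the midpoint of its range.
BIN_VALUES = np.array([(lo + hi) / 2 for lo, hi in BINS])


def count_to_bin(count: int) -> int:
    """Return the index of the bin containing an integer block count."""
    for idx, (lo, hi) in enumerate(BINS):
        if lo <= count <= hi:
            return idx
    return len(BINS) - 1  # clamp counts above the last bin


def block_counts(points: np.ndarray, height: int, width: int, block: int) -> np.ndarray:
    """Count annotated (x, y) positions in each block-by-block cell.

    Assumes height and width are multiples of block and that every
    point lies inside the image.
    """
    grid = np.zeros((height // block, width // block), dtype=int)
    for x, y in points:
        grid[int(y) // block, int(x) // block] += 1
    return grid


def expected_density(probs: np.ndarray) -> np.ndarray:
    """Map class probabilities of shape (H, W, n_bins) to a blockwise density map."""
    return probs @ BIN_VALUES


# Usage: four annotated heads in a 16x16 image split into 8x8 blocks.
points = np.array([[1, 1], [2, 3], [10, 10], [9, 12]])
grid = block_counts(points, height=16, width=16, block=8)
labels = np.vectorize(count_to_bin)(grid)      # classification targets per block
one_hot = np.eye(len(BINS))[labels]            # stand-in for a perfect classifier
density = expected_density(one_hot)
print(density.sum())                           # estimated crowd count
```

Because every bin edge is an integer, each possible block count falls unambiguously into one class. Summing the density map gives the image-level count that a metric such as mean absolute error would be computed on.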