A three-stream fusion and self-differential attention network for multi-modal crowd counting

Pattern Recognition Letters(2024)

Abstract
Multi-modal crowd counting aims to use multiple types of data, such as RGB-Thermal and RGB-Depth, to count the number of people in crowded scenes. Current methods mainly focus on two-stream multi-modal information fusion in the encoder and single-scale semantic features in the decoder. In this paper, we propose an end-to-end three-stream fusion and self-differential attention network that simultaneously addresses the multi-modal fusion and scale variation problems in multi-modal crowd counting. Specifically, the encoder adopts three-stream fusion to fuse stage-wise modality-paired and modality-specific features. The decoder applies a self-differential attention mechanism to the multi-level fused features to adaptively extract basic and differential information, and the counting head then predicts the density map. Experimental results on RGB-T and RGB-D benchmarks show that the proposed method outperforms state-of-the-art multi-modal crowd counting methods. Ablation studies and visualizations demonstrate the advantages of the proposed modules.
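The abstract does not spell out how a predicted density map becomes a count. In density-based crowd counting generally, the count is the sum of the map over all pixels. The following minimal sketch illustrates only that relation, assuming a common ground-truth construction in which each annotated head contributes a normalized Gaussian blob. The function names, kernel size, and sigma are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def gaussian_kernel(size: int, sigma: float) -> np.ndarray:
    """Return a (size, size) Gaussian kernel normalized to sum to 1."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()


def density_map(points, height: int, width: int,
                size: int = 7, sigma: float = 1.5) -> np.ndarray:
    """Build a toy density map from (row, col) head annotations.

    The map is kept on a padded canvas of shape
    (height + size - 1, width + size - 1) so that blobs near the image
    border keep their full unit mass. `size` is assumed to be odd.
    """
    pad = size // 2
    canvas = np.zeros((height + 2 * pad, width + 2 * pad))
    kernel = gaussian_kernel(size, sigma)
    for row, col in points:
        # In padded coordinates the blob centered on (row, col)
        # occupies rows row..row+size-1 and cols col..col+size-1.
        canvas[row:row + size, col:col + size] += kernel
    return canvas


def estimate_count(density: np.ndarray) -> float:
    """The estimated crowd count is the integral (sum) of the density map."""
    return float(density.sum())


# Hypothetical annotations: three heads in a 32 x 48 image.
heads = [(0, 0), (10, 12), (31, 47)]
density = density_map(heads, height=32, width=48)
count = estimate_count(density)
```

Because every blob is normalized to unit mass, `count` equals the number of annotated heads up to floating-point error. A trained counting head would output such a map directly, and the same summation would give its predicted count.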
Keywords
Crowd counting, Three-stream fusion, Self-differential attention, Multi-modal data