Fine grained image recognition algorithm based on CNN-Transformer and paired interaction

2024 4th International Conference on Neural Networks, Information and Communication (NNICE) (2024)

This paper proposes P-ResViT (Paired Interaction Res50-ViT), a new fine-grained image recognition algorithm based on a CNN-Transformer architecture. It addresses several weaknesses of existing fine-grained recognition methods built on convolutional neural networks (CNNs): they lack attention mechanisms, have difficulty locating key regions accurately, and are weak at extracting contextual information and global features. The algorithm introduces a paired interaction attention mechanism as a supplementary source of external attention. It combines the self-attention mechanism of the Vision Transformer with the external attention formed by paired image inputs during training to localize key regions precisely, while preserving the inductive bias and local feature extraction capability of the CNN. First, a fine-tuned, improved ResNet-50 performs local feature extraction and produces high-resolution feature maps. Second, the feature maps are split into tokenized image patches, which form the input sequence the Transformer encodes to extract global context. Finally, the Transformer completes global feature extraction and outputs the classification sequence. In addition, two external attention mechanisms are introduced by feeding image pairs during training: a collaborative attention mechanism for similar images and a cross-attention mechanism for heterogeneous images, so that the correlation information between images is fully exploited. In experimental evaluation, P-ResViT performs at an advanced level on three fine-grained image datasets: CUB, CAR, and AIR.
Deep learning, Fine grained image recognition, Vision Transformer, Convolutional neural networks, Attention mechanism, Paired image input
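The abstract describes turning a CNN feature map into a token sequence, letting a Transformer attend over it, and relating two images to each other during training. The following is a minimal, standard-library-only sketch of those generic building blocks, not the P-ResViT implementation. All names (`tokenize_feature_map`, `attention`, the toy feature maps) are hypothetical, the learned query/key/value projections are replaced by identity mappings, and the final cross-image call is plain cross-attention used for illustration, since the abstract does not specify how the paired interaction is computed.

```python
import math


def tokenize_feature_map(fmap):
    """Flatten a C x H x W feature map (nested lists) into H*W tokens of length C."""
    channels = len(fmap)
    height = len(fmap[0])
    width = len(fmap[0][0])
    return [[fmap[c][y][x] for c in range(channels)]
            for y in range(height) for x in range(width)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention.

    Assumption: identity projections stand in for the learned
    query/key/value weight matrices of a real Transformer layer.
    """
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


# Toy 2-channel, 2x2 "feature maps" standing in for CNN backbone outputs.
fmap_a = [[[1.0, 0.0], [0.0, 1.0]],
          [[0.0, 1.0], [1.0, 0.0]]]
fmap_b = [[[1.0, 1.0], [0.0, 0.0]],
          [[0.0, 0.0], [1.0, 1.0]]]

tokens_a = tokenize_feature_map(fmap_a)
tokens_b = tokenize_feature_map(fmap_b)

# Self-attention within one image, with a zero-initialised class token prepended.
cls_token = [0.0, 0.0]
sequence_a = [cls_token] + tokens_a
encoded_a = attention(sequence_a, sequence_a, sequence_a)
cls_output = encoded_a[0]  # position 0 plays the role of the classification token

# Generic cross-attention between a pair of images: queries come from image A,
# keys and values from image B (illustrative only, not the paper's mechanism).
cross_ab = attention(tokens_a, tokens_b, tokens_b)
```

In this sketch the same `attention` function covers both cases: passing one sequence three times gives self-attention within an image, while taking queries from one image and keys and values from the other lets the tokens of the first image be re-expressed in terms of the second.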