From coarse to fine: multi-level feature fusion network for fine-grained image retrieval

Multimedia Systems (2022)

Fine-grained image retrieval (FGIR) has received extensive attention in academia and industry. Despite tremendous progress, the problem of large intra-class differences and small inter-class differences remains open. Existing work on fine-grained image classification, a task closely related to FGIR, focuses on learning discriminative local features to address this challenge. Based on this observation, it is unreasonable for FGIR to use only global features (i.e., object features or image features) and ignore discriminative local features (i.e., patch features). In this paper, we propose a novel coarse-to-fine multi-level feature fusion network (MFFN) that addresses this problem through multi-level feature extraction and fusion. MFFN first uses object-level features for coarse retrieval, a step that narrows the scope of the retrieval. For the fine retrieval stage, we fuse multi-level features through a deep belief network (DBN) to mine the intrinsic correlation and complementary information between patch-level and image-level features. In addition, for patch-level features, we design a new constraint to select discriminative patches and propose a weighted max-pooling method to aggregate them. The proposed framework achieves state-of-the-art performance on widely used benchmarks, including the CUB-200-2011 and Oxford-Flower-102 datasets.
Convolutional neural network, Multi-level feature fusion, Fine-grained image retrieval
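
The coarse-to-fine retrieval flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary keys ("object", "image", "patch"), and the use of cosine similarity are all assumptions made for illustration. In particular, plain concatenation stands in for the DBN-based fusion, and the weighted max-pooling is assumed to be an element-wise maximum over weight-scaled patch vectors, since the abstract does not give its exact form.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def weighted_max_pool(patches, weights):
    """Aggregate patch vectors by an element-wise max over weight-scaled patches.

    Assumed form of weighted max-pooling; the abstract does not define it.
    """
    dim = len(patches[0])
    return [max(w * p[d] for p, w in zip(patches, weights)) for d in range(dim)]


def fuse(image_feat, patch_feat):
    """Placeholder fusion by concatenation (the paper uses a DBN instead)."""
    return list(image_feat) + list(patch_feat)


def coarse_to_fine(query, gallery, k):
    """Return gallery indices ranked by the two-stage procedure.

    Stage 1 (coarse): keep the k gallery items closest in object-level features.
    Stage 2 (fine): re-rank those k items by fused image- and patch-level features.
    """
    coarse = sorted(
        range(len(gallery)),
        key=lambda i: cosine(query["object"], gallery[i]["object"]),
        reverse=True,
    )[:k]
    fused_query = fuse(query["image"], query["patch"])
    return sorted(
        coarse,
        key=lambda i: cosine(
            fused_query, fuse(gallery[i]["image"], gallery[i]["patch"])
        ),
        reverse=True,
    )
```

In this sketch the coarse stage only limits the candidate set. The final ordering is decided entirely by the fused features, so an item that is slightly worse at the object level can still be ranked first after the fine stage.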
