Image-Text Connection: Exploring the Expansion of the Diversity Within Joint Feature Space Similarity Scores

Mahsa Mohammadi, Mahdi Eftekhari, Amirhossein Hassani

IEEE Access (2023)

Abstract
Cross-modal representation learning aims to learn a shared representation space where data from multiple modalities can be effectively compared, fused, and understood. This paper investigates the role of increased diversity in the similarity score matrix in enhancing the performance of CLIP (Contrastive Language-Image Pretraining), a multi-modal learning model that establishes a connection between images and text within a joint embedding space. Two transformation approaches, sine and sigmoid (the latter in two versions), are incorporated into the CLIP model to amplify larger values and diminish smaller values within the similarity matrix (logits). Hardware limitations are addressed by using a more compact text encoder (DistilBERT) together with a pre-trained ResNet50 image encoder. The proposed adaptations are evaluated on image classification and image/text retrieval tasks across ten benchmark datasets, including Food101, Flickr30k, and COCO. The adapted models are compared to the base CLIP model using accuracy, mean per-class accuracy, and Recall@k metrics. The results demonstrate improvements in accuracy (up to a 5.32% gain on the PatchCamelyon dataset), mean per-class accuracy (up to a 14.48% gain on the FGVCAircraft dataset), and retrieval precision (an increase of up to 45.20% in Recall@1 on the COCO dataset) over the baseline CLIP model.
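
The abstract does not give the exact sine and sigmoid formulas, so the following is a minimal sketch, assuming a CLIP-style symmetric contrastive loss in PyTorch; the `transform_logits` function and its `alpha` sharpness parameter are illustrative placeholders for the idea of warping the similarity matrix to sharpen the contrast between large and small values, not the paper's actual definitions.

```python
# Minimal sketch (assumptions, not the paper's exact method): illustrative
# monotone warps of the cosine-similarity matrix applied before the
# contrastive loss, as the abstract describes.
import math
import torch
import torch.nn.functional as F

def transform_logits(sims, mode="sigmoid", alpha=5.0):
    # `alpha` is a hypothetical sharpness parameter, not taken from the paper.
    if mode == "sigmoid":
        # Steep sigmoid around 0: large similarities move toward 1, small toward 0.
        return torch.sigmoid(alpha * sims)
    if mode == "sine":
        # Sine warp of the (roughly [-1, 1]) cosine similarities.
        return torch.sin(sims * math.pi / 2)
    return sims

def clip_style_loss(image_features, text_features, temperature=0.07, mode="sigmoid"):
    # Cosine similarities between every image/text pair in the batch.
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    logits = transform_logits(img @ txt.t(), mode=mode) / temperature

    # Symmetric CLIP-style objective: matched pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text-to-image direction
    return (loss_i2t + loss_t2i) / 2
```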
Keywords
Adaptation models, Transformers, Joining processes, Computational modeling, Visualization, Task analysis, Representation learning, Information retrieval, Text mining, CLIP, cosine similarity matrix, diversity, dual-modal, image classification, image/text retrieval, joint embedding space