
SMVT: Spectrum-Driven Multi-scale Vision Transformer for Referring Image Segmentation

Tianxiao Li, Junhong Chen, Yiheng Huang, Kesi Huang, Qiqiang Xia, Muhammad Asim, Wenyin Liu

Advanced Intelligent Computing Technology and Applications, Part VI, ICIC 2024 (2024)

Guangdong University of Technology

Abstract
Referring image segmentation is a challenging task at the intersection of computer vision and natural language processing that aims to segment the object referred to by a natural language expression from an image. Despite significant recent progress, existing methods still struggle to integrate visual and linguistic information effectively and to capture fine-grained details within images. These challenges stem primarily from the lack of a mechanism that can deeply and comprehensively fuse visual features with language features and effectively exploit cross-modal features. To address these problems, we propose the Spectrum-driven Multi-scale Vision Transformer (SMVT), which incorporates two innovative designs: Spectrum-driven Fusion Attention (SFA) and a Cross-modal Feature Refinement Enhancement (CFRE) module. By guiding the fusion of visual and linguistic features in the spectral domain, SFA captures fine-grained image features and heightens the model's sensitivity to local spectral information, allowing it to respond more accurately to the details demanded by language descriptions. By refining and enhancing cross-modal features at different layers, the CFRE module strengthens cross-layer complementarity and the capture of fine-grained cross-modal features, promoting precise alignment of visual and language features. Together, these two modules enable SMVT to process visual and language information more effectively. Experiments on three benchmark datasets show that our method surpasses state-of-the-art approaches.
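To make the idea of spectrum-driven fusion concrete, here is a minimal sketch, not the authors' implementation: visual features are moved to the frequency domain with an FFT, modulated channel-wise by a gate derived from the language embedding, and transformed back. All names, shapes, and the sigmoid gating are illustrative assumptions; the paper's SFA may differ substantially.

```python
# Hypothetical sketch of spectral-domain vision-language fusion
# (illustrative only; NOT the SFA module from the paper).
import numpy as np

def spectral_fusion(visual, lang):
    """visual: (H, W, C) feature map; lang: (C,) sentence embedding."""
    # 2-D FFT over the spatial dimensions, independently per channel
    freq = np.fft.fft2(visual, axes=(0, 1))        # complex, (H, W, C)
    # language-conditioned per-channel gate in (0, 1) via a sigmoid
    gate = 1.0 / (1.0 + np.exp(-lang))             # (C,)
    # modulate each channel's spectrum, then return to the spatial domain
    fused = np.fft.ifft2(freq * gate, axes=(0, 1)).real
    return fused

H, W, C = 8, 8, 4
visual = np.random.randn(H, W, C)
lang = np.random.randn(C)
out = spectral_fusion(visual, lang)
print(out.shape)  # (8, 8, 4)
```

Because the gate here acts uniformly on each channel's whole spectrum, a zero language embedding simply scales every channel by 0.5; a learned, frequency-dependent gate would let language emphasize particular spatial frequencies, which is the intuition behind driving fusion from the spectral domain.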
Key words
Referring image segmentation, Cross-modal learning, Spectrum-driven fusion