
SMVT: Spectrum-Driven Multi-scale Vision Transformer for Referring Image Segmentation

Tianxiao Li, Junhong Chen, Yiheng Huang, Kesi Huang, Qiqiang Xia, Muhammad Asim, Wenyin Liu

Advanced Intelligent Computing Technology and Applications, Part VI, ICIC 2024 (2024)

Guangdong University of Technology

Abstract
Referring image segmentation is a challenging task at the intersection of computer vision and natural language processing that aims to segment the object referred to by a natural language expression from an image. Despite significant recent progress, existing methods still struggle to integrate visual and linguistic information effectively and to capture fine-grained details within images. These challenges stem primarily from the lack of a mechanism that can deeply and comprehensively fuse visual features with language features and effectively exploit cross-modal features. To address these problems, we propose the Spectrum-driven Multi-scale Vision Transformer (SMVT), which incorporates two innovative designs: Spectrum-driven Fusion Attention (SFA) and a Cross-modal Feature Refinement Enhancement (CFRE) module. By guiding the fusion of visual and linguistic features in the spectral domain, SFA captures fine-grained image features and heightens the model's sensitivity to local spectral information, allowing it to respond more accurately to the details demanded by language descriptions. By refining and enhancing cross-modal features at different layers, the CFRE module strengthens cross-layer complementarity and the capture of fine-grained cross-modal features, promoting precise alignment of visual and language features. Together, these two modules enable SMVT to process visual and language information more effectively. Experiments on three benchmark datasets show that our method surpasses state-of-the-art approaches.
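To make the idea of spectrum-driven fusion concrete, here is a minimal sketch, not the authors' implementation: visual features are moved to the frequency domain with an FFT, modulated channel-wise by a gate derived from the language embedding, and transformed back. All names, shapes, and the sigmoid gating are illustrative assumptions; the paper's SFA may differ substantially.

```python
# Hypothetical sketch of spectral-domain vision-language fusion
# (illustrative only; NOT the SFA module from the paper).
import numpy as np

def spectral_fusion(visual, lang):
    """visual: (H, W, C) feature map; lang: (C,) sentence embedding."""
    # 2-D FFT over the spatial dimensions, independently per channel
    freq = np.fft.fft2(visual, axes=(0, 1))        # complex, (H, W, C)
    # language-conditioned per-channel gate in (0, 1) via a sigmoid
    gate = 1.0 / (1.0 + np.exp(-lang))             # (C,)
    # modulate each channel's spectrum, then return to the spatial domain
    fused = np.fft.ifft2(freq * gate, axes=(0, 1)).real
    return fused

H, W, C = 8, 8, 4
visual = np.random.randn(H, W, C)
lang = np.random.randn(C)
out = spectral_fusion(visual, lang)
print(out.shape)  # (8, 8, 4)
```

Because the gate here acts uniformly on each channel's whole spectrum, a zero language embedding simply scales every channel by 0.5; a learned, frequency-dependent gate would let language emphasize particular spatial frequencies, which is the intuition behind driving fusion from the spectral domain.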
Key words
Referring image segmentation, Cross-modal learning, Spectrum-driven fusion