Parameter-efficient vision transformer with linear attention

2023 IEEE International Conference on Image Processing (ICIP 2023)

Abstract
Recent advances in vision transformers (ViTs) have achieved outstanding performance on visual recognition tasks, including image classification and detection. ViTs can learn global representations through their self-attention mechanism, but they are usually heavyweight and unsuitable for resource-constrained devices. In this paper, we propose a novel linear feature attention (LFA) module to reduce the computation cost of vision transformers and combine it with efficient mobile CNN modules to form a parameter-efficient, high-performance CNN-ViT hybrid model, called LightFormer, which can serve as a general-purpose backbone to learn both global and local representations. Comprehensive experiments demonstrate that LightFormer achieves competitive performance across different visual recognition tasks. On the ImageNet-1K dataset, LightFormer achieves a top-1 accuracy of 78.5% with 5.5 million parameters. Our model also performs well when transferred to object detection and semantic segmentation tasks. On the MS COCO dataset, LightFormer attains an mAP of 33.2 within the YOLOv3 framework, and on the Cityscapes dataset, with only a simple all-MLP decoder, LightFormer achieves an mIoU of 78.5 at 15.3 FPS, surpassing state-of-the-art lightweight segmentation networks.
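The abstract does not spell out how the LFA module is implemented. As an illustration only, the sketch below shows a generic linearized attention block in PyTorch (not the paper's LFA module): instead of forming the N x N softmax attention map, a positive kernel feature map is applied to queries and keys so that the key-value product can be computed first, reducing the cost from quadratic to linear in the number of tokens N. The module name, head count, and kernel choice (ELU + 1) are assumptions for demonstration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """Generic linear attention sketch (illustrative, not the paper's LFA):
    computes phi(Q) @ (phi(K)^T V) so the cost is O(N * d^2) rather than O(N^2 * d)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.head_dim = dim // heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, N, dim)
        B, N, _ = x.shape
        qkv = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.heads, self.head_dim).transpose(1, 2)
                   for t in qkv)                            # each: (B, h, N, d)
        q = F.elu(q) + 1                                    # positive kernel feature map
        k = F.elu(k) + 1
        kv = torch.einsum('bhnd,bhne->bhde', k, v)          # (B, h, d, d): summed over tokens
        z = 1.0 / (torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + 1e-6)  # normalizer
        out = torch.einsum('bhnd,bhde,bhn->bhne', q, kv, z)  # (B, h, N, d)
        out = out.transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)

# Usage example: a batch of 2 images tokenized into 196 patches of dimension 128.
tokens = torch.randn(2, 196, 128)
attn = LinearAttention(dim=128, heads=4)
print(attn(tokens).shape)  # torch.Size([2, 196, 128])
```

Because the key-value summary `kv` has a fixed size of d x d per head, memory and compute no longer grow quadratically with the token count, which is what makes this family of attention mechanisms attractive for lightweight, mobile-oriented backbones such as the one described here.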
Keywords
vision transformers, self-attention, image classification, semantic segmentation