CrackViT: a unified CNN-transformer model for pixel-level crack extraction

Neural Computing & Applications (2023)

Abstract
Pixel-level crack extraction (PCE) is challenging due to complex topology, irregular edges, low contrast, and cluttered backgrounds. Recently, Transformer architectures have shown great potential on many vision tasks, sometimes outperforming convolutional neural networks (CNNs). Benefiting from the self-attention mechanism, Transformers capture global context information and establish long-range dependencies on the detected objects. However, there has been little work applying Transformer architectures to PCE. In this paper, we present the first systematic analysis of three well-designed Transformer architectures for the PCE task, covering network structures and parameters, feature fusion modes, training data and strategy, and generalization ability. We propose CrackViT, a crack extraction network with a Vision Transformer, which jointly captures detailed structures and long-distance dependencies through a novel hybrid CNN-Transformer encoder that preserves the corresponding topologies. To better suit the PCE task, we explore three feature fusion modes between the CNN and Transformer branches. In addition, we propose a novel feature aggregation block that sharpens the edges of the decoder's upsampled features and reduces the noise from shallow features. Moreover, we adopt a multi-task supervised training strategy to further refine crack-edge details. Results on four challenging datasets, including CrackForest, DeepCrack, CRKWH100, and CRACK500, show that CrackViT outperforms state-of-the-art CNN-based methods and the other two Transformer architectures. Our code is available at: https://github.com/SmilQe/CrackViT .
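To make the fusion idea concrete, the following is a minimal NumPy sketch of one plausible CNN-Transformer feature-fusion mode: channel concatenation followed by a 1x1 convolution (a per-pixel linear map over channels). All shapes, channel counts, and the concatenation choice here are illustrative assumptions, not the exact design described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W = 8, 8            # spatial size of the fused feature map (assumed)
c_cnn, c_vit = 16, 16  # channels from the CNN and Transformer branches (assumed)
c_out = 16             # channels after fusion (assumed)

# CNN branch: local/detail features; Transformer branch: global-context
# features, with tokens already reshaped back to a spatial grid.
cnn_feat = rng.standard_normal((c_cnn, H, W))
vit_feat = rng.standard_normal((c_vit, H, W))

# Fusion mode: concatenate along the channel axis, then mix channels with
# a 1x1 convolution, expressed as an einsum over the channel dimension.
fused_in = np.concatenate([cnn_feat, vit_feat], axis=0)   # (32, H, W)
w = rng.standard_normal((c_out, c_cnn + c_vit)) * 0.1     # 1x1 conv weights
fused = np.einsum('oc,chw->ohw', w, fused_in)             # (16, H, W)

print(fused.shape)  # (16, 8, 8)
```

Other fusion modes the paper compares could be sketched analogously, e.g. element-wise addition of the two branches instead of concatenation.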
Keywords
Crack structures, Crack extraction, Vision transformer, UNet framework