Visual and semantic ensemble for scene text recognition with gated dual mutual attention

Int. J. Multim. Inf. Retr. (2022)

Scene text recognition is a challenging computer vision task because text appearance varies widely, for example through image distortion and rotation. Linguistic priors, however, help people infer text from images even when some characters are missing or blurry. This paper investigates the fusion of visual cues and linguistic dependencies to boost recognition performance. We introduce a relational attention module to leverage visual patterns and word representations. We embed linguistic dependencies from a language model into the optimization framework so that the predicted sequence captures the contextual dependencies within a word. We propose a dual mutual attention transformer that promotes cross-modality interactions, so that the inter- and intra-correlations between visual and linguistic features can be fully explored. A gate function lets the model learn the contribution of each modality, further improving performance. Extensive experiments demonstrate that our method improves recognition of low-quality images and achieves state-of-the-art performance on datasets of text from regular and irregular scenes.
Text recognition, Multimodal fusion, Convolutional neural network
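
The abstract mentions a gate function that learns how much each modality contributes, but does not give its formula. The sketch below illustrates one common form of such a gate, an element-wise sigmoid that blends a visual feature vector with a linguistic one. It is an assumption for illustration only, not the paper's formulation: the function name `gated_fusion`, the per-element weights `w_v` and `w_l`, and `bias` are all hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, linguistic, w_v, w_l, bias):
    """Blend two equal-length feature vectors with an element-wise gate.

    Hypothetical form (not taken from the paper):
        gate_i  = sigmoid(w_v[i] * visual[i] + w_l[i] * linguistic[i] + bias[i])
        fused_i = gate_i * visual[i] + (1 - gate_i) * linguistic[i]

    A gate near 1 trusts the visual feature; a gate near 0 trusts the
    linguistic feature. Returns (fused, gates).
    """
    n = len(visual)
    if not all(len(x) == n for x in (linguistic, w_v, w_l, bias)):
        raise ValueError("all inputs must have the same length")

    fused, gates = [], []
    for v, l, a, b, c in zip(visual, linguistic, w_v, w_l, bias):
        g = sigmoid(a * v + b * l + c)
        gates.append(g)
        fused.append(g * v + (1.0 - g) * l)
    return fused, gates


# Usage: with zero weights and zero bias the gate is 0.5 everywhere,
# so the fused vector is the plain average of the two modalities.
visual = [1.0, 2.0]
linguistic = [3.0, 0.0]
zeros = [0.0, 0.0]
fused, gates = gated_fusion(visual, linguistic, zeros, zeros, zeros)
```

In a trained model the weights and bias would be learned, so the gate could lean on linguistic context where the image is blurry and on visual evidence where it is clear.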