Visual and semantic ensemble for scene text recognition with gated dual mutual attention

Int. J. Multim. Inf. Retr. (2022)

Abstract
Scene text recognition is a challenging task in computer vision due to large variations in text appearance, such as image distortion and rotation. However, linguistic priors help humans infer text from images even when some characters are missing or blurry. This paper investigates the fusion of visual cues and linguistic dependencies to boost recognition performance. We introduce a relational attention module to leverage visual patterns and word representations, and we embed linguistic dependencies from a language model into the optimization framework so that the predicted sequence captures the contextual dependencies within a word. We propose a dual mutual attention transformer that promotes cross-modality interactions, allowing the inter- and intra-correlations between the visual and linguistic modalities to be fully explored. An introduced gate function lets the model learn the contribution of each modality and further boosts performance. Extensive experiments demonstrate that our method improves recognition on low-quality images and achieves state-of-the-art performance on both regular and irregular scene text datasets.
Keywords
Text recognition,Multimodal fusion,Convolutional neural network
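The abstract's central mechanism is a gated fusion of two modalities: each modality cross-attends to the other, and a learned gate weighs their contributions. The following is a minimal numpy sketch of that idea only; the function names (`cross_attention`, `gated_fusion`) and the parameter-free sigmoid gate are illustrative assumptions, not the paper's actual architecture, which uses a transformer with learned gate parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # scaled dot-product attention from one modality to the other
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def gated_fusion(visual, linguistic):
    """Fuse visual and linguistic features (each of shape (T, d))
    via mutual cross-attention and an elementwise sigmoid gate."""
    d = visual.shape[-1]
    v2l = cross_attention(visual, linguistic, d)    # visual attends to linguistic
    l2v = cross_attention(linguistic, visual, d)    # linguistic attends to visual
    # hypothetical parameter-free gate: the paper learns this weighting
    gate = 1.0 / (1.0 + np.exp(-(v2l + l2v)))       # values in (0, 1)
    return gate * v2l + (1.0 - gate) * l2v

rng = np.random.default_rng(0)
T, d = 5, 8  # sequence length (e.g. characters) and feature dimension
fused = gated_fusion(rng.standard_normal((T, d)), rng.standard_normal((T, d)))
print(fused.shape)  # → (5, 8)
```

The gate lets the fused representation lean on the visual stream for clear glyphs and on the language stream for occluded or blurry characters, which matches the paper's claim of improved recognition on low-quality images.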