Scene Text Segmentation with Text-Focused Transformers

MM '23: Proceedings of the 31st ACM International Conference on Multimedia (2023)

Abstract
Text segmentation is a crucial aspect of various text-related tasks, including text erasing, text editing, and font style transfer. In recent years, multiple text segmentation datasets have been proposed, such as TextSeg for Latin text and BTS for bilingual text. However, existing methods either disregard the text-location annotations or directly use pre-trained text detectors; in either case, they cannot fully exploit the location annotations available in these datasets. To explicitly incorporate text location information to guide text segmentation, we propose an end-to-end text-focused segmentation framework in which text detection and segmentation are jointly optimized. In the proposed framework, we first extract multi-level global visual features through residual convolution blocks and then predict the mask of text areas with a text detection head. Subsequently, we develop a text-focused module that compels the model to pay more attention to text areas. Specifically, we introduce two types of attention masks to extract text-aware and instance-aware features, respectively. Finally, we employ hierarchical Transformer encoders to fuse the multi-level features and predict the text mask with a text segmentation head. To evaluate the effectiveness of our method, we conduct experiments on six text segmentation benchmarks. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art (SOTA) methods by a clear margin in most cases. The code and supplementary materials are available at https://github.com/FudanVI/FudanOCR/tree/main/text-focused-Transformers.
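
The abstract describes a detection-then-segmentation pipeline: residual conv blocks produce multi-level features, a detection head predicts a text-area mask, a text-focused module re-weights features toward text regions, and Transformer encoders fuse the levels before a segmentation head. The following is a minimal PyTorch sketch of that flow only. All module names, channel widths, and the simple mask-reweighting stand-in for the paper's text-aware/instance-aware attention are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual convolution block for multi-level feature extraction."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.skip = (nn.Identity() if stride == 1 and in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, 1, stride, bias=False))

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.skip(x))

class TextFocusedSegmenter(nn.Module):
    """Hypothetical end-to-end sketch: jointly predicts a text-area mask
    (detection) and a pixel-level text mask (segmentation)."""
    def __init__(self, channels=(32, 64, 128)):
        super().__init__()
        # Multi-level global visual features via residual conv blocks.
        self.stages = nn.ModuleList()
        in_ch = 3
        for ch in channels:
            self.stages.append(ResidualBlock(in_ch, ch, stride=2))
            in_ch = ch
        # Text detection head: coarse mask of text areas.
        self.det_head = nn.Conv2d(channels[-1], 1, 1)
        # Transformer encoder fusing the (masked) multi-level features;
        # the paper uses hierarchical encoders, simplified here to one.
        d_model = channels[-1]
        self.proj = nn.ModuleList(nn.Conv2d(ch, d_model, 1) for ch in channels)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Text segmentation head: final pixel-level text mask.
        self.seg_head = nn.Conv2d(d_model, 1, 1)

    def forward(self, img):
        feats, x = [], img
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        # Detect text areas from the deepest features.
        det_mask = torch.sigmoid(self.det_head(feats[-1]))
        # Text-focused step (assumed form): re-weight each level by the
        # predicted text-area mask so the model attends to text regions.
        h, w = feats[-1].shape[-2:]
        fused = 0
        for proj, f in zip(self.proj, feats):
            f = F.interpolate(proj(f), size=(h, w), mode='bilinear',
                              align_corners=False)
            fused = fused + f * det_mask
        # Flatten to tokens and fuse with the Transformer encoder.
        b, c = fused.shape[:2]
        tokens = self.encoder(fused.flatten(2).transpose(1, 2))  # (B, H*W, C)
        fused = tokens.transpose(1, 2).reshape(b, c, h, w)
        seg_mask = torch.sigmoid(self.seg_head(fused))
        return det_mask, seg_mask

model = TextFocusedSegmenter()
det, seg = model(torch.randn(1, 3, 128, 128))
print(det.shape, seg.shape)  # both (1, 1, 16, 16) for a 128x128 input
```

Returning both masks from one forward pass is what makes the detection and segmentation objectives jointly optimizable: a training loop would sum a detection loss on det_mask and a segmentation loss on seg_mask and backpropagate through the shared backbone.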