Improving Image Captioning via Dual-Stream Multi-Layer Dynamic Fusion

Research Square (2023)

Abstract
Recent research in image captioning has focused on combining region features and grid features to enrich visual information. However, conventional fusion methods may introduce semantic noise that disturbs model prediction. To address this problem, we propose a simple and efficient dual-stream network, the Multi-layer Dynamic Fusion Transformer (MDFT), which employs a multi-layer dynamic fusion method to enhance region features and grid features simultaneously. Specifically, we introduce a Cross-layer Perceptual Self-Attention (CPSA) module that locally models visual features at multiple layers in order to refine them. Additionally, we design an Adaptive Selection Controller (ASC) that reduces the computational burden of the layered structure and dynamically selects attention layers with similar semantic information for interaction. Through this fusion strategy, MDFT effectively reduces semantic noise and exploits the complementary advantages of region features and grid features. Experimental results on the MS-COCO dataset indicate that MDFT achieves competitive performance on both the local and online test sets, with scores of 134.0% and 133.7%, respectively.
Keywords: image captioning, fusion, dual-stream, multi-layer
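
The abstract describes the ASC as selecting attention layers with similar semantic information across the two streams, but it does not give the selection rule. The following is a minimal sketch of one plausible reading, not the authors' implementation: each layer's output is summarized by mean pooling, and every region-stream layer is paired with the grid-stream layer whose summary has the highest cosine similarity. The function names (`mean_pool`, `cosine`, `select_layer_pairs`), the pooling step, and the cosine criterion are all assumptions made for illustration.

```python
import math


def mean_pool(features):
    """Average a list of feature vectors (one per region or grid cell)."""
    dim = len(features[0])
    return [sum(vec[i] for vec in features) / len(features) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_layer_pairs(region_layers, grid_layers):
    """Hypothetical selection rule (an assumption, not taken from the paper):
    pair each region-stream layer with its most similar grid-stream layer."""
    grid_summaries = [mean_pool(layer) for layer in grid_layers]
    pairs = []
    for r_idx, layer in enumerate(region_layers):
        r_summary = mean_pool(layer)
        sims = [cosine(r_summary, g) for g in grid_summaries]
        best = max(range(len(sims)), key=sims.__getitem__)
        pairs.append((r_idx, best, sims[best]))
    return pairs


# Toy data: two layers per stream, two 2-dimensional feature vectors per layer.
region_layers = [
    [[1.0, 0.0], [0.8, 0.2]],  # region layer 0: mostly along the first axis
    [[0.0, 1.0], [0.1, 0.9]],  # region layer 1: mostly along the second axis
]
grid_layers = [
    [[0.0, 2.0], [0.2, 1.8]],  # grid layer 0: mostly along the second axis
    [[3.0, 0.1], [2.5, 0.0]],  # grid layer 1: mostly along the first axis
]

pairs = select_layer_pairs(region_layers, grid_layers)
for r_idx, g_idx, sim in pairs:
    print(f"region layer {r_idx} <-> grid layer {g_idx} (similarity {sim:.3f})")
```

Under this reading, only the selected pairs would interact, which is one way a controller could avoid attending across every combination of layers. A learned gating network would be an equally plausible design; the abstract alone does not say which is used.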