VGGAN: Visual Grounding GAN Using Panoptic Transformers

Fengnan Quan, Bo Lang

2023 8th International Conference on Image, Vision and Computing (ICIVC)

Abstract
Visual grounding is an important component of image annotation generation. Existing methods typically align the two modalities through similarity calculations over visual and text features during location inference and multi-modal fusion, which loses some visual and textual information and makes the model more likely to overfit to data from specific scenes. To address this problem, we propose a Visual Grounding Generative Adversarial Network (VGGAN) that fuses visual and text features using a panoptic transformer. We use a generative adversarial network to generate predictions and judge their accuracy, and we design the visual-text transformer according to panoptic theory. The model retains feature information and realizes full interactions between features, thereby better supporting the fusion of visual and text features. Experimental results on the COCO dataset of complex daily scenes verify the effectiveness of our model, which achieves the highest prediction accuracy compared with state-of-the-art methods.
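To make the generator/discriminator pairing described above concrete, the following is a minimal PyTorch sketch, not the paper's implementation: the abstract does not specify the panoptic transformer's architecture, so all module names, dimensions, the concatenate-then-self-attend fusion, and the (cx, cy, w, h) box parameterization are illustrative assumptions.

    # Hypothetical sketch of a GAN-based visual grounding setup.
    # Architecture details are assumptions, not the paper's actual design.
    import torch
    import torch.nn as nn

    class CrossModalGenerator(nn.Module):
        """Fuses visual and text tokens with a transformer, predicts a box."""
        def __init__(self, d_model=256, nhead=8, num_layers=4):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers)
            self.box_head = nn.Linear(d_model, 4)  # (cx, cy, w, h), normalized

        def forward(self, visual_tokens, text_tokens):
            # Concatenate both modalities so self-attention can mix them fully,
            # letting every visual token attend to every text token and back.
            fused = self.encoder(torch.cat([visual_tokens, text_tokens], dim=1))
            # Pool the fused sequence and regress box coordinates into [0, 1].
            return self.box_head(fused.mean(dim=1)).sigmoid()

    class GroundingDiscriminator(nn.Module):
        """Scores whether a box is a plausible grounding for the fused input."""
        def __init__(self, d_model=256):
            super().__init__()
            self.score = nn.Sequential(
                nn.Linear(d_model + 4, d_model), nn.ReLU(),
                nn.Linear(d_model, 1))

        def forward(self, context, box):
            return self.score(torch.cat([context, box], dim=-1))

    # Usage with random stand-in features (batch of 2; 10 visual, 6 text tokens):
    G = CrossModalGenerator()
    D = GroundingDiscriminator()
    vis = torch.randn(2, 10, 256)
    txt = torch.randn(2, 6, 256)
    pred_box = G(vis, txt)                      # (2, 4) predicted boxes
    context = torch.cat([vis, txt], 1).mean(1)  # (2, 256) pooled context
    realness = D(context, pred_box)             # (2, 1) adversarial logits

Under this reading, the discriminator plays the "judge the accuracy" role from the abstract: it is trained to separate ground-truth boxes from generated ones, while the generator is trained to fool it in addition to any supervised regression loss.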
Keywords
visual grounding,panoptic theory,transformer,generative adversarial network