Learning by Imagination: A Joint Framework for Text-Based Image Manipulation and Change Captioning.

IEEE Transactions on Multimedia (2023)

Image and text are dual modalities of our semantic interpretation. Changing images according to text descriptions, known as text-based image manipulation (TIM), allows us to imagine and visualize the world. In this paper, we introduce a framework that combines TIM with change captioning (CC) and exploits the benefits of co-training. CC aims to describe what has changed in a scene and can be regarded as the inverse of TIM; both tasks rely on generative networks. These generative networks can act as data producers for each other and, unlike previous methods, we find that integrating their learning procedures benefits both. Since the CC module describes the differences between two images as text, it can serve as an evaluation criterion and provide feedback. Furthermore, we use a shared attention mechanism in the TIM and CC modules to localize prominent regions and to enable a change-aware discriminator. In the opposite direction, the output image synthesized by the TIM module can be assessed with the CC module by checking whether the ground-truth text description can be re-described. Following this insight, we not only boost the training of the TIM module but also use the TIM module as additional supervision for CC training. Experimental results show that our framework substantially outperforms existing TIM methods on several datasets and achieves marginal improvements in the CC module. To the best of our knowledge, this is the first study dedicated to the joint training of TIM and CC tasks.
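The re-description check described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names (`token_f1`, `redescription_reward`), the toy stand-ins for the TIM and CC modules, and the use of token-overlap F1 as the score. The paper's actual networks, losses, and reward are not specified in the abstract.

```python
from collections import Counter
from typing import Callable, Dict

Image = Dict[str, str]  # toy "image": object name -> colour (assumption for illustration)


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two captions (a simple stand-in for a caption metric)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def redescription_reward(
    source: Image,
    instruction: str,
    manipulate: Callable[[Image, str], Image],
    caption_change: Callable[[Image, Image], str],
) -> float:
    """Score an edit by how well the captioner recovers the original instruction."""
    edited = manipulate(source, instruction)       # TIM direction: text -> edited image
    redescribed = caption_change(source, edited)   # CC direction: image pair -> text
    return token_f1(redescribed, instruction)


def toy_manipulate(image: Image, instruction: str) -> Image:
    """Hypothetical TIM stand-in: parses 'make the <object> <colour>'."""
    words = instruction.lower().split()
    obj, colour = words[-2], words[-1]
    return {**image, obj: colour}


def toy_caption_change(before: Image, after: Image) -> str:
    """Hypothetical CC stand-in: lists every object whose colour differs."""
    changed = [f"make the {k} {v}" for k, v in after.items() if before.get(k) != v]
    return " and ".join(changed) if changed else "no change"
```

In this toy setting, an edit that follows the instruction is re-described exactly and scores 1.0, while a manipulator that leaves the image untouched is captioned as "no change" and scores 0.0. A score of this kind could, under the assumptions above, be fed back to the manipulation module as a training signal.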
Key words
Text-based image manipulation, change captioning, generative networks, GANs, reinforcement learning