Large Language Models for Captioning and Retrieving Remote Sensing Images
CoRR (2024)
Abstract
Image captioning and cross-modal retrieval are examples of tasks that involve
the joint analysis of visual and linguistic information. In the context of
remote sensing imagery, these tasks can help non-expert users extract
relevant Earth observation information for a variety of applications. Still,
despite some previous efforts, the development and application of vision and
language models in the remote sensing domain has been hindered by the
relatively small size of the datasets and models used in previous studies. In
this work, we propose RS-CapRet, a vision-and-language method for remote
sensing tasks, in particular image captioning and text-image retrieval. We
specifically propose to use a highly capable large decoder language model
together with image encoders adapted to remote sensing imagery through
contrastive language-image pre-training. To bridge the image encoder and the
language decoder, we propose training simple linear layers on examples
combined from different remote sensing image captioning datasets, keeping all
other parameters frozen. RS-CapRet can then generate descriptions for remote
sensing images and retrieve images from textual descriptions, achieving
state-of-the-art or competitive performance relative to existing methods.
Qualitative results illustrate that RS-CapRet can effectively leverage the
pre-trained large language model to describe remote sensing images, retrieve
them based on different types of queries, and also process interleaved
sequences of images and text in a dialogue manner.
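The bridging idea described above (training only linear layers that map frozen image-encoder features into the frozen language model's input embedding space) can be sketched as follows. This is a minimal illustration using NumPy with made-up dimensions and random stand-ins for the frozen components; the actual encoder, language model, and dimensionalities in RS-CapRet are not specified in the abstract.

```python
import numpy as np

# Hypothetical dimensions, chosen for illustration only.
D_IMG = 512          # output dim of the frozen image encoder
D_LM = 1024          # input-embedding dim of the frozen language decoder
N_VISUAL_TOKENS = 4  # how many LM-space "visual tokens" one image becomes

rng = np.random.default_rng(0)

# Stand-in for a frozen CLIP-style encoder's output for one image.
image_features = rng.normal(size=(D_IMG,))

# The only trainable parameters in this scheme: one linear layer mapping
# image features to a short sequence of language-model input embeddings.
W = rng.normal(scale=0.02, size=(D_IMG, D_LM * N_VISUAL_TOKENS))
b = np.zeros(D_LM * N_VISUAL_TOKENS)

def project(feats: np.ndarray) -> np.ndarray:
    """Map one image feature vector to N_VISUAL_TOKENS LM-space embeddings."""
    return (feats @ W + b).reshape(N_VISUAL_TOKENS, D_LM)

visual_tokens = project(image_features)
# These vectors would be prepended to the caption's token embeddings and fed
# to the frozen decoder; only W and b receive gradient updates during training.
print(visual_tokens.shape)
```

Because the encoder and decoder stay frozen, the trainable parameter count is just `D_IMG * D_LM * N_VISUAL_TOKENS + D_LM * N_VISUAL_TOKENS`, which keeps training lightweight even with a large language model.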