GLID: Pre-training a Generalist Encoder-Decoder Vision Model
CVPR 2024
Abstract
This paper proposes a GeneraLIst encoder-Decoder (GLID) pre-training method
for better handling various downstream computer vision tasks. While
self-supervised pre-training approaches, e.g., Masked Autoencoder, have shown
success in transfer learning, task-specific sub-architectures still have to be
appended for different downstream tasks, and these appended modules cannot
enjoy the benefits of large-scale pre-training. GLID overcomes this challenge
by allowing
the pre-trained generalist encoder-decoder to be fine-tuned on various vision
tasks with minimal task-specific architecture modifications. In the GLID
training scheme, both the pre-training pretext task and the downstream tasks
are modeled as "query-to-answer" problems. We pre-train a task-agnostic
encoder-decoder with
query-mask pairs. During fine-tuning, GLID maintains the pre-trained
encoder-decoder and queries, only replacing the topmost linear transformation
layer with task-specific linear heads. This minimizes the pretrain-finetune
architecture inconsistency and enables the pre-trained model to better adapt to
downstream tasks. GLID achieves competitive performance on various vision
tasks, including object detection, image segmentation, pose estimation, and
depth estimation, outperforming or matching specialist models such as
Mask2Former, DETR, ViTPose, and BinsFormer.
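
The following is a minimal sketch of the scheme the abstract describes, not
the authors' reference implementation: a task-agnostic encoder-decoder with
learned queries, topped by a single linear layer that is swapped for a
task-specific linear head at fine-tuning time. All architecture details here
(the class name GLIDSketch, layer counts, hidden size, query count, output
dimensions) are illustrative assumptions; only the overall query-to-answer
structure and the head-swap follow the abstract.

```python
import torch
import torch.nn as nn

class GLIDSketch(nn.Module):
    """Hypothetical sketch of a generalist encoder-decoder in the GLID style."""

    def __init__(self, dim=256, num_queries=100, out_dim=768):
        super().__init__()
        # Task-agnostic encoder (stands in for a pre-trained ViT-style encoder).
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        # Learned queries, kept unchanged across pre-training and fine-tuning.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        dec_layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        # Topmost linear layer: predicts masked content during pre-training;
        # this is the only part replaced per downstream task.
        self.head = nn.Linear(dim, out_dim)

    def forward(self, tokens):                       # tokens: (B, N, dim)
        memory = self.encoder(tokens)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        answers = self.decoder(q, memory)            # "query-to-answer"
        return self.head(answers)                    # (B, num_queries, out_dim)

# Fine-tuning keeps the encoder, decoder, and queries; only the head changes,
# e.g. to per-query class logits for a detection-style task (91 classes is a
# hypothetical choice):
model = GLIDSketch()
model.head = nn.Linear(256, 91)
```

Keeping everything but the head fixed is what minimizes the pretrain-finetune
architecture inconsistency the abstract refers to: the whole pre-trained
encoder-decoder transfers, rather than only a backbone.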