Supervised Fine-tuning in turn Improves Visual Foundation Models
arXiv (2024)
Abstract
Image-text training like CLIP has dominated the pretraining of vision
foundation models in recent years. Subsequent efforts have been made to
introduce region-level visual learning into CLIP's pretraining, but these face
scalability challenges due to the lack of large-scale region-level datasets.
Drawing inspiration from supervised fine-tuning (SFT) in natural language
processing, such as instruction tuning, we explore the potential of
fine-grained SFT to enhance the generalization of vision foundation models
after their pretraining. We therefore propose a two-stage method, ViSFT
(Vision SFT), to unleash the fine-grained knowledge of vision foundation
models. In ViSFT, the vision foundation model is enhanced by performing visual
joint learning on several in-domain tasks and is then tested on out-of-domain
benchmarks. After updating with ViSFT on 8 V100 GPUs in less than 2 days, a
vision transformer with over 4.4B parameters shows improvements across various
out-of-domain benchmarks, including vision and vision-linguistic scenarios.
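The core of the second stage is joint learning across several in-domain tasks that share one backbone. As a minimal toy sketch (all names and the scalar "backbone weight" are hypothetical, not the paper's implementation), the idea can be illustrated by summing per-task gradients into a single update of the shared parameter:

```python
# Toy sketch of joint multi-task supervised fine-tuning (hypothetical,
# not the ViSFT implementation): a shared scalar "backbone weight" w is
# updated by the averaged gradient of several in-domain task losses.

def joint_sft(w, task_targets, lr=0.1, steps=100):
    """Jointly fine-tune shared weight w on a list of task targets.

    Each task contributes a squared-error loss (w - t)^2; gradients
    are accumulated across tasks before each update, mimicking joint
    learning on multiple in-domain tasks with a shared backbone.
    """
    for _ in range(steps):
        # d/dw of sum_t (w - t)^2 = sum_t 2 * (w - t)
        grad = sum(2 * (w - t) for t in task_targets)
        w -= lr * grad / len(task_targets)
    return w

# The shared weight settles at a compromise among the in-domain tasks
# (here, the mean of the targets) rather than overfitting any single one.
w = joint_sft(0.0, [1.0, 2.0, 3.0])
```

In the toy setting the shared weight converges to the mean of the task targets, which captures the intuition that joint learning pulls the backbone toward features useful across tasks rather than specializing it to one.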