In-Context Translation: Towards Unifying Image Recognition, Processing, and Generation
CoRR (2024)
Abstract
We propose In-Context Translation (ICT), a general learning framework to
unify visual recognition (e.g., semantic segmentation), low-level image
processing (e.g., denoising), and conditional image generation (e.g.,
edge-to-image synthesis). Thanks to this unification, ICT significantly reduces
the inductive bias inherent in designing models for specific tasks, and it
maximizes mutual enhancement across similar tasks. However, the
unification across a large number of tasks is non-trivial due to various data
formats and training pipelines. To this end, ICT introduces two designs.
Firstly, it standardizes input-output data of different tasks into RGB image
pairs, e.g., semantic segmentation data pairs an RGB image with its
segmentation mask in the same RGB format. This turns different tasks into a
general translation task between two RGB images. Secondly, it standardizes the
training of different tasks into a general in-context learning, where
"in-context" means the input comprises an example input-output pair of the
target task and a query image. The learning objective is to generate the
"missing" data paired with the query. The implicit translation process is thus
between the query and the generated image. In experiments, ICT unifies ten
vision tasks and showcases impressive performance on their respective
benchmarks. Notably, ICT is trained on only 4 RTX 3090 GPUs, making it more
efficient and less costly to train than competitors such as Painter and
PromptDiffusion.
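The two designs above can be illustrated with a minimal sketch. This is not the authors' implementation; the palette, function names, and the 2x2 canvas layout are assumptions for illustration. The idea: (1) a per-pixel label mask is standardized into an RGB image, and (2) an in-context input is formed by tiling an example input-output pair with a query image, leaving one quadrant "missing" for the model to generate.

```python
import numpy as np

# Hypothetical class-ID -> RGB palette (not specified in the paper).
PALETTE = np.array(
    [[0, 0, 0], [255, 0, 0], [0, 255, 0], [0, 0, 255]], dtype=np.uint8
)

def label_to_rgb(label):
    """Standardize a per-pixel class-ID segmentation mask into an RGB image,
    so segmentation data becomes an (RGB image, RGB image) pair."""
    return PALETTE[label]

def make_in_context_input(example_in, example_out, query):
    """Tile the example pair and the query onto one canvas. The bottom-right
    quadrant is left blank: it is the "missing" output the model generates,
    defining an implicit translation from the query to the generated image."""
    h, w, _ = query.shape
    canvas = np.zeros((2 * h, 2 * w, 3), dtype=np.uint8)
    canvas[:h, :w] = example_in    # top-left: example task input
    canvas[:h, w:] = example_out   # top-right: example task output
    canvas[h:, :w] = query         # bottom-left: query image
    return canvas                  # bottom-right stays empty for generation
```

Because every task is expressed this way, swapping the example pair (e.g., image/mask vs. noisy/clean) switches the task without changing the model interface.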