xT: Nested Tokenization for Larger Context in Large Images
arXiv (2024)
Abstract
Modern computer vision pipelines handle large images in one of two
sub-optimal ways: down-sampling or cropping. These two methods incur
significant losses in the amount of information and context present in an
image. There are many downstream applications in which global context matters
as much as high frequency details, such as in real-world satellite imagery; in
such cases researchers have to make the uncomfortable choice of which
information to discard. We introduce xT, a simple framework for vision
transformers which effectively aggregates global context with local details and
can model large images end-to-end on contemporary GPUs. We select a set of
benchmark datasets across classic vision tasks that accurately reflect a
vision model's ability to understand truly large images and incorporate fine
details over large scales, and we assess our method's improvement on them. By
introducing a nested tokenization scheme for large images in conjunction with
long-sequence length models normally used for natural language processing, we
are able to increase accuracy by up to 8.6% on challenging classification tasks
and F_1 score by 11.6 on context-dependent segmentation in large images.
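
The abstract names a nested tokenization scheme but does not specify it. The following is only a minimal sketch of one plausible reading, assuming a two-level split: the image is first cut into non-overlapping square regions, and each region is then cut into non-overlapping square patches that are flattened into tokens. The function name `nested_tokenize` and the parameters `region_size` and `patch_size` are hypothetical and are not taken from the paper.

```python
import numpy as np


def nested_tokenize(image, region_size, patch_size):
    """Two-level tokenization sketch (hypothetical helper, not the paper's code).

    image: array of shape (H, W, C), with H and W divisible by region_size.
    Returns an array of shape
        (num_regions, patches_per_region, patch_size * patch_size * C),
    with regions and patches both listed in row-major order.
    """
    h, w, c = image.shape
    if h % region_size or w % region_size:
        raise ValueError("image sides must be divisible by region_size")
    if region_size % patch_size:
        raise ValueError("region_size must be divisible by patch_size")

    rows, cols = h // region_size, w // region_size  # regions per side
    per_side = region_size // patch_size             # patches per region side

    # Split each spatial axis into (region index, patch index, pixel index).
    x = image.reshape(rows, per_side, patch_size, cols, per_side, patch_size, c)
    # Reorder to (region row, region col, patch row, patch col, pixel row, pixel col, C).
    x = x.transpose(0, 3, 1, 4, 2, 5, 6)
    # Flatten regions, patches within a region, and pixels within a patch.
    return x.reshape(rows * cols, per_side * per_side, patch_size * patch_size * c)


# Toy example: an 8x8 single-channel image, 4x4 regions, 2x2 patches.
demo = np.arange(64).reshape(8, 8, 1)
tokens = nested_tokenize(demo, region_size=4, patch_size=2)
print(tokens.shape)  # (4, 4, 4)
```

Under this reading, each region yields a short token sequence that a standard vision backbone could encode on its own, and the per-region outputs could then be concatenated into one long sequence for a long-sequence model to relate across regions. How xT actually combines the two levels is not stated in the abstract.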