Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers
arXiv (2023)
Abstract
Most models of visual attention aim at predicting either top-down or
bottom-up control, as studied using different visual search and free-viewing
tasks. In this paper we propose the Human Attention Transformer (HAT), a single
model that predicts both forms of attention control. HAT uses a novel
transformer-based architecture and a simplified foveated retina that
collectively create a spatio-temporal awareness akin to the dynamic visual
working memory of humans. HAT not only establishes a new state-of-the-art in
predicting the scanpath of fixations made during target-present and
target-absent visual search and “taskless” free viewing, but also makes human
gaze behavior interpretable. Unlike previous methods, which rely on a coarse grid
of fixation cells and lose information when fixations are discretized,
HAT features a sequential dense prediction architecture and
outputs a dense heatmap for each fixation, so fixations never need to be
discretized. HAT sets a new standard in computational attention, which emphasizes
effectiveness, generality, and interpretability. HAT's demonstrated scope and
applicability will likely inspire the development of new attention models that
can better predict human behavior in various attention-demanding scenarios.
Code is available at https://github.com/cvlab-stonybrook/HAT.
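
The contrast between grid-based and dense fixation prediction can be illustrated with a minimal sketch. It is not taken from the HAT code base: the image size, the 32x20 grid, the Gaussian width, and all function names are assumptions chosen only for illustration. The sketch snaps a fixation to the centre of its coarse grid cell and compares the resulting localization error with the error obtained by reading the fixation back from a per-pixel heatmap.

```python
import numpy as np

def quantize_to_grid(x, y, width, height, cols, rows):
    """Snap a fixation (in pixels) to the centre of its coarse grid cell."""
    cell_w, cell_h = width / cols, height / rows
    col = min(int(x // cell_w), cols - 1)
    row = min(int(y // cell_h), rows - 1)
    return (col + 0.5) * cell_w, (row + 0.5) * cell_h

def dense_heatmap(x, y, width, height, sigma=8.0):
    """Per-pixel Gaussian heatmap centred on the fixation (illustrative only)."""
    ys, xs = np.mgrid[0:height, 0:width]
    h = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return h / h.sum()  # normalise so the map sums to 1

def heatmap_argmax(h):
    """Recover the (x, y) pixel location with the highest heatmap value."""
    row, col = np.unravel_index(np.argmax(h), h.shape)
    return float(col), float(row)

# Hypothetical image size, grid resolution, and fixation location.
width, height = 512, 320
fix = (301.0, 77.0)

# Coarse grid: the fixation is replaced by its cell centre.
gx, gy = quantize_to_grid(*fix, width, height, cols=32, rows=20)
grid_err = np.hypot(gx - fix[0], gy - fix[1])

# Dense map: the fixation is read back at pixel resolution.
dx, dy = heatmap_argmax(dense_heatmap(*fix, width, height))
dense_err = np.hypot(dx - fix[0], dy - fix[1])

print(f"grid error:  {grid_err:.2f} px")
print(f"dense error: {dense_err:.2f} px")
```

With these assumed settings each grid cell is 16x16 pixels, so the quantized fixation can be off by several pixels, while the dense map recovers the pixel location exactly. How HAT itself produces and decodes its heatmaps is described in the paper and the linked repository, not in this sketch.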