Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens
CoRR (2024)
Abstract
Many RGBT tracking studies focus primarily on modal fusion design while
overlooking the effective handling of target appearance changes. Although some
approaches introduce historical frames, or fuse or replace the initial
template, to incorporate temporal information, they risk disrupting the
original target appearance and accumulating errors over time. To alleviate
these limitations, we propose a novel Transformer RGBT tracking approach that
mixes spatio-temporal multimodal tokens from the static multimodal templates
and multimodal search regions within the Transformer to handle target
appearance changes, enabling robust RGBT tracking. We introduce independent
dynamic template tokens that interact with the search region, embedding
temporal information to address appearance changes, while retaining the
initial static template tokens in the joint feature extraction process. This
preserves the original, reliable target appearance information and prevents
the deviations from the target appearance caused by traditional temporal
updates. We also use attention mechanisms to enhance the target features of
the multimodal template tokens with supplementary cues from the other
modality, and let the multimodal search-region tokens interact with the
multimodal dynamic template tokens via attention, which conveys
multimodal-enhanced information about target changes. Our module is inserted
into the Transformer backbone and inherits joint feature extraction,
search-template matching, and cross-modal interaction. Extensive experiments
on three RGBT benchmark datasets show that the proposed approach achieves
performance competitive with state-of-the-art tracking algorithms while
running at 39.1 FPS.
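The core idea of the abstract, joint attention over concatenated static template, dynamic template, and search-region tokens, can be sketched as follows. This is a minimal single-head illustration, not the paper's implementation: the identity Q/K/V projections, token counts, and embedding size are hypothetical stand-ins for the learned projections and token layout of the actual model.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_token_attention(static_tpl, dyn_tpl, search):
    """Single-head self-attention over the concatenated token sequence.

    static_tpl : (Ns, d) tokens from the initial static multimodal template
    dyn_tpl    : (Nd, d) dynamic template tokens carrying temporal cues
    search     : (Nx, d) multimodal search-region tokens

    Mixing all three groups in one attention pass lets search tokens attend
    to both the reliable initial appearance and the updated appearance,
    mirroring the abstract's joint feature extraction idea.
    """
    tokens = np.concatenate([static_tpl, dyn_tpl, search], axis=0)
    d = tokens.shape[-1]
    # Identity projections for simplicity (hypothetical; a real tracker
    # would use learned W_q, W_k, W_v matrices per attention head).
    q = k = v = tokens
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)
    out = attn @ v
    ns, nd = static_tpl.shape[0], dyn_tpl.shape[0]
    # Split the mixed sequence back into its three token groups.
    return out[:ns], out[ns:ns + nd], out[ns + nd:]

rng = np.random.default_rng(0)
s_out, d_out, x_out = joint_token_attention(
    rng.normal(size=(4, 8)),   # static template tokens
    rng.normal(size=(4, 8)),   # dynamic template tokens
    rng.normal(size=(16, 8)),  # search-region tokens
)
```

Because every token attends to every other, the dynamic template tokens absorb temporal change cues from the search region while the static template tokens remain part of the computation, which is how the approach avoids overwriting the original appearance.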