Boosting Zero-Shot Human-Object Interaction Detection with Vision-Language Transfer

Sandipan Sarma, Pradnesh Kalkar, Arijit Sur

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abstract
Human-Object Interaction (HOI) detection is a crucial task that involves localizing interactive human-object pairs and identifying the actions being performed. Most existing HOI detectors are supervised in nature and lack the ability to discover unseen interactions in a zero-shot manner. Recently, transformer-based methods have superseded traditional CNN-based detectors by aggregating image-wide context, but they still suffer from the long-tail distribution problem in HOI. In this work, our primary focus is improving HOI detection in images, particularly in zero-shot scenarios. We use an end-to-end transformer-based object detector to localize human-object pairs and yield visual features of actions and objects. Moreover, we adopt the text encoder from CLIP, a popular vision-language model, with a novel prompting mechanism to extract semantic information for unseen actions and objects. Finally, we learn a strong visual-semantic alignment and achieve state-of-the-art performance on the challenging HICO-DET dataset across five zero-shot settings, with up to 70.88% relative gains. Code is available at https://github.com/sandipan211/ZSHOI-VLT.
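The core recipe the abstract describes, encoding class names with CLIP's text encoder and matching detector-derived visual features against those embeddings by cosine similarity so that unseen classes can be scored zero-shot, can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the GitHub link above): the prompt template is a generic placeholder rather than the paper's novel prompting mechanism, the action and object label sets are made up, and the visual feature is a random stand-in assumed to be projected into CLIP's 512-dimensional embedding space. It uses the OpenAI CLIP package (pip install git+https://github.com/openai/CLIP).

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical label sets and a generic prompt template, for illustration only.
actions = ["riding", "holding", "feeding"]
objects = ["bicycle", "horse", "cup"]
prompts = [f"a photo of a person {a} a {o}" for a in actions for o in objects]

with torch.no_grad():
    # Embed every action-object prompt with CLIP's text encoder,
    # then L2-normalize so dot products become cosine similarities.
    text_emb = model.encode_text(clip.tokenize(prompts).to(device)).float()
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Placeholder for a detector's per-pair interaction feature, assumed to be
# projected into CLIP's 512-d text embedding space by a learned head.
visual_feat = torch.randn(1, 512, device=device)
visual_feat = visual_feat / visual_feat.norm(dim=-1, keepdim=True)

# Cosine similarity between the visual feature and every interaction prompt;
# unseen interactions are handled simply by adding their prompts to the list.
scores = visual_feat @ text_emb.t()
best = scores.argmax(dim=-1).item()
print(f"predicted interaction: {prompts[best]}")

Because classification reduces to nearest-neighbor search in the shared embedding space, extending the label space to unseen actions or objects requires only new text prompts, no retraining of the visual side, which is what makes the zero-shot transfer possible.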
Keywords
Human-object interaction, transformer, CLIP, zero-shot learning