SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract)

Atharva Naik, Yash Parag Butala, Navaneethan Vaikunthan, Raghav Kapoor

AAAI 2024 (2024)

Abstract
When humans are posed with a difficult problem, they often approach it by identifying key skills, honing them, and finally combining them effectively. We propose a novel method, applied to the VizWiz VQA task, that predicts the visual skills needed to answer a question, leverages expert modules to produce intermediary outputs, and fuses those outputs in a skill-aware manner. Unlike prior work in visual question answering (VQA) that uses intermediate outputs such as detected objects and Optical Character Recognition (OCR), our approach explicitly guides the model with a skill embedding indicating what to focus on. While our results show that skill-aware fusion outperforms skill-unaware models on only a subset of questions, we believe they point to interesting directions for future work. We also release our code, model, and illustrative demonstrations for future research.
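The abstract only sketches the architecture, so the following is a minimal, hypothetical PyTorch sketch of what skill-aware fusion could look like: a skill classifier scores which skills a question needs, and those scores gate the features produced by per-skill expert modules. All names, dimensions, and the sigmoid-gating choice here are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SkillAwareFusion(nn.Module):
    """Hypothetical skill-aware fusion: weight expert-module outputs
    by predicted skill relevance. Names and design are illustrative."""

    def __init__(self, num_skills: int, feat_dim: int, query_dim: int):
        super().__init__()
        # Skill classifier: predicts which visual skills the question
        # needs (e.g., OCR, object detection, counting).
        self.skill_head = nn.Linear(query_dim, num_skills)
        # One learned projection per expert module's output.
        self.expert_proj = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim) for _ in range(num_skills)]
        )

    def forward(self, question_emb: torch.Tensor, expert_feats: torch.Tensor):
        # question_emb: (batch, query_dim) joint image+question embedding
        # expert_feats: (batch, num_skills, feat_dim) per-expert outputs
        skill_logits = self.skill_head(question_emb)
        skill_probs = torch.sigmoid(skill_logits)  # multi-label skill scores
        projected = torch.stack(
            [proj(expert_feats[:, i]) for i, proj in enumerate(self.expert_proj)],
            dim=1,
        )
        # Skill-aware fusion: scale each expert feature by its skill score,
        # then sum over experts.
        fused = (skill_probs.unsqueeze(-1) * projected).sum(dim=1)
        return fused, skill_probs


# Minimal usage example with random tensors.
if __name__ == "__main__":
    fusion = SkillAwareFusion(num_skills=4, feat_dim=512, query_dim=512)
    q = torch.randn(2, 512)
    experts = torch.randn(2, 4, 512)
    fused, probs = fusion(q, experts)
    print(fused.shape, probs.shape)  # torch.Size([2, 512]) torch.Size([2, 4])
```

Sigmoid gating (rather than softmax) is used in this sketch because a question may require several skills at once; the actual fusion mechanism in the paper may differ.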
Keywords
Computer Vision,Machine Learning,Machine Perception,Applications Of AI,Natural Language Processing,Multi-modal Vision,Multimodal Learning,Language And Vision,Language Grounding & Multi-modal NLP,Visual Question Answering,Question Answering