FNet with Cross-Attention Encoder for Visual Question Answering

Zekrallah Samar I., Khalifa Nour El-Deen, Hassanin Aboul Ella

Proceedings of the 8th International Conference on Advanced Intelligent Systems and Informatics 2022 (2022)

Abstract
Visual question answering (VQA) is a challenging research area in which a model must understand the semantics of an image together with the asked question in order to give a correct answer. Recently, transformers have outperformed traditional sequence-to-sequence models such as LSTMs and RNNs in deep learning. Transformer models rely on the attention mechanism, and using it in complex models requires long training times and substantial resources. In this paper, the VQA task is accomplished with three transformer encoders in which the self-attention sub-layers are replaced by Fourier transforms, as in FNet, to mix input tokens in the question and image encoders. Self-attention is kept only in the cross-modality encoder to improve accuracy. The experiment is carried out in two phases: first, pre-training on a subset of the LXMERT dataset (5.99% of LXMERT's instances) due to resource limitations; second, fine-tuning on the VQA v2 dataset. This yields 24% faster pre-training, but testing accuracy decreases by 5.61% compared with using BERT self-attention in all sub-layers. A model pre-trained with FNet sub-layers only trains 30.6% faster than one using only BERT self-attention sub-layers, but achieves a lower testing accuracy (48.79%).
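To make the architecture described above concrete, the following is a minimal PyTorch sketch (not the authors' code) of the two building blocks the abstract names: an FNet token-mixing sub-layer, where a Fourier transform replaces self-attention inside the question and image encoders, and a cross-modality attention sub-layer, where standard attention is retained. Class names, dimensions, and the toy shapes are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn


class FNetMixing(nn.Module):
    """FNet sub-layer: mixes tokens with a 2D FFT, keeping only the real part."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FFT over the hidden dimension, then over the sequence dimension,
        # as in the FNet paper (Lee-Thorp et al.); no learned attention weights.
        mixed = torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real
        return self.norm(x + mixed)  # residual connection + LayerNorm


class CrossModalityAttention(nn.Module):
    """Cross-attention: one modality queries the other (e.g. question -> image)."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Queries come from one modality; keys/values from the other.
        out, _ = self.attn(queries, context, context)
        return self.norm(queries + out)


# Toy usage: 16 question tokens and 36 Faster-RCNN region features, d_model=768
# (hypothetical sizes for illustration only).
q = torch.randn(2, 16, 768)            # question token embeddings
v = torch.randn(2, 36, 768)            # image region features
q = FNetMixing(768)(q)                 # FNet mixing in the question encoder
v = FNetMixing(768)(v)                 # FNet mixing in the image encoder
q = CrossModalityAttention(768)(q, v)  # attention only in the cross-modality encoder
```

Because the FFT mixing layer has no learned attention weights, it is cheaper per step than self-attention, which is consistent with the faster pre-training the abstract reports.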
Keywords
Faster-RCNN, Artificial Intelligence, Deep Learning, Natural Language Processing, Transformers, Fourier Transform, BERT, FNet