Attention-Based Comparison of Automatic Image Caption Generation Encoders

NagaDurga Cheboyina Sindhu, Anuradha T.

Advances in Micro-Electronics, Embedded Systems and IoT (2022)

Abstract
Generating captions for images remains a challenging task. Image captioning combines computer vision and natural language processing (NLP); it has many applications in social networking and benefits people who are visually impaired. Typical architectures pair an encoder (a CNN) that extracts features from the input image with a decoder (an RNN) that serves as the language model, together with an attention mechanism that concentrates on the most relevant data to improve the model's performance. In this paper, VGG19 and ResNet152 are compared as encoders, with an LSTM as the decoder for caption generation. Alongside the decoder, a visual attention mechanism is used, which allows a human or a system to concentrate on the essential parts of the input data; visual attention is also widely used in video analytics. Both architectures are trained on the MSCOCO dataset, and the generated captions are compared with the ground-truth captions using the BLEU score. The proposed models generate captions that are 80 per cent accurate.
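The visual attention step described above can be sketched in a few lines: the decoder scores each spatial region of the CNN feature map against its current hidden state, normalizes the scores with a softmax, and takes the weighted sum as a context vector. This is a minimal NumPy sketch of soft (Bahdanau-style) attention; the weight matrices `W_f`, `W_h`, and `v`, and all dimensions, are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def visual_attention(features, hidden, W_f, W_h, v):
    """Soft attention over image regions.

    features: (num_regions, feat_dim) CNN region features
              (e.g. from a VGG19 or ResNet152 encoder)
    hidden:   (hid_dim,) current LSTM decoder state
    Returns the context vector and the attention weights.
    """
    # Score each region by combining its features with the decoder state
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # (num_regions,)
    # Softmax: attention weights are non-negative and sum to 1
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()
    # Context vector: attention-weighted sum of region features
    context = alpha @ features                            # (feat_dim,)
    return context, alpha

# Illustrative dimensions (a 7x7 feature map with 512 channels is typical)
rng = np.random.default_rng(0)
num_regions, feat_dim, hid_dim, att_dim = 49, 512, 256, 128
features = rng.standard_normal((num_regions, feat_dim))
hidden = rng.standard_normal(hid_dim)
W_f = rng.standard_normal((feat_dim, att_dim)) * 0.1
W_h = rng.standard_normal((hid_dim, att_dim)) * 0.1
v = rng.standard_normal(att_dim)

context, alpha = visual_attention(features, hidden, W_f, W_h, v)
```

At each decoding step the context vector is concatenated with the word embedding and fed to the LSTM, so the decoder can focus on different image regions while emitting different words.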
Keywords
Deep learning, Image captioning, ResNet152, VGG19, Visual attention, BLEU
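The BLEU comparison mentioned in the abstract can be sketched as follows. This is a simplified, single-reference sentence-level BLEU in pure Python (geometric mean of modified n-gram precisions with a brevity penalty); the paper's exact evaluation settings, such as smoothing and the number of references per image, are not specified here and are assumptions.

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference caption."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        # Modified n-gram precision: clip candidate counts by reference counts
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty for candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

score = bleu("a dog runs on the grass", "a dog runs on the grass")
```

A perfect match scores 1.0; swapping even one word lowers every n-gram precision that covers it, so the score drops smoothly toward 0.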