An End-to-End Deep Learning Approach for Video Captioning Through Mobile Devices

Rafael J. Pezzuto Damaceno, Roberto M. Cesar

Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, CIARP 2023, Pt I (2024)

Abstract
Video captioning is a computer vision task that aims to generate a description of video content. This can be achieved with deep learning approaches that leverage image and audio data. In this work, we developed two strategies to tackle this task in the context of resource-constrained devices: (i) generating one caption per frame combined with audio classification, and (ii) generating one caption for a set of frames combined with audio classification. In both strategies, we use one architecture for the image data and another for the audio data. We developed an application tailored to resource-constrained devices, in which the image sensor captures frames at a specific frame rate and the audio data is captured from a microphone for a predefined duration at a time. Our application combines the results from both modalities to create a comprehensive description. The main contribution of this work is a new end-to-end application that can employ either strategy and is useful for environment monitoring. Our method has been implemented on a low-resource computer, which poses a significant challenge.
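The abstract does not give implementation details, but the fusion step of strategy (i) — merging per-frame captions with an audio classification result into one description — can be sketched as follows. This is a hypothetical illustration, not the authors' code; the function name, the deduplication of repeated captions, and the majority vote over audio labels are all assumptions.

```python
from collections import Counter

def combine_modalities(frame_captions, audio_labels):
    """Hypothetical sketch of strategy (i): fuse per-frame captions
    with audio classification labels into a single description."""
    # Collapse consecutive identical captions to avoid repetition
    # when the scene does not change between frames.
    deduped = []
    for caption in frame_captions:
        if not deduped or deduped[-1] != caption:
            deduped.append(caption)
    visual_part = "; ".join(deduped)

    # Take the most frequent audio label over the capture window
    # (a simple majority vote; the paper does not specify the rule).
    if audio_labels:
        audio_part = Counter(audio_labels).most_common(1)[0][0]
    else:
        audio_part = "no audio"

    return f"{visual_part} [audio: {audio_part}]"

description = combine_modalities(
    ["a dog runs", "a dog runs", "a dog barks"],
    ["barking", "barking", "wind"],
)
# → "a dog runs; a dog barks [audio: barking]"
```

Strategy (ii) would differ only in that a single caption is produced for the whole set of frames, so `frame_captions` would hold one element.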
Keywords
Video captioning, Mobile device, Deep learning