Fine-grained Video Captioning via Precise Key Point Positioning

Yunjie Zhang, Tianyang Xu, Xiaoning Song, Zhenhua Feng, Xiao-Jun Wu

PIC '22: Proceedings of the 4th on Person in Context Workshop (2022)

Abstract
In recent years, a variety of strong dense video captioning models have emerged. However, most of these models focus on global features and salient events in the video. In the makeup dataset used in this competition, the video content across clips is highly similar, differing only in fine details; because such models lack the ability to attend to fine-grained features, they generate captions poorly on this data. Motivated by this, this paper proposes a key-point detection algorithm for the human face and hands that runs in sync with video frame extraction, and fuses the detected auxiliary features into the existing features, so that the existing video captioning system can attend to fine-grained cues. To further improve the quality of the generated captions, we use the TSP model to extract more effective video features. Our model outperforms the baseline.
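The fusion step described above, appending per-frame face and hand key-point features to the existing video features, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, feature shapes, and landmark counts (68 face, 21 hand points) are assumptions for demonstration.

```python
import numpy as np

def fuse_keypoint_features(video_feat: np.ndarray,
                           face_kpts: np.ndarray,
                           hand_kpts: np.ndarray) -> np.ndarray:
    """Append flattened key-point coordinates to each frame's feature vector.

    video_feat: (T, D)      per-frame features from the captioning backbone
                            (e.g. TSP features, as in the abstract)
    face_kpts:  (T, Kf, 2)  2-D face landmarks detected per extracted frame
    hand_kpts:  (T, Kh, 2)  2-D hand landmarks detected per extracted frame
    returns:    (T, D + 2*Kf + 2*Kh)  fused feature matrix
    """
    T = video_feat.shape[0]
    # Flatten the (x, y) landmark grids into one auxiliary vector per frame.
    aux = np.concatenate([face_kpts.reshape(T, -1),
                          hand_kpts.reshape(T, -1)], axis=1)
    # Encapsulate the auxiliary features into the existing features.
    return np.concatenate([video_feat, aux], axis=1)

# Toy example: 8 frames, 512-d backbone features,
# 68 face landmarks and 21 hand landmarks per frame (assumed counts).
fused = fuse_keypoint_features(np.zeros((8, 512)),
                               np.zeros((8, 68, 2)),
                               np.zeros((8, 21, 2)))
print(fused.shape)  # (8, 690)
```

A captioning decoder consuming the fused matrix then sees both the global clip features and the localized face/hand geometry, which is the mechanism the abstract credits for the fine-grained focus.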