PicSOM Experiments in TRECVID 2018: Workshop Notebook Paper
Abstract
This year, the PicSOM group participated only in the Video to Text (VTT) Description Generation subtask. For our submitted runs we used either the MSR-VTT dataset alone, or MS COCO and MSR-VTT jointly, for training. We used LSTM recurrent neural networks to generate descriptions based on multi-modal features extracted from the videos. We submitted four runs:

• PICSOM 1: uses ResNet features for initialising the LSTM generator, and object and scene-type detection features as persistent input to the generator; trained on MS COCO + MSR-VTT,
• PICSOM 2: uses ResNet and object detection features for initialisation; trained on MS COCO + MSR-VTT; this is the only run based on our new PyTorch codebase,
• PICSOM 3: uses ResNet and video category features for initialisation, and trajectory and audio-visual embedding features as persistent input; trained on MSR-VTT only,
• PICSOM 4: same as PICSOM 3, except that the audio-visual embedding features are replaced with audio class detection outputs.

The most significant difference between our runs came from expanding the original MSR-VTT training data with MS COCO, which contains images annotated with captions. Having a larger and more diverse training set appears to improve the performance measures more than using more advanced features does. This finding has also been confirmed by our post-submission experiments, which are still ongoing.
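The abstract distinguishes two roles for features: "initialisation" features that set the LSTM generator's starting state, and "persistent" features that are fed to the generator at every decoding step. The following is a minimal PyTorch sketch of that generic pattern, not the authors' code; all class, parameter, and dimension names here are hypothetical choices for illustration.

# Sketch of an LSTM caption generator where "init" features (e.g. pooled
# ResNet activations) initialise the hidden state, and "persistent"
# features (e.g. object/scene detection scores) are concatenated to the
# word embedding at every timestep. Names and sizes are illustrative only.
import torch
import torch.nn as nn

class CaptionGenerator(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim,
                 init_feat_dim, persist_feat_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Project the init features to the LSTM's initial hidden/cell states.
        self.init_h = nn.Linear(init_feat_dim, hidden_dim)
        self.init_c = nn.Linear(init_feat_dim, hidden_dim)
        # Persistent features are appended to the input at every step.
        self.lstm = nn.LSTM(embed_dim + persist_feat_dim, hidden_dim,
                            batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, init_feats, persist_feats, captions):
        # init_feats:    (batch, init_feat_dim)
        # persist_feats: (batch, persist_feat_dim)
        # captions:      (batch, seq_len) token ids (teacher forcing)
        h0 = torch.tanh(self.init_h(init_feats)).unsqueeze(0)
        c0 = torch.tanh(self.init_c(init_feats)).unsqueeze(0)
        words = self.embed(captions)                         # (B, T, E)
        persist = persist_feats.unsqueeze(1).expand(
            -1, words.size(1), -1)                           # (B, T, P)
        rnn_in = torch.cat([words, persist], dim=-1)
        hidden, _ = self.lstm(rnn_in, (h0, c0))
        return self.out(hidden)                              # (B, T, V) logits

# Example with made-up sizes: a batch of 2 videos, captions of length 5.
model = CaptionGenerator(vocab_size=10000, embed_dim=256, hidden_dim=512,
                         init_feat_dim=2048, persist_feat_dim=80)
logits = model(torch.randn(2, 2048), torch.randn(2, 80),
               torch.randint(0, 10000, (2, 5)))              # -> (2, 5, 10000)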