Double-Fine-Tuning Multi-Objective Vision-and-Language Transformer for Social Media Popularity Prediction

MM '23: Proceedings of the 31st ACM International Conference on Multimedia (2023)

Abstract
Social media popularity prediction aims to predict the future interaction or attractiveness of new posts. However, most existing works do not treat numerical features effectively: although these features carry substantial information, they are often processed inadequately, so much of that information is lost. In this paper, we introduce the Double-Fine-Tuning Multi-Objective Vision-and-Language Transformer (DFT-MOVLT). To supplement the information available during vision-and-language pre-training (VLP), we propose compound text, formed by concatenating numerical data with text. Furthermore, during VLP, a transformer is trained with three objectives to ensure thorough feature extraction. Finally, for more generalizable prediction, we fine-tune two models using different training strategies and ensemble them. To evaluate the effectiveness of each mechanism in the proposed method, we conduct a series of ablation experiments. Our team achieved 3rd place in the Social Media Prediction (SMP) Challenge 2023.
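
The abstract does not specify how compound text is formatted or how the two fine-tuned models are combined, so the following is only a minimal sketch under stated assumptions. The function names (`build_compound_text`, `ensemble_mean`), the `name: value` rendering, the ` [SEP] ` separator, the example field names, and the equal-weight average are all hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import List, Mapping, Sequence


def build_compound_text(
    text: str,
    numeric: Mapping[str, float],
    sep: str = " [SEP] ",  # assumed separator; the paper's actual format is not given
) -> str:
    """Concatenate numerical fields, rendered as 'name: value', with the post text."""
    numeric_str = ", ".join(f"{name}: {value:g}" for name, value in numeric.items())
    # Fall back to the plain text when there are no numerical fields.
    return f"{numeric_str}{sep}{text}" if numeric_str else text


def ensemble_mean(preds_a: Sequence[float], preds_b: Sequence[float]) -> List[float]:
    """Average the predictions of two models (assumed equal weights)."""
    if len(preds_a) != len(preds_b):
        raise ValueError("both models must predict for the same posts")
    return [(a + b) / 2 for a, b in zip(preds_a, preds_b)]


if __name__ == "__main__":
    # Hypothetical numerical fields for a single post.
    compound = build_compound_text(
        "sunset at the beach", {"followers": 1200, "photo_count": 35}
    )
    print(compound)
    print(ensemble_mean([1.0, 3.0], [3.0, 5.0]))
```

Rendering the numbers as text lets a vision-and-language transformer consume them through its ordinary tokenizer, which is one plausible reading of "concatenated by numerical data and text".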