InteractNet: Social Interaction Recognition for Semantic-rich Videos

ACM Transactions on Multimedia Computing, Communications, and Applications (2024)

Abstract
The overwhelming surge of online video platforms has raised an urgent need for social interaction recognition techniques. Compared with simple short-term actions, long-term social interactions in semantic-rich videos reflect more complex semantics, such as character relationships or emotions, and therefore better support various downstream applications, e.g., story summarization and fine-grained clip retrieval. However, given the longer duration of social interactions, their severe mutual overlap, and the involvement of multiple characters, dynamic scenes, and multi-modal cues, traditional solutions for short-term action recognition are likely to fail at this task. To address these challenges, in this paper we propose a hierarchical graph-based system, named InteractNet, to recognize social interactions from a multi-modal perspective. Specifically, our approach first generates a semantic graph for each sampled frame by integrating multi-modal cues, and then learns node representations as short-term interaction patterns via an adapted GCN module. Global interaction representations are then accumulated through a sub-clip identification module, which filters out irrelevant information and resolves temporal overlaps between interactions. Finally, associations among simultaneous interactions are captured and modelled by constructing a global-level character-pair graph to predict the final social interactions. Comprehensive experiments on publicly available datasets demonstrate the effectiveness of our approach compared with state-of-the-art baseline methods.
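To make the hierarchical pipeline described above concrete, the following is a minimal PyTorch sketch, not the authors' implementation: per-frame semantic graphs are encoded by a GCN layer, node features are accumulated over a sub-clip (mean pooling here is an assumption; the paper uses a dedicated sub-clip identification module), and character-pair features are classified into interaction categories. All module names, dimensions, and the pooling choice are illustrative assumptions.

```python
# Hypothetical sketch of InteractNet's hierarchy: frame-level semantic graphs
# -> GCN node features -> sub-clip accumulation -> character-pair prediction.
import torch
import torch.nn as nn


class GCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        a = adj + torch.eye(adj.size(-1), device=adj.device)  # add self-loops
        d = a.sum(-1).clamp(min=1e-6).pow(-0.5)               # D^-1/2
        a_norm = d.unsqueeze(-1) * a * d.unsqueeze(-2)        # normalize A
        return torch.relu(self.linear(a_norm @ h))


class InteractNetSketch(nn.Module):
    def __init__(self, feat_dim: int = 256, hidden: int = 128, num_classes: int = 8):
        super().__init__()
        self.gcn = GCNLayer(feat_dim, hidden)
        self.pair_head = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames, adjs, pair_idx):
        # frames:   (T, N, feat_dim) multi-modal node features per sampled frame
        # adjs:     (T, N, N) semantic-graph adjacency per frame
        # pair_idx: (P, 2) character-node index pairs to classify
        node_feats = self.gcn(frames, adjs)   # short-term interaction patterns
        clip_feats = node_feats.mean(dim=0)   # crude stand-in for sub-clip accumulation
        pairs = torch.cat(
            [clip_feats[pair_idx[:, 0]], clip_feats[pair_idx[:, 1]]], dim=-1
        )
        return self.pair_head(pairs)          # interaction logits per character pair


# Toy usage: 16 sampled frames, 5 graph nodes, 3 character pairs.
model = InteractNetSketch()
logits = model(
    torch.randn(16, 5, 256),
    torch.rand(16, 5, 5),
    torch.tensor([[0, 1], [0, 2], [1, 2]]),
)
print(logits.shape)  # torch.Size([3, 8])
```

A full system would replace the mean pooling with the paper's sub-clip identification module and add edges between co-occurring pairs in a global character-pair graph; this sketch only illustrates the per-frame-graph-to-pair-prediction data flow.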