Efficient CNNs and Transformers for Video Understanding and Image Synthesis

ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval (2023)

Abstract
In this talk, I will first discuss approaches that reduce the GFLOPs required at inference by 3D convolutional neural networks (CNNs) and vision transformers. While state-of-the-art 3D CNNs and vision transformers achieve very good results on action recognition datasets, they are computationally expensive and require many GFLOPs. Although the GFLOPs of a 3D CNN or vision transformer can be decreased by reducing the temporal feature resolution or the number of tokens, no single setting is optimal for all input clips. I will therefore discuss two differentiable sampling approaches that can be plugged into any existing 3D CNN or vision transformer architecture. These sampling approaches adapt the computational resources to the input video, so that only as many resources as are needed to classify the video are used and no more. They substantially reduce the computational cost (GFLOPs) of state-of-the-art networks while preserving accuracy. In the second part, I will discuss an approach that generates annotated training samples of very rare classes. It is based on a generative adversarial network (GAN) that jointly synthesizes images and the corresponding segmentation mask for each image. The generated data can then be used for one-shot video object segmentation.
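As a rough illustration of the first idea, the sketch below shows one way a differentiable token-selection module could be plugged into a vision transformer in a PyTorch setting. The `TokenSelector` module, its scoring MLP, and the straight-through Gumbel-Softmax relaxation are illustrative assumptions, not the specific sampling approaches presented in the talk.

```python
# Illustrative sketch only: a differentiable keep/drop decision per token.
# During training the decision stays differentiable via Gumbel-Softmax; at
# inference, tokens with keep_mask == 0 can actually be removed before the
# remaining transformer blocks, which is what saves GFLOPs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenSelector(nn.Module):
    """Scores each token and produces a (soft) keep/drop decision."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 2),  # logits for [drop, keep]
        )

    def forward(self, tokens: torch.Tensor, tau: float = 1.0):
        # tokens: (batch, num_tokens, dim)
        logits = self.scorer(tokens)                          # (B, N, 2)
        if self.training:
            # Straight-through Gumbel-Softmax keeps the hard decision differentiable.
            decision = F.gumbel_softmax(logits, tau=tau, hard=True)
        else:
            decision = F.one_hot(logits.argmax(dim=-1), num_classes=2).float()
        keep_mask = decision[..., 1:2]                        # (B, N, 1)
        # Training: the mask gates tokens; inference: dropped tokens are pruned.
        return tokens * keep_mask, keep_mask


if __name__ == "__main__":
    x = torch.randn(2, 196, 384)      # e.g. 196 patch tokens of width 384
    selector = TokenSelector(dim=384)
    gated_tokens, mask = selector(x)
    print(gated_tokens.shape, mask.mean().item())
```

Because the number of kept tokens depends on the learned scores for each clip, the compute adapts to the input rather than being fixed by a single global setting.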
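For the second part, the following sketch illustrates the general idea of a generator that jointly produces an image and an aligned segmentation mask from a single latent code. The `JointGenerator` architecture, its resolution, and its two output heads are illustrative assumptions, not the GAN described in the talk.

```python
# Illustrative sketch only: a DCGAN-style generator with a shared backbone and
# two heads, so the synthesized image and its segmentation mask stay aligned.
import torch
import torch.nn as nn


class JointGenerator(nn.Module):
    """Maps a latent vector to a 64x64 RGB image plus an aligned binary mask."""

    def __init__(self, z_dim: int = 128, base: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.ConvTranspose2d(z_dim, base * 8, 4, 1, 0), nn.BatchNorm2d(base * 8), nn.ReLU(True),    # 4x4
            nn.ConvTranspose2d(base * 8, base * 4, 4, 2, 1), nn.BatchNorm2d(base * 4), nn.ReLU(True),  # 8x8
            nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1), nn.BatchNorm2d(base * 2), nn.ReLU(True),  # 16x16
            nn.ConvTranspose2d(base * 2, base, 4, 2, 1), nn.BatchNorm2d(base), nn.ReLU(True),          # 32x32
        )
        # Two heads on shared features: one for the image, one for the mask.
        self.image_head = nn.Sequential(nn.ConvTranspose2d(base, 3, 4, 2, 1), nn.Tanh())     # 64x64 RGB
        self.mask_head = nn.Sequential(nn.ConvTranspose2d(base, 1, 4, 2, 1), nn.Sigmoid())   # 64x64 mask

    def forward(self, z: torch.Tensor):
        feats = self.backbone(z.view(z.size(0), -1, 1, 1))
        return self.image_head(feats), self.mask_head(feats)


if __name__ == "__main__":
    g = JointGenerator()
    img, mask = g(torch.randn(4, 128))
    print(img.shape, mask.shape)   # (4, 3, 64, 64), (4, 1, 64, 64)
```

Each sampled latent code then yields an image together with its pixel-level annotation, which is what makes the generated data usable as annotated training samples for rare classes.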