E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer

2024 IEEE Winter Conference on Applications of Computer Vision Workshops (WACVW 2024)

Citations: 0 | Views: 29
Keywords
Masked Videos, Learning Strategies, Efficient Learning, Masked Images, Semantic Labels, Inference Speed, Top-1 Accuracy, Region Labels, Transformer, Natural Language, Visual Representation, Extensive Experiments, Object Detection, Recognition Task, Obvious Advantages, Representation Learning, Video Frames, Multiple Tasks, Language Model, Masked Language Model, Discrete Labels, Video Retrieval, Visual Encoding, Latent Code, Video Encoding, Action Recognition, Language Mode, Prior Art, Linear Probe