UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Luo Huaishao
Ji Lei
Huang Haoyang
Chen Xilin
Keywords:
fine-tune, video retrieval, noise contrastive estimation, unified video, acoustic speech recognition

Abstract:

We propose UniViLM: a Unified Video and Language pre-training Model for multimodal understanding and generation. Motivated by the recent success of BERT-based pre-training techniques for NLP and image-language tasks, VideoBERT and CBT are proposed to exploit the BERT model for video and language pre-training using narrated instructional videos. […]

Introduction
  • With the recent advances of self-supervised learning, pre-training techniques play a vital role in learning good representations for vision and language.
  • Instructional videos provide rich visual, acoustic, and language information for people to acquire knowledge or learn how to perform a task.
  • The authors first propose to pre-train a unified video-language model using video and acoustic speech recognition (ASR) transcripts from instructional videos to learn a joint representation of both video and language.
  • The authors fine-tune this model on two typical multimodal tasks including text-based video retrieval for understanding and multimodal video captioning for generation.
  • Taking multimodal video captioning as an example, the model takes a video and its ASR transcript as input and predicts a caption sentence (see the sketch below)
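To make this input/output format concrete, here is a minimal sketch of how video clip features and transcript tokens could be concatenated and fed to a single Transformer encoder-decoder that emits a caption. This is not the authors' implementation: the class name, dimensions, and the use of PyTorch's nn.Transformer are illustrative assumptions only.

# Minimal sketch (assumption, not the paper's architecture): a unified encoder takes
# [video clip features ; transcript token embeddings] and a decoder generates the caption.
import torch
import torch.nn as nn

class ToyVideoLanguageCaptioner(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, video_feat_dim=1024):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)   # ASR transcript / caption tokens
        self.video_proj = nn.Linear(video_feat_dim, d_model)  # project per-clip visual features
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, transcript_ids, caption_ids):
        # Encoder input: projected video clips concatenated with embedded transcript tokens.
        enc_in = torch.cat([self.video_proj(video_feats),
                            self.text_embed(transcript_ids)], dim=1)
        dec_in = self.text_embed(caption_ids)
        hidden = self.transformer(enc_in, dec_in)
        return self.lm_head(hidden)                           # per-position caption logits

# Usage: 8 video clips plus 20 transcript tokens predict a 12-token caption.
model = ToyVideoLanguageCaptioner()
logits = model(torch.randn(2, 8, 1024),
               torch.randint(0, 30522, (2, 20)),
               torch.randint(0, 30522, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 30522])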
Highlights
  • With the recent advances of self-supervised learning, pre-training techniques play a vital role in learning good representations for vision and language
  • Our work differs from VideoBERT and CBT in two aspects: 1) previous work only pre-trains the model on the understanding task, while we explore pre-training on both understanding and generation tasks; 2) they fine-tune on downstream tasks for a better video representation with only video as input, while our goal is to learn a joint video-language representation through downstream multimodal tasks
  • We list our contributions below: 1) We propose a multimodal video-language pre-training model trained on a large-scale instructional video dataset, which is a unified model for both video-language understanding and generation tasks
  • According to our extensive experiments on text-based video retrieval, we find that: 1) our model can largely improve the performance of the video-and-language understanding task; 2) with more training data, our model performs consistently better; 3) our model outperforms baselines on both in-domain and out-of-domain data and achieves state-of-the-art results
  • According to our extensive experiments on multimodal video captioning, our key findings are: 1) our pre-trained model can improve the performance of the generation task with the help of the pre-trained decoder; 2) our model outperforms baseline models on the multimodal video captioning task and achieves state-of-the-art results
  • Our model achieves state-of-the-art results in both tasks
  • We find that 1) our pre-trained model can improve performance to a large extent over the baseline models and achieve state-of-the-art results on two typical multimodal tasks; 2) the pre-trained decoder can benefit generation tasks such as captioning
Methods
  • [Flattened results table; only the compared systems and metrics are recoverable.] The captioning baselines are Bi-LSTM (Zhou et al, 2018a), EMT (Zhou et al, 2018b), VideoBERT and VideoBERT (+S3D) (Sun et al, 2019b), CBT (Sun et al, 2019a), DPC (Shi et al, 2019), and AT+Video (Hessel et al, 2019), compared against five configurations of our model.
  • Inputs are either video only or video + transcript, with differing amounts of pre-training data; results are reported in BLEU-3 (B-3), METEOR (M), and ROUGE-L (R-L). See Table 3 for the full numbers.
Results
  • The authors' extensive experiments show that the method can improve the performance of both understanding and generation tasks and achieves state-of-the-art results.
  • According to the extensive experiments on text-based video retrieval, the authors find that: 1) the model can largely improve the performance of the video-and-language understanding task; 2) with more training data, the model performs consistently better; 3) the model outperforms baselines on both in-domain and out-of-domain data and achieves state-of-the-art results.
  • According to the extensive experiments on multimodal video captioning, the key findings are: 1) the pre-trained model can improve the performance of the generation task with the help of the pre-trained decoder; 2) the model outperforms baseline models on the multimodal video captioning task and achieves state-of-the-art results
Conclusion
  • The authors study self-supervised learning for video and language representation on large-scale videos and pre-train a multimodal model using video and the corresponding ASR transcript.
  • The authors propose a unified pre-training model for both understanding and generation tasks.
  • The authors conduct extensive experiments evaluating the model on two downstream tasks: text-based video retrieval and multimodal video captioning.
  • The authors find that 1) the pre-trained model can improve performance to a large extent over the baseline models and achieve state-of-the-art results on two typical multimodal tasks; 2) the pre-trained decoder can benefit generation tasks such as captioning.
  • The authors will investigate the performance of the model on a larger dataset and more downstream tasks
Tables
  • Table 1: Results of text-based video retrieval on the YouCook2 dataset. PT stands for pre-training and FT for fine-tuning. † denotes re-running the code of the HowTo100M model on our dataset
  • Table 2: Results of text-based video retrieval on the MSR-VTT dataset. PT stands for pre-training and FT for fine-tuning. † denotes re-running the code of the HowTo100M model on our dataset
  • Table 3: The multimodal video captioning results on the YouCook2 dataset
Related work
  • Single Modal Pre-Training Self-supervised representation learning has been shown to be effective for sequential data including language and video. Language pre-training models including BERT (Devlin et al, 2019), GPT (Radford et al, 2018), RoBERTa (Liu et al, 2019), XLNet (Yang et al, 2019), MASS (Song et al, 2019), UniLM (Dong et al, 2019), and BART (Lewis et al, 2019) have achieved great success on NLP tasks. BERT (Devlin et al, 2019) is a denoising auto-encoder network using Transformer with MLM (masked language model) and NSP (next sentence prediction) as pre-training tasks and has strong performance on understanding tasks. MASS (Song et al, 2019) focuses on pre-training for generation tasks. UniLM (Dong et al, 2019) and BART (Lewis et al, 2019) further study a unified pre-training model for both understanding and generation tasks.
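As a concrete illustration of the MLM pre-training task just mentioned, here is a minimal sketch of BERT-style token masking. The 15% masking rate and the [MASK] token id of 103 follow the common public BERT-base setup; the 80/10/10 replacement split and special-token handling are omitted for brevity, so this is a simplification rather than the exact procedure.

# Minimal MLM masking sketch (simplified; assumes a BERT-base-style vocabulary).
import torch

def mask_tokens(input_ids, mask_token_id=103, mask_prob=0.15):
    # Labels keep the original ids only at masked positions; -100 is the value
    # ignored by the cross-entropy loss typically used for MLM.
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob
    labels[~mask] = -100
    corrupted = input_ids.clone()
    corrupted[mask] = mask_token_id
    return corrupted, labels

# Example: corrupt a batch of two 12-token sequences.
ids = torch.randint(1000, 2000, (2, 12))
corrupted, labels = mask_tokens(ids)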

    Video representation learning mostly focuses on video sequence reconstruction or future frame prediction as pre-training (pretext) tasks. Early works (Mathieu et al, 2015; Srivastava et al, 2015; Han et al, 2019) aim to synthesize video frames from image patches. Similarly, Wang and Gupta (2015) adopt a Siamese-triplet network to rank patches from the same video as more similar than patches from different videos. Other works predict feature vectors in latent space using auto-regressive models with noise contrastive estimation (NCE) (Lotter et al, 2016; Oord et al, 2018). Sun et al (2019a) adopt NCE to predict corrupted (masked) features in latent space with an auto-encoder model; a sketch of this style of contrastive objective follows.
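Because noise contrastive estimation appears both here and in the paper's keywords, the following is a minimal InfoNCE-style sketch of latent-space contrastive prediction, loosely in the spirit of CPC/CBT rather than any cited paper's exact objective; the temperature value and the use of in-batch negatives are assumptions.

# Minimal InfoNCE-style contrastive loss sketch over latent features (illustrative only).
import torch
import torch.nn.functional as F

def info_nce_loss(pred, target, temperature=0.07):
    # pred, target: (batch, dim) latent features; row i of target is the positive
    # for row i of pred, and the remaining rows in the batch act as negatives.
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = pred @ target.t() / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(pred.size(0))        # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

loss = info_nce_loss(torch.randn(16, 128), torch.randn(16, 128))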
Reference
  • Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. 2019a. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066.
  • Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019b. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
  • Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 605.
  • Tengda Han, Weidi Xie, and Andrew Zisserman. 2019. Video representation learning by dense predictive coding. In Proceedings of the IEEE International Conference on Computer Vision Workshops.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. 2018. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6546–6555.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
  • Jack Hessel, Bo Pang, Zhenhai Zhu, and Radu Soricut. 2019. A case study on combining ASR and visual features for generating instructional video captions. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL).
  • Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
  • Benjamin Klein, Guy Lev, Gil Sadeh, and Lior Wolf. 2015. Associating neural word embeddings with deep image representations using Fisher vectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4437–4446.
  • Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 706–715.
  • William Lotter, Gabriel Kreiman, and David Cox. 2016. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104.
  • Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23.
  • Michael Mathieu, Camille Couprie, and Yann LeCun. 2015. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440.
  • Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE International Conference on Computer Vision.
  • Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  • Shruti Palaskar, Jindrich Libovicky, Spandana Gella, and Florian Metze. 2019. Multimodal abstractive summarization for How2 videos. arXiv preprint arXiv:1906.07901.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318.
  • Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5288–5296.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI technical report.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems.
  • Botian Shi, Lei Ji, Yaobo Liang, Nan Duan, Peng Chen, Zhendong Niu, and Ming Zhou. 2019. Dense procedure captioning in narrated instructional videos. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 6382–6391.
  • Youngjae Yu, Jongseok Kim, and Gunhee Kim. 2018. A joint sequence fusion model for video question answering and retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pages 471–487.
  • Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.
  • Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. 2019. Unified vision-language pre-training for image captioning and VQA. arXiv preprint arXiv:1909.11059.
  • Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. 2015. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pages 843–852.
  • Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. 2019a. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743.
  • Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019b. VideoBERT: A joint model for video and language representation learning. In Proceedings of the IEEE International Conference on Computer Vision.
  • Luowei Zhou, Chenliang Xu, and Jason J. Corso. 2018a. Towards automatic learning of procedures from web instructional videos. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard Socher, and Caiming Xiong. 2018b. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8739–8748.
  • Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.
  • Xiaolong Wang and Abhinav Gupta. 2015. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.