Pre-training Unicoder-VL only slightly improves the performance. This might be because the pre-training task of image captioning operates at the perceptual level, while the visual commonsense reasoning task requires cognitive-level understanding.
Compared to our baseline model that uses only text pre-training, cross-modal pre-training tasks improve performance on all metrics, which validates the importance of Image-Text Pre-training for generation tasks.
We present several alternate ways of viewing the Retrieval-Augmented Language Model that connect it to a broader set of ideas beyond Open-domain Question Answering. Language modeling with corpus as context: language representation models have been incorporating contexts of increasingly...
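To make the "corpus as context" view concrete, below is a minimal sketch of retrieval-augmented language modeling: the probability of a continuation is marginalized over documents retrieved from a corpus. The embedding and language-model callables (`embed`, `lm_prob`) and the function name are hypothetical placeholders for illustration, not the actual implementation described in the excerpt.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def retrieval_augmented_lm_prob(query, corpus, embed, lm_prob, top_k=5):
    """p(y | x) = sum_z p(z | x) * p(y | x, z), with z ranging over retrieved docs.

    `embed` maps text to a dense vector; `lm_prob(query, doc)` returns the
    language-model probability of the target continuation given the query
    concatenated with the retrieved document. Both are assumed components.
    """
    q = embed(query)
    scores = np.array([q @ embed(doc) for doc in corpus])  # dense inner-product retrieval
    top = np.argsort(-scores)[:top_k]                      # keep only the top-k documents
    p_retrieve = softmax(scores[top])                      # p(z | x) over the retrieved set
    return sum(p_z * lm_prob(query, corpus[z])             # marginalize over documents
               for p_z, z in zip(p_retrieve, top))
```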
We find that 1) our pre-trained model improves performance over the baseline models by a large margin and achieves state-of-the-art results on two typical multimodal tasks; and 2) the pre-trained decoder benefits generation tasks such as captioning.
While many modern approaches to transfer learning for natural language processing use a Transformer architecture consisting of only a single “stack”, we found that using a standard encoder-decoder structure achieved good results on both generative and classification tasks.
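The key idea is that one encoder-decoder model handles both kinds of tasks by casting everything as text-to-text. The sketch below illustrates this with the publicly released `t5-small` checkpoint from the Hugging Face `transformers` library; the checkpoint name and task prefixes are illustrative examples, not the paper's exact experimental setup.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def text_to_text(prompt: str) -> str:
    """Both generative and classification tasks go through the same
    encoder (reads the prompt) and decoder (emits the target text)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# A generation task and a classification task, both framed as text-to-text.
print(text_to_text("translate English to German: The house is wonderful."))
print(text_to_text("cola sentence: The course is jumping well."))
```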
Compared with most previous biomedical text mining models that are mainly focused on a single task such as named entity recognition or question answering, our model BioBERT achieves state-of-the-art performance on various biomedical text mining tasks, while requiring only minimal...
We have proposed MASS (MAsked Sequence to Sequence pre-training) for language generation tasks, which reconstructs a sentence fragment given the remaining part of the sentence within the encoder-decoder framework.
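A minimal sketch of how such training pairs can be constructed: a contiguous fragment of the sentence is masked on the encoder side, and the decoder's target is to reconstruct exactly that fragment. Tokenization and the `MASK` symbol are simplified placeholders, not the paper's actual vocabulary handling.

```python
import random

MASK = "[MASK]"

def make_mass_example(tokens, mask_ratio=0.5):
    """Return (encoder_input, decoder_target) for one training example."""
    n = len(tokens)
    span_len = max(1, int(n * mask_ratio))           # length of the masked fragment
    start = random.randrange(0, n - span_len + 1)    # random start of the fragment
    encoder_input = tokens[:start] + [MASK] * span_len + tokens[start + span_len:]
    decoder_target = tokens[start:start + span_len]  # decoder reconstructs only the fragment
    return encoder_input, decoder_target

sentence = "the quick brown fox jumps over the lazy dog".split()
enc, dec = make_mass_example(sentence)
print(enc)  # e.g. ['the', 'quick', '[MASK]', '[MASK]', '[MASK]', '[MASK]', 'the', 'lazy', 'dog']
print(dec)  # e.g. ['brown', 'fox', 'jumps', 'over']
```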