Multimodal machine learning: the case of vision-language transformers

title: Multimodal machine learning: the case of vision-language transformers
start_date: 2023/11/03
schedule: 14h-15h
online: no
location_info: Doyen 22 & via Teams
summary: Vision-language transformer models combine information from the textual and visual modalities to extract multimodal representations. These models serve as a basis for many multimodal vision-language tasks. Large pre-trained models based on the transformer architecture, inspired by recent advances in Natural Language Processing, have enabled substantial improvements on these tasks. In this presentation, I will give an overview of vision-language transformer models. I will introduce the different types of models in terms of architecture and pre-training methods, and present the strengths and weaknesses of each approach. Finally, I will discuss current challenges and emerging research trends in vision-language machine learning.
responsibles: Rolin