| title | Multimodal machine learning: the case of vision-language transformers |
|---|---|
| start_date | 2023/11/03 |
| schedule | 14h-15h |
| online | no |
| location_info | Doyen 22 & via Teams |
| summary | Vision-language transformer models combine information from the textual and visual modalities to extract multimodal representations. These models can serve as a basis for many multimodal vision-language tasks. Large pre-trained models based on the transformer architecture, inspired by recent advances in Natural Language Processing, have enabled great improvements on these tasks. In this presentation, I will give an overview of vision-language transformer models. I will introduce the different types of models in terms of architecture and pre-training methods. I will also present the strengths and weaknesses of these methods. Finally, I will discuss current challenges and emerging research trends in vision-language machine learning. |
| responsibles | Rolin |
Workflow history

| from state | to state | comment | date |
|---|---|---|---|
| submitted | published | | 2023/10/09 13:25 UTC |