Multimodal perception and reasoning

title: Multimodal perception and reasoning
start_date: 2025/02/21
schedule: 11h
online: no
location_info: videoconference via Big Blue Button
summary: Building on the strong textual processing capabilities of large language models (LLMs), large vision-language models (VLMs) extend LLMs to handle visual inputs. They have brought significant improvements to multimodal tasks such as visual question answering and image captioning, and in particular have paved the way for tasks involving complex visual reasoning. However, the transfer of LLMs' internal knowledge and reasoning abilities to multimodal tasks remains limited. In this talk, I will present two of my recent works on evaluating and improving VLMs' perception and reasoning capabilities.
responsibles: Bawden