PhD graduated
Team : MLIA
Localisation : Campus Pierre et Marie Curie
    Sorbonne Université - LIP6
    Boîte courrier 169
    Couloir 26-00, Étage 5, Bureau 525
    4 place Jussieu
    75252 PARIS CEDEX 05
Hedi.Ben-Younes (at)
Supervision : Matthieu CORD
Co-supervision : THOME Nicolas

Multi-modal representation learning towards visual reasoning

The quantity of images that populate the Internet is dramatically increasing. It becomes of critical importance to develop the technology for a precise and automatic understanding of visual contents. As image recognition systems are becoming more and more relevant, researchers in artificial intelligence now seek for the next generation vision systems that can perform high-level scene understanding.
In this thesis, we are interested in Visual Question Answering (VQA), which consists in building models that answer any natural language question about any image. Because of its nature and complexity, VQA is often considered as a proxy for visual reasoning. Classically, VQA architectures are designed as trainable systems that are provided with images, questions about them and their answers. To tackle this problem, typical approaches involve modern Deep Learning (DL) techniques. In the first part, we focus on developping multi-modal fusion strategies to model the interactions between image and question representations. More specifically, we explore bilinear fusion models and exploit concepts from tensor analysis to provide tractable and expressive factorizations of parameters. These fusion mechanisms are studied under the widely used visual attention framework: the answer to the question is provided by focusing only on the relevant image regions. In the last part, we move away from the attention mechanism and build a more advanced scene understanding architecture where we consider objects and their spatial and semantic relations. All models are thoroughly experimentally evaluated on standard datasets and the results are competitive with the literature.
Defence : 05/20/2019 - 10h - Site Jussieu 25-26/105
Jury members :
M. Jakob Verbeek, INRIA Grenoble [rapporteur]
M. Christian Wolf, INSA de Lyon [rapporteur]
M. Vittorio Ferrari, Google AI - University of Edinburgh
M. Yann LeCun, Facebook - NYU
M. Patrick Pérez, Valeo AI
Mme Laure Soulier, Sorbonne Université - LIP6
M. Nicolas Thome, CNAM - CEDRIC
M. Matthieu Cord, Sorbonne Université - LIP6

2019 Publications

 Mentions légales
Site map |