PhD graduated
Team : MLIA
Departure date : 07/20/2019

Supervision : Matthieu CORD

Co-supervision : Nicolas THOME

Multi-modal representation learning towards visual reasoning

The number of images on the Internet is increasing dramatically. It is becoming critically important to develop technology for precise and automatic understanding of visual content. As image recognition systems become more and more relevant, researchers in artificial intelligence now seek the next generation of vision systems that can perform high-level scene understanding.
In this thesis, we are interested in Visual Question Answering (VQA), which consists of building models that answer any natural language question about any image. Because of its nature and complexity, VQA is often considered a proxy for visual reasoning. Classically, VQA architectures are designed as trainable systems that are provided with images, questions about them and their answers. To tackle this problem, typical approaches involve modern Deep Learning (DL) techniques. In the first part, we focus on developing multi-modal fusion strategies to model the interactions between image and question representations. More specifically, we explore bilinear fusion models and exploit concepts from tensor analysis to provide tractable and expressive factorizations of parameters. These fusion mechanisms are studied under the widely used visual attention framework: the answer to the question is provided by focusing only on the relevant image regions. In the last part, we move away from the attention mechanism and build a more advanced scene understanding architecture where we consider objects and their spatial and semantic relations. All models are thoroughly evaluated experimentally on standard datasets, and the results are competitive with the literature.

Defence : 05/20/2019 - 10h - Campus Pierre et Marie Curie, salle Jacques Pitrat (25-26/105)

Jury members :

M. Jakob Verbeek, INRIA Grenoble [rapporteur]
M. Christian Wolf, INSA de Lyon [rapporteur]
M. Vittorio Ferrari, Google AI - University of Edinburgh
M. Yann LeCun, Facebook - NYU
M. Patrick Pérez, Valeo AI
Mme Laure Soulier, Sorbonne Université - LIP6
M. Nicolas Thome, CNAM - CEDRIC
M. Matthieu Cord, Sorbonne Université - LIP6

2017-2019 Publications