PhD graduate
Team : MLIA
Departure date : 10/31/2019

Supervision : Patrick GALLINARI, Benjamin PIWOWARSKI, Laure SOULIER

Multimodal machine learning: complementarity of textual and visual contexts

Despite growing interest, research on the interaction between language and vision remains relatively underexplored. Beyond trivial differences between texts and images, these two modalities have non-overlapping semantics. On the one hand, language can express high-level semantics about the world, but it is biased in the sense that a large portion of its content is left implicit (common-sense knowledge). On the other hand, images are aggregates of lower-level information, but they give a more direct view of real-world statistics and can be used to ground the meaning of objects. In this thesis, we exploit the connections between language and vision and leverage their complementarity.
First, natural language understanding can be augmented with the help of the visual modality, as language is known to be grounded in the visual world. In particular, representing language semantics is a long-standing problem for the natural language processing community, and leveraging visual information is crucial to improving on traditional approaches. We show that semantic linguistic representations can be enriched by visual information, focusing especially on visual contexts and the spatial organization of scenes. We present two models that learn grounded semantic representations of words and of sentences, respectively, with the help of images.
Conversely, integrating language with vision makes it possible to expand the horizons and tasks of the vision community. Assuming that language contains visual information about objects, and that this information can be captured within linguistic semantic representations, we focus on the zero-shot object recognition task: recognizing objects that have never been seen, using linguistic knowledge acquired about them beforehand. In particular, we argue that linguistic representations contain information not only about the visual appearance of objects but also about their typical visual surroundings and visual occurrence frequencies. We thus present a model for zero-shot recognition that leverages the visual context of an object and its visual occurrence likelihood, in addition to the region of interest used in traditional approaches.
Finally, we present prospective research directions to further exploit connections between language and images and to better understand the semantic gap between the two modalities.

Defence : 10/14/2019

Jury members :

Mr Guillaume Gravier, IRISA [Reviewer]
Ms Marie-Francine Moens, KU Leuven [Reviewer]
Mr Antoine Bordes, Facebook [Examiner]
Mr Patrick Gallinari, Sorbonne Université LIP6 / Criteo
Mr Benjamin Piwowarski, Sorbonne Université LIP6
Ms Laure Soulier, Sorbonne Université LIP6
Mr Xavier Tannier, Sorbonne Université LIMICS

2017-2019 Publications
