Supervision : Patrick GALLINARI, Benjamin PIWOWARSKI, Laure SOULIER
Multimodal machine learning: complementarity of textual and visual contexts
Despite growing interest, research on the interaction between language and vision remains relatively underexplored. Beyond trivial differences between texts and images, these two modalities have non-overlapping semantics. On the one hand, language can express high-level semantics about the world, but it is biased in the sense that a large portion of its content is left implicit (common-sense or implicit knowledge). On the other hand, images are aggregates of lower-level information, but they can depict a more direct view of real-world statistics and can be used to ground the meaning of objects. In this thesis, we exploit connections and leverage complementarity between language and vision.
First, natural language understanding capabilities can be augmented with the help of the visual modality, as language is known to be grounded in the visual world. In particular, representing language semantics is a long-standing problem for the natural language processing community, and leveraging visual information is crucial to improve on traditional approaches towards that goal. We show that linguistic semantic representations can be enriched by visual information, focusing especially on visual contexts and the spatial organization of scenes. We present two models that learn grounded semantic representations of words and of sentences, respectively, with the help of images.
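As a purely illustrative sketch (not the thesis's actual models), one simple way to ground a textual embedding in visual information is to fuse it with a visual-context embedding, e.g. by weighted concatenation of the normalized vectors; the toy word vectors below are invented for the example.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length (leave zero vectors untouched)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def fuse(text_vec, visual_vec, alpha=0.5):
    """Hypothetical fusion scheme: concatenate the normalized textual
    and visual embeddings, weighted by a mixing coefficient alpha."""
    return np.concatenate([alpha * l2_normalize(text_vec),
                           (1 - alpha) * l2_normalize(visual_vec)])

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy (made-up) textual and visual-context embeddings for three words.
text = {"cat": np.array([1.0, 0.2]),
        "dog": np.array([0.9, 0.3]),
        "car": np.array([0.1, 1.0])}
visual = {"cat": np.array([1.0, 0.0]),
          "dog": np.array([0.95, 0.1]),
          "car": np.array([0.0, 1.0])}

fused = {w: fuse(text[w], visual[w]) for w in text}
sim_cat_dog = cosine(fused["cat"], fused["dog"])
sim_cat_car = cosine(fused["cat"], fused["car"])
```

With these invented vectors, the fused representation keeps "cat" closer to "dog" than to "car", since both modalities agree on that structure.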
Conversely, integrating language with vision brings the possibility of expanding the horizons and tasks of the vision community. Assuming that language contains visual information about objects, and that this information can be captured within linguistic semantic representations, we focus on the zero-shot object recognition task, which consists in recognizing objects that have never been seen, thanks to linguistic knowledge about these objects acquired beforehand. In particular, we argue that linguistic representations contain information not only about the visual appearance of objects but also about their typical visual surroundings and their visual occurrence frequencies. We thus present a model for zero-shot recognition that leverages the visual context of an object and its visual occurrence likelihood, in addition to the region of interest used by traditional approaches.
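To make the idea concrete, here is a minimal, hypothetical scoring rule (not the thesis's actual model): an unseen class is scored by combining the appearance match between its semantic embedding and the region of interest, the match with the surrounding visual context, and a log-prior over how likely the class is to occur; all embeddings, weights, and priors below are invented for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score(class_emb, region_feat, context_feat, prior,
          w_region=1.0, w_context=0.5):
    """Hypothetical zero-shot score: appearance match + context match
    + log occurrence prior, with hand-picked mixing weights."""
    return (w_region * cosine(class_emb, region_feat)
            + w_context * cosine(class_emb, context_feat)
            + np.log(prior))

def predict(class_embs, region_feat, context_feat, priors):
    """Return the unseen class with the highest combined score."""
    return max(class_embs,
               key=lambda c: score(class_embs[c], region_feat,
                                   context_feat, priors[c]))

# Toy (made-up) semantic embeddings for two unseen classes.
class_embs = {"zebra": np.array([1.0, 0.0, 0.2]),
              "whale": np.array([0.0, 1.0, 0.1])}
# Invented features for a region depicting a zebra in a savanna scene,
# and invented occurrence priors for that scene type.
region_feat = np.array([0.9, 0.1, 0.2])
context_feat = np.array([0.8, 0.0, 0.3])
priors = {"zebra": 0.3, "whale": 0.05}

prediction = predict(class_embs, region_feat, context_feat, priors)
```

Here both the context term and the prior reinforce the appearance term: a savanna-like context and a higher occurrence prior push "zebra" further ahead of "whale" than appearance alone would.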
Finally, we present prospective research directions to further exploit connections between language and images and to better understand the semantic gap between the two modalities.
Defence : 10/14/2019 - 10h30 - Campus Pierre et Marie Curie, salle Jacques Pitrat (25-26/105)
Jury members :
M Guillaume Gravier, IRISA [Rapporteur]
Mme Marie-Francine Moens, KU Leuven [Rapporteur]
M Antoine Bordes, Facebook [Examinateur]
M Patrick Gallinari, Sorbonne Université LIP6 / Criteo
M Benjamin Piwowarski, Sorbonne Université LIP6
Mme Laure Soulier, Sorbonne Université LIP6
M Xavier Tannier, Sorbonne Université LIMICS
- É. Zablocki : “Multimodal machine learning: complementarity of textual and visual contexts”, PhD thesis, supervised by P. Gallinari, B. Piwowarski, L. Soulier, defended 10/14/2019 (2019)
- P. Bordes, É. Zablocki, L. Soulier, B. Piwowarski, P. Gallinari : “Incorporating Visual Semantics into Sentence Representations within a Grounded Space”, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 696-707, Association for Computational Linguistics (2019)
- É. Zablocki, P. Bordes, B. Piwowarski, L. Soulier, P. Gallinari : “Context-Aware Zero-Shot Learning for Object Recognition”, Thirty-sixth International Conference on Machine Learning (ICML), Long Beach, CA, United States (2019)
- P. Bordes, É. Zablocki, L. Soulier, B. Piwowarski : “Un modèle multimodal d’apprentissage de représentations de phrases qui préserve la sémantique visuelle”, COnférence en Recherche d'Informations et Applications (CORIA 2019), 16th French Information Retrieval Conference, Lyon, France, May 25-29, 2019 (2019)
- É. Zablocki, B. Piwowarski, L. Soulier, P. Gallinari : “Apprentissage multimodal de représentation de mots à l’aide de contexte visuel”, Conférence sur l'Apprentissage Automatique, Rouen, France (2018)
- É. Zablocki, B. Piwowarski, L. Soulier, P. Gallinari : “Learning Multi-Modal Word Representation Grounded in Visual Context”, Association for the Advancement of Artificial Intelligence (AAAI), New Orleans, United States (2018)
- É. Zablocki, P. Bordes, L. Soulier, B. Piwowarski, P. Gallinari : “LIP6@CLEF2017: Multi-Modal Spatial Role Labeling using Word Embeddings Working notes”, CLEF 2017 Working Notes, Dublin, Ireland (2017)