Over the last decade, the evolution of Deep Learning techniques for learning meaningful representations of text and images, combined with a substantial increase in multimodal data, mainly from social networks and e-commerce websites, has triggered growing interest in the research community in the joint understanding of language and vision. The challenge at the heart of Multimodal Machine Learning is the intrinsic semantic difference between language and vision: while vision faithfully represents reality and conveys low-level semantics, language is a human construction that carries high-level reasoning.
On the one hand, language can enhance the performance of vision models. The underlying hypothesis is that textual representations contain visual information. We apply this principle to two Zero-Shot Learning (ZSL) tasks. In our first contribution on ZSL, we extend a common assumption, which states that textual representations encode information about the visual appearance of objects, by showing that they also encode information about objects' visual surroundings and their real-world frequency. In a second contribution, we consider the transductive setting of ZSL. We propose a solution to a limitation of current transductive approaches: they assume that the visual space is well-clustered, which does not hold when the number of unknown classes is high.
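To illustrate the hypothesis that textual representations carry visual information, the following toy sketch shows the standard ZSL decision rule: an image of an unseen class is assigned to the class whose textual embedding is most similar to the image's embedding. All class names, dimensions, and vectors here are hypothetical illustrations, not the thesis's actual model or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical textual embeddings (e.g. word vectors) for unseen classes.
class_names = ["zebra", "whale", "glacier"]
text_embeddings = {name: rng.normal(size=50) for name in class_names}

def l2_normalize(v):
    return v / np.linalg.norm(v)

def zero_shot_predict(image_embedding, text_embeddings):
    """Return the unseen class whose textual embedding has the highest
    cosine similarity with the (projected) image embedding."""
    img = l2_normalize(image_embedding)
    scores = {name: float(img @ l2_normalize(t))
              for name, t in text_embeddings.items()}
    return max(scores, key=scores.get)

# Toy query: an image embedding lying close to the "zebra" text embedding,
# standing in for a visual feature projected into the shared space.
query = text_embeddings["zebra"] + 0.1 * rng.normal(size=50)
print(zero_shot_predict(query, text_embeddings))  # → zebra
```

In practice the image embedding comes from a visual encoder and a learned projection into the joint space; the toy query above simply simulates that outcome.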
On the other hand, vision can expand the capacities of language models. We demonstrate this by tackling Visual Question Generation (VQG), a task that extends standard Question Generation by taking an image as complementary input, using visual representations derived from Computer Vision.
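A minimal sketch of the multimodal conditioning behind VQG: visual features (e.g. from a pre-trained CNN) are fused with textual features before being fed to a question decoder. The dimensions, the concatenation-based fusion, and the toy linear "decoder" below are illustrative assumptions, not the architecture used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

visual_feat = rng.normal(size=2048)   # hypothetical CNN image features
text_feat = rng.normal(size=300)      # hypothetical textual context features

# Fuse modalities by concatenation (one common, simple choice).
fused = np.concatenate([visual_feat, text_feat])

# Toy "decoder": a single linear layer scoring a 5-word question vocabulary,
# standing in for a full sequence decoder.
W = rng.normal(size=(5, fused.shape[0]))
scores = W @ fused
print(scores.shape)  # (5,)
```

Real VQG systems replace the linear layer with a recurrent or Transformer decoder that generates the question token by token, but the conditioning principle is the same: the generator sees both modalities.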
Defence: 11/26/2020 - 09:00 - https://zoom.us/j/97720034876?pwd=TnVUZi9WS3J4ODlYSUh1NEFSblNFUT09
Jury members:
Mr Yannis Avrithis (INRIA Rennes-Bretagne Atlantique) [Rapporteur]
Mr Loic Barrault (University of Sheffield) [Rapporteur]
Mr Patrick Gallinari (LIP6, MLIA)
Mr Benjamin Piwowarski (LIP6, MLIA, CNRS)
Mrs Diane Bouchacourt (FAIR)
Mrs Catherine Pelachaud (ISIR)
- P. Bordes: “Apprentissage Multimodal Profond pour un Raisonnement Textuel et Visuel Joint”, PhD thesis, defended 11/26/2020, supervised by Patrick Gallinari, rapporteur: Benjamin Piwowarski (2020)
- P. Bordes, É. Zablocki, L. Soulier, B. Piwowarski, P. Gallinari: “Incorporating Visual Semantics into Sentence Representations within a Grounded Space”, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 696-707, Association for Computational Linguistics (2019)
- É. Zablocki, P. Bordes, B. Piwowarski, L. Soulier, P. Gallinari : “Context-Aware Zero-Shot Learning for Object Recognition”, Thirty-sixth International Conference on Machine Learning (ICML), Long Beach, CA, United States (2019)
- P. Bordes, É. Zablocki, L. Soulier, B. Piwowarski: “Un modèle multimodal d’apprentissage de représentations de phrases qui préserve la sémantique visuelle”, CORIA 2019, 16th French Information Retrieval Conference, Lyon, France, May 25-29, 2019 (2019)
- É. Zablocki, P. Bordes, L. Soulier, B. Piwowarski, P. Gallinari : “LIP6@CLEF2017: Multi-Modal Spatial Role Labeling using Word Embeddings Working notes”, CLEF 2017 Working Notes, Dublin, Ireland (2017)