Team : MLIA
Location : Campus Pierre et Marie Curie, Sorbonne Université - LIP6, Boîte courrier 169, Couloir 26-00, Étage 5, Bureau 511, 4 place Jussieu, 75252 PARIS CEDEX 05, FRANCE. Tel: +33 1 44 27 87 33, Patrick.Bordes (at) lip6.fr
Supervision : Patrick GALLINARI. Co-supervision : Benjamin PIWOWARSKI
Deep Multimodal Learning for Joint Textual and Visual Reasoning
In the last decade, the evolution of Deep Learning techniques for learning meaningful representations of text and images, combined with a sharp increase in multimodal data, mainly from social networks and e-commerce websites, has triggered growing interest in the research community in the joint understanding of language and vision. The challenge at the heart of Multimodal Machine Learning is the intrinsic semantic difference between language and vision: while vision faithfully represents reality and conveys low-level semantics, language is a human construction carrying high-level reasoning.

On the one hand, language can enhance the performance of vision models. The underlying hypothesis is that textual representations contain visual information. We apply this principle to two Zero-Shot Learning (ZSL) tasks. In a first contribution, we extend a common assumption in ZSL, which states that textual representations encode information about the visual appearance of objects, by showing that they also encode information about objects' visual surroundings and their real-world frequency. In a second contribution, we consider the transductive setting of ZSL. We propose a solution to a limitation of current transductive approaches, which assume that the visual space is well clustered; this assumption does not hold when the number of unknown classes is high.

On the other hand, vision can expand the capacities of language models. We demonstrate this by tackling Visual Question Generation (VQG), which extends the standard Question Generation task with an image as complementary input, using visual representations derived from Computer Vision.
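To make the ZSL principle above concrete, here is a minimal, purely illustrative sketch (not the method developed in this thesis): an image feature, assumed to be already projected into a shared space with class text embeddings, is assigned to the unseen class whose embedding is most similar under cosine similarity. All class names, vectors, and the function name are invented for the example.

```python
import math

def zero_shot_classify(image_emb, class_text_embs):
    """Assign an image to the unseen class whose text embedding is
    closest (cosine similarity) in the shared embedding space."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    return max(class_text_embs, key=lambda c: cos(image_emb, class_text_embs[c]))

# Toy 3-d embeddings, invented for illustration only.
classes = {
    "zebra": [1.0, 0.0, 0.2],
    "whale": [0.0, 1.0, 0.1],
}
img = [0.9, 0.1, 0.3]  # image feature projected into the text space
print(zero_shot_classify(img, classes))  # → zebra
```

The key point is that no "zebra" image was needed at training time: the class is recognized solely through the visual information carried by its textual representation.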
P. Bordes, É. Zablocki, L. Soulier, B. Piwowarski, P. Gallinari: “Incorporating Visual Semantics into Sentence Representations within a Grounded Space”, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 696-707, Association for Computational Linguistics (2019)