Team: MLIA
Arrival date: 10/01/2016
Location: Campus Pierre et Marie Curie, Sorbonne Université - LIP6, Boîte courrier 169, Corridor 26-00, Floor 5, Office 525, 4 place Jussieu, 75252 PARIS CEDEX 05, FRANCE
Tel: +33 1 44 27 51 29, Remi.Cadene (at) lip6.fr
Supervision: Matthieu CORD
Co-supervision: Nicolas THOME
Deep multimodal learning for vision and language processing
Digital technologies have become instrumental in transforming our society. Recent statistical methods have been successfully deployed to automate the processing of the growing amount of images, videos, and texts we produce daily. In particular, deep neural networks have been adopted by the computer vision and natural language processing communities for their ability to perform accurate image recognition and text understanding once trained on large datasets. Advances in both communities have laid the groundwork for new research problems at the intersection of vision and language. Integrating language into visual recognition could have an important impact on human life through the creation of real-world applications such as next-generation search engines or AI assistants.

In the first part of this thesis, we focus on systems for cross-modal text-image retrieval. We propose a learning strategy to efficiently align both modalities while structuring the retrieval space with semantic information.

In the second part, we focus on systems able to answer questions about an image. We propose a multimodal architecture that iteratively fuses the visual and textual modalities using a factorized bilinear model while modeling pairwise relationships between image regions.

In the last part, we address issues related to biases in the modeling. We propose a learning strategy to reduce the language biases commonly present in visual question answering systems.
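To illustrate the factorized bilinear fusion idea mentioned above, here is a minimal numpy sketch. It is not the thesis architecture: the dimensions, weight matrices, and function name are illustrative assumptions. A full bilinear interaction between a question vector and a visual vector would require a three-way weight tensor; the factorization replaces it with three small projection matrices and an elementwise product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the thesis): question embedding
# size d_q, visual embedding size d_v, fused rank k, output size d_out.
d_q, d_v, k, d_out = 8, 10, 6, 5

# A full bilinear model would need a d_q x d_v x d_out tensor (400 weights
# here). The factorized variant uses three small matrices instead.
W_q = rng.standard_normal((d_q, k)) * 0.1    # projects the question
W_v = rng.standard_normal((d_v, k)) * 0.1    # projects the visual features
W_o = rng.standard_normal((k, d_out)) * 0.1  # projects the fused vector

def factorized_bilinear_fusion(q, v):
    """Fuse a question vector q and a visual vector v.

    The elementwise product of the two projections captures multiplicative
    (bilinear) interactions between modalities at a fraction of the
    parameter cost of a full bilinear tensor.
    """
    return ((q @ W_q) * (v @ W_v)) @ W_o

q = rng.standard_normal(d_q)  # toy question embedding
v = rng.standard_normal(d_v)  # toy image-region embedding
z = factorized_bilinear_fusion(q, v)
print(z.shape)  # (5,)
```

Note the multiplicative coupling: if either modality projects to zero, the fused representation is zero, which is what distinguishes bilinear fusion from simple concatenation or addition of the two embeddings.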