Supervision: Hichem SAHBI
Nowadays, video content is ubiquitous thanks to the widespread use of the Internet, smartphones, and social media. Many everyday applications, such as video surveillance, captioning, and scene understanding, require sophisticated technologies that automatically analyze and interpret the large amounts of available video data. In this thesis, we are interested in video action recognition, i.e., the problem of assigning action categories to video sequences. This can be seen as a key ingredient in building the next generation of vision systems. It is tackled with Artificial Intelligence frameworks, mainly Machine Learning and Deep Convolutional Neural Networks (ConvNets).
Current ConvNets are increasingly deep and data-hungry, which makes their success contingent on the abundance of labeled training data. ConvNets also rely on (max or average) pooling, which reduces the dimensionality of output layers (and hence attenuates their sensitivity to the availability of labeled data); however, this process may dilute the information of upstream convolutional layers and thereby weaken the discrimination power of the trained video representations, especially when the learned action categories are fine-grained. In the first part of this thesis, we introduce a hierarchical aggregation design based on tree-structured temporal pyramids, used for final pooling, that controls the granularity of the learned representations w.r.t. the actual granularity of action categories. Moreover, ConvNets are fundamentally designed to handle vectorial data (such as still images), and their extension to non-vectorial and semi-structured data (namely graphs with variable sizes, topologies, etc.) remains a major challenge. In the second part of this thesis, we introduce a Graph Convolutional Network model based on a spectral decomposition of graph Laplacians. It learns graph Laplacians as convex combinations of elementary Laplacians, each dedicated to a particular topology of the input graphs. We then introduce a pooling operator on graphs that achieves permutation invariance. All models are thoroughly evaluated on standard datasets, and the results are competitive w.r.t. the literature.
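The two graph-level ideas above (a learned convex combination of elementary Laplacians, followed by permutation-invariant pooling) can be illustrated with a minimal sketch. This is not the thesis implementation: the graph topologies, mixing logits, and feature dimensions below are hypothetical, and the convex weights would be learned end-to-end in practice rather than fixed.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def laplacian(A):
    """Unnormalized graph Laplacian L = D - A."""
    return np.diag(A.sum(axis=1)) - A

# Three hypothetical "elementary" topologies on the same 4 nodes.
A_chain = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], dtype=float)
A_ring  = A_chain.copy(); A_ring[0,3] = A_ring[3,0] = 1.0
A_full  = np.ones((4,4)) - np.eye(4)

laplacians = [laplacian(A) for A in (A_chain, A_ring, A_full)]

# Convex combination: softmax of (in practice, learnable) logits
# guarantees non-negative weights that sum to one.
alpha = softmax(np.array([0.2, -0.1, 0.5]))
L = sum(a * Lk for a, Lk in zip(alpha, laplacians))

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))    # node features
W = rng.standard_normal((8, 16))   # layer weights
H = np.maximum(L @ X @ W, 0.0)     # one spectral graph-convolution layer + ReLU

# Permutation-invariant pooling: mean over the node dimension.
g = H.mean(axis=0)

# Invariance check: relabeling nodes (L -> P L P^T, X -> P X)
# leaves the pooled representation unchanged.
P = np.eye(4)[[2, 0, 3, 1]]
H_perm = np.maximum((P @ L @ P.T) @ (P @ X) @ W, 0.0)
assert np.allclose(H_perm.mean(axis=0), g)
```

The invariance follows because P L Pᵀ · P X = P (L X), so the permutation only reorders rows of H, which a mean over nodes ignores.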
Keywords: Deep Video Representations, Multiple Aggregation Learning, Hierarchical Pooling, Graph Construction, Graph Pooling and Convolution, Geometric Deep Learning
1- Ahmed Mazari, Hichem Sahbi. Deep Multiple Aggregation Networks for Action Recognition. Under review at the Pattern Recognition journal, 2020
2- Ahmed Mazari, Hichem Sahbi. Coarse-to-Fine Aggregation for Cross-Granularity Action Recognition. Accepted at ICIP 2020 (to appear)
3- Ahmed Mazari, Hichem Sahbi. MLGCN: Multi-Laplacian Graph Convolutional Networks for Human Action Recognition. In the 30th British Machine Vision Conference (BMVC), 2019
4- Ahmed Mazari, Hichem Sahbi. Human Action Recognition with Deep Temporal Pyramids. In the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019
Defence: 09/22/2020, 10:00 AM - Campus Jussieu, room Jacques Pitrat (25-26/105)
Mr. Frédéric Dufaux, Director of Research at CNRS, CentraleSupélec, Université Paris-Saclay, Thesis Reviewer
Mr. Hichem Snoussi, Professor at Université de Technologie de Troyes, Thesis Reviewer
Ms. Catherine Achard, Senior Lecturer (HDR) at Sorbonne Université - ISIR, Examiner
Mr. Michel Crucianu, Professor at CNAM, Paris, Examiner
Ms. Nicole Vincent, Professor at Université de Paris, Examiner
Mr. Hichem Sahbi, Researcher at CNRS (HDR), Sorbonne Université - LIP6, Thesis Director