Recent advancements in the areas of deep learning and visual information processing have presented an opportunity to unite both fields. These complementary fields combine to tackle the problem of classifying images into their semantic categories. Deep learning brings learning and representational capabilities to a visual processing model that is adapted for image classification. This thesis addresses problems that lead to the proposal of learning deep visual representations for image classification. The problem of deep learning is tackled on two fronts. The first aspect is the problem of unsupervised learning of latent representations from input data. The main focus is the integration of prior knowledge into the learning of restricted Boltzmann machines (RBM) through regularization. Regularizers are proposed to induce sparsity, selectivity and topographic organization in the coding to improve discrimination and invariance. The second direction introduces the notion of gradually transiting from unsupervised layer-wise learning to supervised deep learning. This is done through the integration of bottom-up information with top-down signals. Two novel implementations supporting this notion are explored. The first method uses top-down regularization to train a deep network of RBMs. The second method combines predictive and reconstructive loss functions to optimize a stack of encoder-decoder networks. The proposed deep learning techniques are applied to tackle the image classification problem. The bag-of-words model is adopted due to its strengths in image modeling through the use of local image descriptors and spatial pooling schemes. Deep learning with spatial aggregation is used to learn a hierarchical visual dictionary for encoding the image descriptors into mid-level representations. This method achieves leading image classification performances for object and scene images. The learned dictionaries are diverse and non-redundant. The speed of inference is also high. From this, a further optimization is performed for the subsequent pooling step. This is done by introducing a differentiable pooling parameterization and applying the error backpropagation algorithm. This thesis represents one of the first attempts to synthesize deep learning and the bag-of-words model. This union results in many challenging research problems, leaving much room for further study in this area.
Defence : 07/12/2013 - 14h30 - Site Jussieu 25-26/105 Jury members : Frédéric Jurie - Université de Caen Basse-Normandie [rapporteur]
Alain Rakotomamonjy - Université de Rouen [rapporteur]
Yann LeCun - New York University, USA
Patrick Gallinari - Université Pierre et Marie Curie
Joo-Hwee Lim - Institute for Infocomm Research, Singapore
Matthieu Cord - Université Pierre et Marie Curie
Nicolas Thome - Université Pierre et Marie Curie