KIM Young-Min
Supervision : Patrick GALLINARI
Document Clustering in a Learned Concept Space
Document clustering is a fundamental technique for unsupervised learning from unstructured textual data, and it yields real efficiency gains for various information retrieval (IR) tasks. Clustering results are used not only as basic information about the structure of a collection, but also as a preliminary step for other IR applications. Probabilistic models, in turn, provide a useful framework for data analysis in unsupervised learning. They can be used either as dimensionality reduction techniques, which provide a compact representation of a collection, or as clustering techniques. Among these models, topic models in particular have developed rapidly and become popular tools.
In this thesis, we aim to develop effective clustering techniques that find meaningful reduced spaces in which document clustering can be performed more efficiently than in the initial bag-of-words space. To this end, we develop four clustering approaches for text collections using probabilistic models, and more precisely topic models. In particular, we integrate the dimensionality reduction induced by latent variables, which form a concept space, and perform clustering in that space. Our experimental results confirm that these approaches are successful in terms of clustering accuracy on several data collections.
This thesis is structured in two parts. The first part presents the state of the art in clustering and probabilistic models, and the second part presents our contributions. We first develop a two-stage clustering method that uses a concept space. Inspired by its success, we then develop three clustering approaches based on probabilistic latent semantic analysis (PLSA). The Ext-PLSA model extends the previous approach by combining the two stages into a single process. The CS-PLSA algorithm allows effective model selection for clustering. Finally, voted-PLSA provides a successful multi-view clustering procedure on a multilingual collection.
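To make the two-stage idea concrete, here is a minimal sketch, not the algorithm of the thesis. It assumes that words are first grouped into concepts by clustering their document profiles, and that documents are then re-expressed as concept counts and clustered in that reduced space. The helper names (`kmeans`, `two_stage_clustering`), the use of plain k-means with a deterministic farthest-first initialisation, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=20):
    """Plain Lloyd's k-means with farthest-first initialisation (illustrative)."""
    X = np.asarray(X, dtype=float)
    # Deterministic init: start from row 0, then repeatedly add the row
    # that is farthest from the centers chosen so far.
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def two_stage_clustering(doc_term, n_concepts, n_clusters):
    """Stage 1: words -> concepts. Stage 2: cluster documents in concept space."""
    doc_term = np.asarray(doc_term, dtype=float)
    # Stage 1: each word is described by its column of the document-term matrix.
    word_to_concept = kmeans(doc_term.T, n_concepts)
    # One-hot word/concept membership, shape (n_words, n_concepts).
    membership = np.eye(n_concepts)[word_to_concept]
    # Stage 2: a document's concept count is the sum of its word counts per concept.
    doc_concept = doc_term @ membership
    doc_labels = kmeans(doc_concept, n_clusters)
    return word_to_concept, doc_concept, doc_labels

# Toy bag-of-words matrix (rows: documents, columns: words), made up for this sketch.
doc_term = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 2, 1, 1],
])
word_to_concept, doc_concept, doc_labels = two_stage_clustering(doc_term, 2, 2)
print(word_to_concept)
print(doc_concept)
print(doc_labels)
```

On this toy matrix the six-dimensional bag-of-words space is reduced to two concept dimensions before the documents are clustered. Real collections would typically need term weighting and normalisation, which this sketch leaves out.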
Defence : 12/16/2010
Jury members :
M. Bernd AMANN (Université Pierre et Marie Curie / Laboratoire LIP6)
M. Massih-Reza AMINI (Université Pierre et Marie Curie / Laboratoire LIP6) [Thesis supervisor]
M. Patrice BELLOT (Université d’Avignon / Laboratoire LIA-CERI)
M. Patrick GALLINARI (Université Pierre et Marie Curie / Laboratoire LIP6) [Thesis supervisor]
M. Eric GAUSSIER (Université Joseph Fourier / Laboratoire LIG) [Reviewer]
M. Pascal PONCELET (Ecole des Mines d’Alès / Laboratoire LGI2P) [Reviewer]
2008-2010 Publications
2010
- Y.‑M. Kim : “Apprentissage d’Espaces de Concepts pour le Partitionnement Non-Supervisé de Documents Textuels”, PhD thesis, defence 12/16/2010, supervision Gallinari, Patrick (2010)
- Y.‑M. Kim, M.‑R. Amini, C. Goutte, P. Gallinari : “Multiview Clustering of Multilingual Documents”, Proceedings of the 33rd Annual ACM SIGIR Conference (SIGIR 2010), Geneva, Switzerland, pp. 812-822, (ACM) (2010)
- J.‑F. Pessiot, Y.‑M. Kim, M.‑R. Amini, P. Gallinari : “Improving Document Clustering in a Learned Concept Space”, Information Processing and Management, vol. 46 (2), pp. 180-192, (Elsevier) (2010)
- Y.‑M. Kim, J.‑F. Pessiot, M.‑R. Amini, P. Gallinari : “Apprentissage d’un Espace de Concepts de Mots pour une Nouvelle Représentation des Données Textuelles”, Document numérique - Revue des sciences et technologies de l'information. Série Document numérique, vol. 13 (1), pp. 63-82, (Hermès) (2010)
2009
- Y.‑M. Kim, J.‑F. Pessiot, M.‑R. Amini, P. Gallinari : “Une extension du modèle sémantique latent probabiliste pour le partitionnement non-supervisé de documents textuels”, Conférence d'apprentissage, CAP 2009, Hammamet, Tunisia (2009)
2008
- Y.‑M. Kim, J.‑F. Pessiot, M.‑R. Amini, P. Gallinari : “An Extension of PLSA for Document Clustering”, 17th ACM Conference on Information and Knowledge Management (CIKM 2008), Napa Valley, CA, United States, pp. 1345-1346, (ACM) (2008)
- Y.‑M. Kim, J.‑F. Pessiot, M.‑R. Amini, P. Gallinari : “Apprentissage d’un espace de concepts de mots pour une nouvelle représentation des données textuelles”, COnférence en Recherche d'Information et Applications (CORIA 2008), Trégastel, France, pp. 119-134 (2008)
- J.‑F. Pessiot, Y.‑M. Kim, M.‑R. Amini, N. Usunier, P. Gallinari : “Une méthode contextuelle d’extension de requête avec des groupements de mots pour le résumé automatique”, Conference en Recherche d'information et Applications, CORIA 2008, Trégastel, France, pp. 289-304 (2008)