Relatore : Bernadette BOUCHON-MEUNIER
Co-relazione : LABROCHE Nicolas
Clustering is a central task of the data mining and knowledge discovery process. Nowadays, the abundance of data and the increasing size of datasets require clustering algorithms to improve and to adapt the following aspects: quality, speed, and ability to large datasets. For these reasons, the domain of clustering is extremely active.
Semi-supervised clustering has become a very interesting area of research in the last ten years whose goal is to develop clustering algorithms that allow to incorporate domain knowledge from a human expert to improve clustering performance. This knowledge can be expressed as a set of labeled data (the seeds) or a set of constraints. In the latter case, there are two main types of constraints: must-link (ML) which indicate that two points of the dataset must be in the same cluster and cannot-link (CL), which conversely require that two points belong to two different clusters. The current works in the field are particularly interested in the adaptation of existing clustering methods for the management of constraints or labeled data. However, these methods generally bear the same limitations as the method they are inspired from. Furthermore, these approach often rely on a random selection of expert knowledge (constraints, seeds) that may lead to poor performance.
To address these problems, this PhD thesis focuses on two main contributions: (1) new intelligent methods for the selection of constraints or labeled data (seeds) integrated to active learning algorithms (2) new semi-supervised clustering algorithms that improve the methods described in the literature.
In the context of intelligent collection of constraints, we propose a new utility measure of a constraint based on a k-nearest neighbors graph to identify transition zones between clusters where clustering algorithms traditionally make mistakes. This measure forms the basis of our active constraint selection algorithm that has been validated on data sets from the UCI Machine Learning Repository and in the context of a prototype application for image databases. Similarly, we propose three new methods for selecting data to be labeled that were also evaluated on real data and through a prototype application on image databases.
Finally, this thesis describes two new semi-supervised clustering algorithms: SSGC based on seeds and MCLA based on constraints that have smaller complexities, an easier parameters setting and performance comparable or better than the reference algorithms in semi-supervised clustering.
Keywords: clustering algorithm, semi-supervised clustering , active learning, seeds, constraints, k-nearest neighbors graph.
Difesa : 07/05/2011 - 14h - Site Jussieu - Salle Jean-Louis Laurière - 25-26/101
Pascale Kuntz, PR. Polytech'Nantes, Rapporteur
Eric Gaussier, PR. Univ. Grenoble I, Rapporteur
Céline Robardet, MCF INSA Lyon, Examinateur
Matthieu Cord, PR. UPMC, Examinateur
Bernadette Bouchon-Meunier, DR CNRS, Examinateur
Nicolas Labroche, MCF UPMC, Examinateur