22/03/2012

Speaker(s): Mika Sato-Ilic, University of Tsukuba

Recently, the analysis of high-dimension, low-sample-size data, in which the number of variables is much larger than the number of objects, has attracted considerable interest from researchers in many areas, including genomics and other areas of bioinformatics. For this type of data, the curse of dimensionality tends to produce poor classification results. The main cause is noise from irrelevant and redundant variables (dimensions). We therefore need an adaptable variable selection method that reduces or summarizes the variables. For the summarization of variables, correlation-based analysis is well known, and many applications have demonstrated its efficiency. For example, principal component analysis (PCA) is a typical correlation-based analysis that obtains principal components in a lower-dimensional space; in the coordinate system spanned by these components, we can obtain the similarity relationships among objects in that lower-dimensional space. The method can therefore be used for dimension reduction. However, if we apply PCA to high-dimension, low-sample-size data, then mathematically we cannot obtain a solution, because the correlation matrix of the variables is singular.
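The singularity mentioned above can be illustrated with a minimal sketch. It assumes NumPy and synthetic Gaussian data; the sizes (10 objects, 200 variables) and the variable names are illustrative choices, not taken from the talk. The sketch only shows that the correlation matrix of the variables is rank-deficient when variables outnumber objects.

```python
import numpy as np

# Synthetic high-dimension, low-sample-size data (sizes are illustrative assumptions).
rng = np.random.default_rng(0)
n_objects, n_variables = 10, 200
X = rng.normal(size=(n_objects, n_variables))  # rows = objects, columns = variables

# Correlation matrix between variables (n_variables x n_variables).
R = np.corrcoef(X, rowvar=False)

# Centering removes one degree of freedom, so the rank is at most n_objects - 1,
# far below n_variables: the matrix is singular and cannot be inverted.
rank = np.linalg.matrix_rank(R)
print(f"shape = {R.shape}, rank = {rank}, full rank would be {n_variables}")
```

Because the rank is bounded by the number of objects minus one, no amount of additional variables can make this matrix invertible; only more objects, or fewer variables, can.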

In this talk, I will show that including classification structures in variable reduction and summarization methods can overcome these problems. We call these methods cluster harnessing analyses. First, I will talk about a correlation of variables that can measure the similarity between the correlation of variables and the correlation of classification structures. In addition, I will show that this correlation can be derived from the dissimilarity of the data and from the classification structures for two fixed variables, which we call fuzzy self-organized dissimilarity. Second, I will talk about a variable selection criterion that uses the fuzzy clustering result. This criterion shows how well the dissimilarity of variables at each object matches the given classification, which serves as external information about the data. Based on the value of this criterion for each variable, we can select the significant variables. In addition, based on the selected variables, I will show how to obtain data in which the number of objects is larger than the number of variables by exploiting the representation of interval-valued data. Several applications of the proposed cluster harnessing analyses to identifying a classifier for microarray data, a typical example of high-dimension, low-sample-size data, will be demonstrated, along with a new definition of a marker factor in place of the conventional marker gene.

Sahar.Changuel (at) lip6.fr