Gene Selection for Patient Clustering by Gaussian Mixture Model
Md. Hadiul Kabir, Md. Shakil Ahmed, Md. Nurul Haque Mollah
Pages - 34 - 45     |    Revised - 31-10-2016     |    Published - 01-12-2016
Volume - 10   Issue - 3    |    Publication Date - December 2016
Gene Expression, Patient Clustering, Gaussian Mixture Model, Inverse Problem of Covariance Matrix, Top DE genes Selection for Patient Clustering.
Clustering is a basic component of data analysis and plays a significant role in microarray analysis. Gaussian mixture model (GMM) based clustering is a very popular approach. However, the GMM approach is less popular for clustering patients/samples from gene expression data, because such datasets usually contain a large number (m) of genes (variables) but only a few (n) samples (observations). Consequently, the GMM parameters cannot be estimated for patient clustering, because the inverse of the covariance matrix does not exist when m > n. To overcome this problem, we propose to use a small number q of top DE (differentially expressed) genes (i.e., q < n/2 < m) between two or more patient classes, selected proportionally from all DE gene groups. The rationale is that EE (equally expressed) genes between two or more classes make no significant contribution to minimizing the misclassification error rate (MER). To select the top DE genes, we first cluster genes (instead of patients/samples) by the GMM approach. We then identify DE and EE gene clusters (groups) by our proposed rule, and select q top DE genes from the different DE gene clusters in proportion to cluster size. Using these q top DE genes overcomes the covariance matrix inversion problem in estimating the GMM parameters, and thus makes patient/sample clustering of gene expression data possible. The performance of the proposed method is investigated using both simulated and real gene expression data. The proposed method is observed to improve on traditional GMM approaches in both settings.
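The proportional selection step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the shares are rounded or how genes are ranked within a cluster, so the largest-remainder rounding, the per-gene `scores` input, and the function names `allocate_quota` and `select_top_de_genes` are all assumptions made here for illustration. The caller is expected to pick q so that q < n/2, as the abstract requires.

```python
from collections import defaultdict


def allocate_quota(cluster_sizes, q):
    """Split q selections across clusters in proportion to cluster size.

    Assumption: largest-remainder rounding; the abstract only says
    "proportional to cluster size" and gives no rounding rule.
    """
    total = sum(cluster_sizes.values())
    if not 0 < q <= total:
        raise ValueError("q must be between 1 and the number of DE genes")
    # Exact (possibly fractional) share of q for every cluster.
    exact = {c: q * size / total for c, size in cluster_sizes.items()}
    # Start from the integer part of each share.
    quota = {c: int(share) for c, share in exact.items()}
    leftover = q - sum(quota.values())
    # Give the remaining slots to the clusters with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for c in by_remainder[:leftover]:
        quota[c] += 1
    return quota


def select_top_de_genes(scores, labels, de_clusters, q):
    """Pick q genes from the DE clusters, proportionally to cluster size.

    scores      : dict gene -> DE score, higher = more differentially
                  expressed (hypothetical ranking criterion)
    labels      : dict gene -> cluster id from the gene-clustering step
    de_clusters : set of cluster ids that were flagged as DE
    """
    members = defaultdict(list)
    for gene, cluster in labels.items():
        if cluster in de_clusters:  # EE clusters are ignored entirely
            members[cluster].append(gene)
    quota = allocate_quota({c: len(g) for c, g in members.items()}, q)
    selected = []
    for c in sorted(members):
        ranked = sorted(members[c], key=lambda g: scores[g], reverse=True)
        selected.extend(ranked[:quota[c]])
    return selected
```

With the selected q genes, the patient-level data matrix has fewer variables than samples, which is the condition the abstract relies on for the covariance matrix to be invertible.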
