Gene Selection for Patient Clustering by Gaussian Mixture Model
Md. Hadiul Kabir, Md. Shakil Ahmed, Md. Nurul Haque Mollah
Pages - 34 - 45 | Revised - 31-10-2016 | Published - 01-12-2016
Volume - 10 Issue - 3 | Publication Date - December 2016
KEYWORDS
Gene Expression, Patient Clustering, Gaussian Mixture Model, Inverse Problem of Covariance Matrix, Top DE genes Selection for Patient Clustering.
ABSTRACT
Clustering is a basic component of data analysis and plays a significant role in microarray analysis. Gaussian mixture model (GMM) based clustering is a very popular approach. However, the GMM approach is rarely used for clustering patients/samples from gene expression data, because such datasets usually contain a large number (m) of genes (variables) but only a few (n) samples (observations). Because m > n, the covariance matrix has no inverse, so the GMM parameters cannot be estimated for patient clustering. To overcome this problem, we propose to use a small number q of top DE (differentially expressed) genes between two or more patient classes (with q < n/2 < m), selected proportionally from all DE gene groups. The rationale is that the EE (equally expressed) genes between two or more classes make no significant contribution to minimizing the misclassification error rate (MER). To select the top DE genes, we first cluster genes (instead of patients/samples) by the GMM approach. We then detect DE and EE gene clusters (groups) by our proposed rule, and select the q top DE genes from the different DE gene clusters in proportion to cluster size. Using these q top DE genes overcomes the inverse problem of the covariance matrix in estimating the GMM parameters, and thus makes GMM-based clustering of patients/samples from gene expression data possible. The performance of the proposed method is investigated using both simulated and real gene expression data. The proposed method is observed to improve on the traditional GMM approaches in both situations.
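The proportional selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the cluster labels, and the largest-remainder rounding rule are assumptions made here for illustration, since the abstract only states that the q top DE genes are taken from the DE gene clusters in proportion to cluster size and that q < n/2 < m.

```python
def satisfies_size_constraint(q, n, m):
    """Check the abstract's condition q < n/2 < m using integer arithmetic.

    q: number of selected top DE genes, n: number of samples, m: number of genes.
    (Hypothetical helper; the name is not from the paper.)
    """
    return 2 * q < n and n < 2 * m


def allocate_top_genes(cluster_sizes, q):
    """Split q gene slots across DE gene clusters in proportion to cluster size.

    cluster_sizes maps a DE cluster label to its number of genes.
    Rounding uses the largest-remainder rule, which is an assumption:
    the abstract does not say how fractional shares are rounded.
    """
    total = sum(cluster_sizes.values())
    if not 0 <= q <= total:
        raise ValueError("q must lie between 0 and the total number of DE genes")
    # Integer part of each cluster's proportional share q * size / total.
    alloc = {c: q * size // total for c, size in cluster_sizes.items()}
    # Fractional part of each share, kept as an integer numerator over `total`.
    remainder = {c: q * size % total for c, size in cluster_sizes.items()}
    # Give the slots still unassigned to the clusters with the largest remainders.
    leftover = q - sum(alloc.values())
    ranked = sorted(cluster_sizes, key=lambda c: remainder[c], reverse=True)
    for c in ranked[:leftover]:
        alloc[c] += 1
    return alloc


# Usage with made-up cluster sizes: 100 DE genes in three clusters, q = 10.
print(allocate_top_genes({"up": 60, "down": 30, "mixed": 10}, 10))
# prints {'up': 6, 'down': 3, 'mixed': 1}
```

Within each cluster, the allocated number of genes would then be filled by whatever ranking of DE genes the method uses; that ranking is not specified in the abstract and is left out of this sketch.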
Mr. Md. Hadiul Kabir
University of Rajshahi - Bangladesh
Mr. Md. Shakil Ahmed
University of Rajshahi - Bangladesh
Professor Md. Nurul Haque Mollah
University of Rajshahi - Bangladesh
mollah.stat.bio@ru.ac.bd