Data Mining Techniques in Bioinformatics Analysis

Authors

  • C. Kondal Raj, R. Murugesan

Keywords:

Data Mining, Bioinformatics, Microarray Datasets, k-means

Abstract

Microarray experiments yield vast datasets containing expression data for thousands of genes across a limited number of samples, usually no more than a few dozen. A major challenge is identifying groups of genes that are co-regulated and collectively show strong associations with specific outcome variables. To tackle this challenge, we suggest using k-means clustering algorithms, which leverage external information about response variables to group genes effectively. We propose an algorithm based on logistic regression analysis that integrates gene selection, supervision, gene clustering, and sample classification into a single streamlined process. Through empirical studies on diverse microarray datasets, we demonstrate its ability to pinpoint gene clusters whose expression centroids exhibit robust predictive potential, surpassing conventional methods focused on individual gene analysis. This approach not only promises advancements in medical diagnostics and prognostics but also enhances functional genomics by offering insights into gene function and regulation.
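The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea: group genes with k-means, then use the cluster centroids as predictors in a logistic regression. It assumes plain (unsupervised) k-means with a deterministic farthest-point initialisation, a ridge-penalised logistic regression fitted by gradient descent, and synthetic data. The function names `kmeans` and `fit_logistic`, the parameter values, and the data are all hypothetical and are not taken from the paper.

```python
import numpy as np

def kmeans(X, k, n_iter=25):
    """Lloyd's k-means on the rows of X with farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        # squared distance of each row to its nearest already-chosen center
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster is empty
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def fit_logistic(Z, y, lam=0.1, lr=0.1, n_iter=2000):
    """Ridge-penalised logistic regression fitted by gradient descent."""
    n, k = Z.shape
    w, b = np.zeros(k), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
        w -= lr * (Z.T @ (p - y) / n + lam * w)  # penalty on weights only
        b -= lr * (p - y).mean()
    return w, b

# Synthetic data (an assumption): 30 genes x 20 samples, two classes.
rng = np.random.default_rng(0)
n_genes, n_samples = 30, 20
y = np.repeat([0, 1], n_samples // 2)
X = rng.normal(0.0, 0.5, size=(n_genes, n_samples))
X[:10, y == 1] += 2.0  # the first 10 genes are up-regulated in class 1

# Cluster genes (rows); each center is a centroid expression profile.
labels, centers = kmeans(X, k=2)
centroids = centers.T  # shape (n_samples, k): one feature per gene cluster

# Classify samples from the cluster centroids.
w, b = fit_logistic(centroids, y)
pred = (1.0 / (1.0 + np.exp(-(centroids @ w + b))) > 0.5).astype(int)
acc = (pred == y).mean()
print("gene cluster labels:", labels)
print("training accuracy:", acc)
```

The accuracy printed here is measured on the training samples only and is therefore optimistic; with only a few dozen samples, a cross-validated estimate would be the more honest figure.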

Published

06.08.2024

How to Cite

C. Kondal Raj. (2024). Data Mining Techniques in Bioinformatics Analysis. International Journal of Intelligent Systems and Applications in Engineering, 12(23s), 334 –. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/6790

Section

Research Article