Impact of Data Pre-Processing on Covid-19 Diagnosis Using Machine Learning Algorithms


  • Dina A. Salem Computer Engineering Department, MUST University, Giza, Egypt
  • Esraa M. Hashim Biomedical Engineering Department, MUST University, Giza, Egypt


COVID-19, Deep Learning, Machine Learning, K-Nearest Neighbours, Support Vector Machine


Human coronaviruses present a significant disease burden, and identifying infected patients using artificial intelligence has drawn the attention of researchers worldwide. Blood tests are a promising input that can contribute significantly to a reliable, accurate, and fast automated tool for COVID-19 diagnosis. Medical datasets are known to suffer from several data problems, mainly class imbalance, missing values, and amplitude variations, and classifier performance cannot be assessed correctly until these problems are handled. This paper therefore proposes multiple solutions that combine several data pre-processing techniques with three widely used classifiers: Deep Learning (DL), K-Nearest Neighbors (KNN), and Support Vector Machines (SVM). After detailed dataset treatment, all three classifiers achieved good performance against the gold standard, with SVM scoring the highest accuracy and sensitivity, 86% and 95% respectively. The study showed the clinical soundness and feasibility of using blood test analysis and machine learning as a replacement for rRT-PCR in detecting COVID-19-positive cases.
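The three data problems named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which pre-processing techniques the authors applied, so the choices here (column-mean imputation, min-max scaling, random oversampling of the minority class) are assumptions made for illustration only, as are the function names, the two unnamed features, and the toy values. A small K-Nearest Neighbors vote and an accuracy/sensitivity helper are included to show how the treated data might be used and scored.

```python
import random
from collections import Counter


def impute_mean(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def min_max_scale(rows):
    """Rescale every column to [0, 1] so features with large amplitudes do not dominate."""
    n_cols = len(rows[0])
    lo = [min(r[j] for r in rows) for j in range(n_cols)]
    hi = [max(r[j] for r in rows) for j in range(n_cols)]
    return [
        [(r[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0 for j in range(n_cols)]
        for r in rows
    ]


def oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are equally frequent."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def knn_predict(train_rows, train_labels, query, k=3):
    """Majority vote among the k training rows closest to query (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(r, query)), y)
        for r, y in zip(train_rows, train_labels)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]


def accuracy_sensitivity(y_true, y_pred, positive=1):
    """Accuracy = correct / total; sensitivity = true positives / actual positives."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return correct / len(y_true), tp / actual_pos


# Toy data: two hypothetical blood-test features, None marks a missing value.
X_raw = [[4.0, 200.0], [None, 220.0], [8.0, 100.0], [7.5, None], [8.5, 90.0], [9.0, 110.0]]
y_raw = [1, 1, 0, 0, 0, 0]  # 1 = positive (minority class), 0 = negative

X_clean = min_max_scale(impute_mean(X_raw))
X_bal, y_bal = oversample(X_clean, y_raw)
prediction = knn_predict(X_bal, y_bal, query=[0.0, 0.9], k=3)
```

In a real evaluation, the imputation means and scaling ranges would be computed on the training split only, and oversampling would be applied to the training split only, so that the reported accuracy and sensitivity are not inflated by information leaking from the test data.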





Figure: Details of the applied deep learning model




How to Cite

D. A. Salem and E. M. Hashim, “Impact of Data Pre-Processing on Covid-19 Diagnosis Using Machine Learning Algorithms”, Int J Intell Syst Appl Eng, vol. 11, no. 1s, pp. 164–171, Jan. 2023.