Development of ETL pipeline for Electronic Health Record to support Machine Learning based Approaches for Security and Prediction

Authors

  • Birendra Kumar Saraswat, Computer Sciences & Engineering, GLA University, Mathura - 281406, U.P., India.
  • Neeraj Varshney, Department of Computer Engineering and Applications, GLA University, Mathura, U.P., India.
  • P. C. Vashist, Department of Information Technology, GL Bajaj Institute of Technology and Management, Greater Noida - 201306, U.P., India.

Keywords:

Electronic Health Record, Extract, Transform, Load, Pipeline, Data Source, Machine Learning

Abstract

Electronic Health Records (EHRs) contain a wealth of information about a patient's medical history, treatments, and health outcomes. However, EHR data is often unstructured and scattered across multiple systems, which makes it difficult to extract meaningful insights. An extract, transform, and load (ETL) pipeline addresses this challenge by integrating EHR data efficiently and transforming it into a standardized format suitable for machine learning-based approaches to security and prediction. In this paper, the pipeline begins by identifying the data sources and the types of data to be extracted. These range from structured data, such as diagnosis codes and lab results, to unstructured data, such as doctors' notes and imaging reports. Once the sources are identified, the pipeline is designed to extract data from them securely and efficiently. The proposed model then transforms the extracted data into a standardized format that machine learning algorithms can consume, which involves cleaning the data, handling missing values, and converting it into a structured form. The proposed model obtained 96.59% accuracy, 96.36% precision, 95.64% recall, a 97.56% F1-score, a 96.66% false positive rate, and a 93.39% false negative rate.
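
The extract, transform, and load stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record fields (`patient_id`, `diagnosis_code`, `glucose_mg_dl`, `note`), the inline sample data, the median imputation, and the SQLite target are all assumptions chosen only to make the three stages concrete.

```python
import json
import sqlite3
from statistics import median

# Hypothetical raw export standing in for EHR source data.
# All field names and values here are illustrative assumptions.
RAW_JSON = """
[
  {"patient_id": "p1", "diagnosis_code": "E11.9", "glucose_mg_dl": "148", "note": " Follow-up in 3 months. "},
  {"patient_id": "p2", "diagnosis_code": "i10", "glucose_mg_dl": null, "note": null},
  {"patient_id": "p3", "diagnosis_code": "E11.9", "glucose_mg_dl": "102", "note": "Stable"}
]
"""


def extract(raw):
    """Extract: parse the raw JSON export into a list of dicts."""
    return json.loads(raw)


def transform(records):
    """Transform: normalise codes, cast lab values, fill missing values."""
    observed = [float(r["glucose_mg_dl"]) for r in records
                if r["glucose_mg_dl"] is not None]
    # Median imputation is an assumption made for this sketch only.
    fill = median(observed) if observed else None
    rows = []
    for r in records:
        glucose = r["glucose_mg_dl"]
        rows.append((
            r["patient_id"],
            r["diagnosis_code"].strip().upper(),            # consistent code casing
            float(glucose) if glucose is not None else fill,
            (r["note"] or "").strip(),                      # missing note -> empty string
        ))
    return rows


def load(rows, conn):
    """Load: write the structured rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ehr ("
        "patient_id TEXT PRIMARY KEY, diagnosis_code TEXT, "
        "glucose_mg_dl REAL, note TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO ehr VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_pipeline(raw, conn):
    """Run extract, transform, and load in order; return the loaded rows."""
    rows = transform(extract(raw))
    load(rows, conn)
    return rows


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    run_pipeline(RAW_JSON, connection)
    for row in connection.execute("SELECT * FROM ehr ORDER BY patient_id"):
        print(row)
```

Keeping each stage as a separate function means a source connector, a cleaning rule, or a storage target can be swapped without touching the others. A real EHR pipeline would also need access control and encryption around the extract step, which this sketch omits.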

Published

24.03.2024

How to Cite

Saraswat, B. K., Varshney, N., & Vashist, P. C. (2024). Development of ETL pipeline for Electronic Health Record to support Machine Learning based Approaches for Security and Prediction. International Journal of Intelligent Systems and Applications in Engineering, 12(19s), 168–189. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/5057

Section

Research Article