Robust Missing Data Handling using Intelligent Machine Learning Imputation Technique for Heterogeneous Dataset

Authors

  • Sowmya Venkatesh, Research Scholar, Dept. of Computer Science and Engineering, Dr. Ambedkar Institute of Technology, Bengaluru, Karnataka, India. Affiliated to Visvesvaraya Technological University, Belagavi-590018
  • Maragal Venkatamuni Vijay Kumar, Research Supervisor, Dept. of Information Science and Engineering, Dr. Ambedkar Institute of Technology, Bengaluru, Karnataka, India. Affiliated to Visvesvaraya Technological University, Belagavi-590018
  • Ashoka Davanageri Virupakshappa, Co-Supervisor, Dept. of Information Science and Engineering, JSS Academy of Technical Education, Bengaluru, Karnataka. Affiliated to Visvesvaraya Technological University, Belagavi-590018

Keywords:

Heterogeneous Datasets, Missing Data, Natural Language Processing, Imputation

Abstract

In data analysis, missing values are a common challenge, especially in heterogeneous datasets that combine numerical, categorical, and unstructured data. Handling missing data is crucial because it directly affects the quality and reliability of subsequent analysis and modeling, which calls for robust imputation methods that handle diverse data types effectively. To meet this need, this study presents a novel methodology for predicting and filling in missing values across a dataset that contains multiple variables. The proposed approach integrates Natural Language Processing (NLP) encoders, machine-learning-based feature extractors, and sequential regression imputation. To assess its practicality, the study evaluates its performance on a well-established heart disease dataset from the UCI Machine Learning Repository. The findings indicate that the method outperforms existing missing data imputation techniques, notably in terms of accuracy. This demonstration of practical viability addresses a significant concern in data preprocessing and analysis and reaffirms the importance of robust imputation for data-driven decision-making.
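As a rough illustration of the sequential regression idea mentioned in the abstract, the sketch below encodes a categorical column and then repeatedly regresses each incomplete numeric column on the remaining columns. It is not the authors' implementation: the indicator encoding stands in for the NLP encoders, ordinary least squares with a small ridge term stands in for the regression models, and the function names and toy values are hypothetical rather than taken from the UCI data.

```python
# Minimal sketch of sequential regression imputation on mixed-type data.
# Assumptions (not from the paper): indicator encoding replaces the NLP
# encoders, and ridge-stabilised least squares replaces the learned models.

def encode_categorical(values):
    """Indicator columns for every category except the first (reference level)."""
    cats = sorted(set(values))[1:]
    return [[1.0 if v == c else 0.0 for c in cats] for v in values]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit(X, y, lam=1e-6):
    """Least squares with intercept; the tiny ridge term keeps the system solvable."""
    Xb = [[1.0] + row for row in X]
    p = len(Xb[0])
    xtx = [[sum(r[i] * r[j] for r in Xb) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(Xb, y)) for i in range(p)]
    return solve(xtx, xty)

def predict(w, row):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], row))

def without(row, j):
    """Return the row with column j removed (the predictors for column j)."""
    return row[:j] + row[j + 1:]

def sequential_impute(rows, n_iter=5):
    """Impute None entries column by column, cycling n_iter times."""
    n_cols = len(rows[0])
    missing = [[v is None for v in row] for row in rows]
    data = [row[:] for row in rows]
    # Start from column means of the observed values.
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        mean = sum(observed) / len(observed)
        for r in data:
            if r[j] is None:
                r[j] = mean
    # Refine: regress each incomplete column on all other columns.
    for _ in range(n_iter):
        for j in range(n_cols):
            if not any(m[j] for m in missing):
                continue
            train = [(without(r, j), r[j]) for r, m in zip(data, missing) if not m[j]]
            w = fit([x for x, _ in train], [t for _, t in train])
            for r, m in zip(data, missing):
                if m[j]:
                    r[j] = predict(w, without(r, j))
    return data

# Hypothetical toy records: chest-pain type, age, cholesterol (one value missing).
chest_pain = ["typical", "atypical", "typical", "atypical", "typical", "atypical"]
age = [40.0, 50.0, 60.0, 45.0, 55.0, 65.0]
chol = [190.0, 200.0, 250.0, 185.0, None, 245.0]

encoded = encode_categorical(chest_pain)
rows = [enc + [a, c] for enc, a, c in zip(encoded, age, chol)]
imputed = sequential_impute(rows)
print(round(imputed[4][2], 2))  # imputed cholesterol for the fifth record
```

The toy cholesterol values were generated from an exact linear rule (50 + 3 × age + 20 for "typical"), so the sketch should recover a value close to 235 for the missing entry. Real heterogeneous data would need richer encoders and models than this.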

Published

24.03.2024

How to Cite

Venkatesh, S., Vijay Kumar, M. V., & Virupakshappa, A. D. (2024). Robust Missing Data Handling using Intelligent Machine Learning Imputation Technique for Heterogeneous Dataset. International Journal of Intelligent Systems and Applications in Engineering, 12(18s), 111–120. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/4956

Section

Research Article