Robust Missing Data Handling using Intelligent Machine Learning Imputation Technique for Heterogeneous Dataset
Keywords:
Heterogeneous Datasets, Missing Data, Natural Language Processing, Imputation
Abstract
Missing values are a common challenge in data analysis, especially in heterogeneous datasets that combine numerical, categorical, and unstructured data. Handling them matters because missing data directly affects the quality and reliability of subsequent analyses and models, which calls for robust imputation methods that work across diverse data types. This study presents a novel methodology for predicting and filling in missing values across all variables of a multivariable dataset. The proposed approach integrates Natural Language Processing (NLP) encoders, machine-learning-based feature extractors, and sequential regression imputation. To assess its practicality, the study evaluates the approach on a well-established heart disease dataset from the UCI Machine Learning Repository. The results indicate that the method outperforms existing missing data imputation techniques, notably in terms of accuracy. This demonstrates the practical viability of the approach and underlines the importance of robust imputation for improving the quality of data-driven decision-making.
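The general idea of sequential regression imputation on mixed data can be illustrated with a minimal sketch. It is not the method evaluated in the paper: a plain indicator encoding stands in for the NLP encoders and feature extractors, each incomplete column is predicted by ridge-regularised least squares on the other columns, and all function names (`encode_category`, `fit_ridge`, `sequential_impute`) and the toy values are hypothetical and were invented for illustration (they are not taken from the UCI dataset).

```python
from statistics import mean


def encode_category(values):
    """Indicator columns for every category except the first (reference) one."""
    cats = sorted(set(values))[1:]
    return [[float(v == c) for c in cats] for v in values]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge(xs, ys, ridge=1e-6):
    """Least-squares weights (intercept first) with a small ridge penalty."""
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


def sequential_impute(rows, rounds=5):
    """Fill None entries column by column, regressing each on the others."""
    n_cols = len(rows[0])
    missing = [[v is None for v in row] for row in rows]
    # Start from column means, then refine with regression predictions.
    means = [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    filled = [[means[j] if v is None else float(v) for j, v in enumerate(row)]
              for row in rows]
    for _ in range(rounds):
        for j in range(n_cols):
            train = [(r[:j] + r[j + 1:], r[j])
                     for r, m in zip(filled, missing) if not m[j]]
            if len(train) == len(filled):
                continue  # nothing missing in this column
            weights = fit_ridge([x for x, _ in train], [y for _, y in train])
            for r, m in zip(filled, missing):
                if m[j]:
                    features = [1.0] + r[:j] + r[j + 1:]
                    r[j] = sum(w * f for w, f in zip(weights, features))
    return filled


# Invented toy records: chol = 3 * age + 20 * (chest_pain == "typical") + 50.
chest_pain = ["typical", "atypical", "typical", "atypical", "typical", "atypical"]
age = [50, 60, 40, 45, 55, 70]
chol = [220, 230, 190, 185, None, 260]

matrix = [enc + [a, c]
          for enc, a, c in zip(encode_category(chest_pain), age, chol)]
completed = sequential_impute(matrix)
print(completed[4][2])  # close to 235, the value the toy rule implies
```

Because the toy target is an exact linear function of the encoded features, the regression recovers the missing entry almost exactly; on real data the imputed values would only be estimates, and the encoder and regressor would normally be replaced by stronger models.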
References
B. Al-Helali, Q. Chen, B. Xue, and M. Zhang, “A new imputation method based on genetic programming and weighted KNN for symbolic regression with incomplete data,” Soft Computing, Feb. 2021, doi: 10.1007/s00500-021-05590-y.
A. R. Ismail, N. Z. Abidin, and M. K. Maen, “Systematic Review on Missing Data Imputation Techniques with Machine Learning Algorithms for Healthcare,” Journal of Robotics and Control (JRC), vol. 3, no. 2, pp. 143–152, Feb. 2022, doi: 10.18196/jrc.v3i2.13133.
L. Yu, R. Zhou, R. Chen, and K. K. Lai, “Missing Data Preprocessing in Credit Classification: One-Hot Encoding or Imputation?,” Emerging Markets Finance and Trade, pp. 1–11, Oct. 2020, doi: 10.1080/1540496x.2020.1825935.
A. D. Woods et al., “Missing Data and Multiple Imputation Decision Tree,” PsyArXiv, Aug. 2021, doi: 10.31234/osf.io/mdw5r.
X. Miao, Y. Wu, L. Chen, Y. Gao, and J. Yin, “An Experimental Survey of Missing Data Imputation Algorithms,” IEEE Transactions on Knowledge and Data Engineering, pp. 1–20, 2022, doi: 10.1109/tkde.2022.3186498.
R. Pavithrakannan, N. B. Fenn, S. Raman, V. Kalyanaraman, V. K. Murugananthan and J. Janarthanan, “Imputation Analysis of Central Tendencies for Classification,” 2021 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS), Toronto, ON, Canada, 2021, pp. 1-7, doi: 10.1109/IEMTRONICS52119.2021.9422507.
K. Slavakis, G. N. Shetty, L. Cannelli, G. Scutari, U. Nakarmi and L. Ying, “Kernel Regression Imputation in Manifolds Via Bi-Linear Modeling: The Dynamic-MRI Case,” IEEE Transactions on Computational Imaging, vol. 8, pp. 133-147, 2022, doi: 10.1109/TCI.2022.3148062.
N. Karmitsa, S. Taheri, A. Bagirov and P. Mäkinen, “Missing Value Imputation via Clusterwise Linear Regression,” IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 4, pp. 1889-1901, 1 April 2022, doi: 10.1109/TKDE.2020.3001694.
M. Chen, H. Zhu, Y. Chen, and Y. Wang, “A Novel Missing Data Imputation Approach for Time Series Air Quality Data Based on Logistic Regression,” Atmosphere, vol. 13, no. 7, p. 1044, Jun. 2022, doi: 10.3390/atmos13071044.
D. M. P. Murti, U. Pujianto, A. P. Wibawa and M. I. Akbar, “K-Nearest Neighbor (K-NN) based Missing Data Imputation,” 2019 5th International Conference on Science in Information Technology (ICSITech), Yogyakarta, Indonesia, 2019, pp. 83-88, doi: 10.1109/ICSITech46713.2019.8987530.
B. N. Vi, D. Tan Nguyen, C. T. Tran, H. Phuc Ngo, C. C. Nguyen and H. -H. Phan, “Multiple Imputation by Generative Adversarial Networks for Classification with Incomplete Data,” 2021 RIVF International Conference on Computing and Communication Technologies (RIVF), Hanoi, Vietnam, 2021, pp. 1-6, doi: 10.1109/RIVF51545.2021.9642138.
Y. Sun, J. Li, Y. Xu, T. Zhang, and X. Wang, “Deep learning versus conventional methods for missing data imputation: A review and comparative study,” Expert Systems with Applications, vol. 227, p. 120201, Oct. 2023, doi: 10.1016/j.eswa.2023.120201.
E. O. Abiodun, A. Alabdulatif, O. I. Abiodun, M. Alawida, A. Alabdulatif, and R. S. Alkhawaldeh, “A systematic review of emerging feature selection optimization methods for optimal text classification: the present state and prospective opportunities,” Neural Computing and Applications, vol. 33, no. 22, pp. 15091–15118, Aug. 2021, doi: 10.1007/s00521-021-06406-8.
M. I. Gabr, Y. M. Helmy, and D. S. Elzanfaly, “Effect of Missing Data Types and Imputation Methods on Supervised Classifiers: An Evaluation Study,” Big Data and Cognitive Computing, vol. 7, no. 1, p. 55, Mar. 2023, doi: 10.3390/bdcc7010055.
B. Mirza, W. Wang, J. Wang, H. Choi, N. C. Chung, and P. Ping, “Machine Learning and Integrative Analysis of Biomedical Big Data,” Genes, vol. 10, no. 2, Jan. 2019, doi: 10.3390/genes10020087.
M. Yumuş, M. Apaydın, A. Değirmenci and Ö. Karal, “Missing data imputation using machine learning based methods to improve HCC survival prediction,” 2020 28th Signal Processing and Communications Applications Conference (SIU), Gaziantep, Turkey, 2020, pp. 1-4, doi: 10.1109/SIU49456.2020.9302222.
S. A. Ansari, C. Sharma and T. Agarwal, “Mean and Prediction Imputation-Based Approach for Predicting Water Potability Using Machine Learning,” 2022 10th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 2022, pp. 1-6, doi: 10.1109/ICRITO56286.2022.9964809.
S. Tabassum, N. Abedin, R. I. Maruf, M. Taufiq Ahmed and A. Ahmed, “Improving Health Status Prediction by Applying Appropriate Missing Value Imputation Technique,” 2022 IEEE 4th Global Conference on Life Sciences and Technologies (LifeTech), Osaka, Japan, 2022, pp. 345-348, doi: 10.1109/LifeTech53646.2022.9754794.
A. Deshmukh, J. Choudhary and D. P. Singh, “Multi Kernel Scaled Deep Time Series Imputation,” 2022 8th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 2022, pp. 829-834, doi: 10.1109/ICACCS54159.2022.9784998.
A. Hassan and N. Yousaf, “Bankruptcy Prediction using Diverse Machine Learning Algorithms,” 2022 International Conference on Frontiers of Information Technology (FIT), Islamabad, Pakistan, 2022, pp. 106-111, doi: 10.1109/FIT57066.2022.00029.
V. Peter and Ma. Sheila, “Cardiovascular disease prediction with imputation techniques and recursive feature elimination,” Nucleation and Atmospheric Aerosols, Jan. 2023, doi: 10.1063/5.0124079.
“UCI Machine Learning Repository,” archive.ics.uci.edu. https://archive.ics.uci.edu/dataset/45/heart+disease
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Under this license, anyone who uses the material must give appropriate credit, provide a link to the license, and indicate if changes were made; those who remix, transform, or build upon the material must distribute their contributions under the same license as the original.