A Hybrid Modified Deep Learning Data Imputation Method for Numeric Datasets

Keywords: Missing values, data imputation, deep learning, random forest

Abstract

Missing data is a major problem for both machine learning and data mining methods. Most of these methods cannot work with missing data, and the performance of those that can may also suffer. Imputation is a data preprocessing method that replaces missing data with appropriate values. This study aims to develop a hybrid modified imputation method based on a deep learning approach. For this purpose, we combine Random Forest (RF) with DataWig deep learning imputation (DLI) in a method called RF-DLI. DataWig is a deep learning-based library that supports missing value imputation for all types of data. The RF-DLI approach imputes missing data in three steps. First, the importance of each attribute of the dataset is determined with RF. Second, the most important 50% of the attributes are selected. Finally, each missing value is imputed with DataWig using these most important attributes. The study uses six real-world datasets from different fields, each with 30% missing data. The imputation performance of RF-DLI is compared with that of the KNN, MICE, and MEAN imputation approaches in terms of the MAE, RMSE, and R2 evaluation metrics. The results show that, in most cases, RF-DLI imputes better than the other techniques.
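
The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions: importance is ranked separately for each incomplete column, with that column as the regression target; missing predictor values are mean-filled so the forest can be fitted; and a second Random Forest stands in for DataWig so the example runs with only NumPy and scikit-learn. The function name rf_select_then_impute, the synthetic data, and the metric helpers are all hypothetical. With DataWig installed, the stand-in model would presumably be replaced by a datawig.SimpleImputer trained on the selected columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rf_select_then_impute(X, keep_fraction=0.5, random_state=0):
    """Impute NaNs column by column using only the top-ranked predictor columns."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Assumption: mean-fill a working copy so the forest sees complete inputs.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    result = X.copy()
    n_features = X.shape[1]
    for j in range(n_features):
        rows_missing = missing[:, j]
        if not rows_missing.any():
            continue
        observed = ~rows_missing
        others = np.array([c for c in range(n_features) if c != j])
        # Step 1: rank the other attributes by RF importance for column j.
        ranker = RandomForestRegressor(n_estimators=50, random_state=random_state)
        ranker.fit(filled[observed][:, others], X[observed, j])
        # Step 2: keep the most important fraction (50% by default).
        k = max(1, int(np.ceil(keep_fraction * len(others))))
        top = others[np.argsort(ranker.feature_importances_)[::-1][:k]]
        # Step 3: impute from the selected attributes only.
        # Stand-in for DataWig: a plain Random Forest regressor.
        model = RandomForestRegressor(n_estimators=50, random_state=random_state)
        model.fit(filled[observed][:, top], X[observed, j])
        result[rows_missing, j] = model.predict(filled[rows_missing][:, top])
    return result


def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Synthetic numeric data (hypothetical): two base columns and two derived ones.
rng = np.random.default_rng(0)
a, b = rng.normal(size=120), rng.normal(size=120)
X_true = np.column_stack([
    a,
    b,
    a + b + 0.1 * rng.normal(size=120),
    a - b + 0.1 * rng.normal(size=120),
])

# Remove 30% of the entries completely at random, then impute them.
mask = rng.random(X_true.shape) < 0.30
X_missing = np.where(mask, np.nan, X_true)
imputed = rf_select_then_impute(X_missing)

# Score only the entries that were removed.
print("MAE :", mae(X_true[mask], imputed[mask]))
print("RMSE:", rmse(X_true[mask], imputed[mask]))
print("R2  :", r2(X_true[mask], imputed[mask]))
```

Scoring only the masked entries mirrors the evaluation the abstract describes, where known values are removed and the imputed values are compared against them with MAE, RMSE, and R2.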

Published
2021-03-31
How to Cite
[1]
N. Peker and C. Kubat, “A Hybrid Modified Deep Learning Data Imputation Method for Numeric Datasets”, IJISAE, vol. 9, no. 1, pp. 6-11, Mar. 2021.
Section
Research Article