Approaches to handle Data Imbalance Problem in Predictive Machine Learning Models: A Comprehensive Review

Authors

  • Govind M. Poddar, Rajendra V. Patil, Satish Kumar N.

Keywords

Machine Learning, Imbalanced data, Sampling, Data Preprocessing, Data Mining, Class imbalance

Abstract

A business organization's ability to grow and flourish depends largely on how successfully it understands and utilizes the data it has collected; data has become increasingly vital in today's society. At present, every company or organization accumulates massive volumes of data across a range of areas such as finance, trade, business, and healthcare. Medical data may be provided by clinics, doctors, healthcare providers, and insurance establishments. Once the necessary medical datasets are located, the next phase is to investigate and apply appropriate modeling algorithms to mine substantial information for prediction. Imbalanced data is a significant challenge in machine learning, arising when the distribution of examples in a dataset is uneven and one class considerably outnumbers the others. This leads to biased models and reduced performance, affecting the quality and reliability of machine learning algorithms. This paper presents a detailed review of the causes of imbalanced data, its impact, and the algorithmic procedures for handling unevenly distributed data. We explore the various techniques and algorithms that address the problem, their advantages and drawbacks, and the evaluation metrics used to assess the performance of procedures for handling imbalanced datasets.
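To make the kind of techniques surveyed in this review concrete, the short Python sketch below applies SMOTE oversampling (via the open-source imbalanced-learn library) to a synthetic, skewed dataset and evaluates the result with imbalance-aware metrics. The synthetic dataset, the random-forest classifier, and all parameter values are illustrative assumptions for this sketch only; they are not the experimental setup of the paper.

    # Minimal sketch: oversample a skewed binary dataset with SMOTE and
    # evaluate with imbalance-aware metrics. Dataset, classifier and
    # parameters are illustrative assumptions, not the paper's setup.
    from collections import Counter

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import balanced_accuracy_score, f1_score
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    # Synthetic dataset with a roughly 95:5 class distribution.
    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.95, 0.05], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=42)
    print("Class counts before resampling:", Counter(y_train))

    # Resample only the training split so no synthetic samples leak
    # into the evaluation data.
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
    print("Class counts after resampling:", Counter(y_res))

    clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
    y_pred = clf.predict(X_test)

    # Plain accuracy is misleading under imbalance; report metrics that
    # reflect minority-class performance instead.
    print("F1 (minority class):", f1_score(y_test, y_pred))
    print("Balanced accuracy:", balanced_accuracy_score(y_test, y_pred))

A cost-sensitive alternative, also covered in the literature reviewed here, is to leave the data untouched and instead pass class_weight='balanced' to a scikit-learn classifier so that misclassifying minority-class examples is penalized more heavily.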

Published

26.03.2024

How to Cite

Govind M. Poddar, Rajendra V. Patil, Satish Kumar N. (2024). Approaches to handle Data Imbalance Problem in Predictive Machine Learning Models: A Comprehensive Review. International Journal of Intelligent Systems and Applications in Engineering, 12(21s), 841–856. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/5481

Issue

Section

Research Article
