Approaches to handle Data Imbalance Problem in Predictive Machine Learning Models: A Comprehensive Review
Keywords:
Machine Learning, Imbalanced data, Sampling, Data Preprocessing, Data Mining, Class imbalance
Abstract
A business organization's ability to grow and flourish relies largely on how well it understands and uses the data it has collected, and data has become increasingly vital in today's society. Companies and organizations now accumulate massive volumes of data across areas such as finance, trade, business, and healthcare. Medical data, for example, may be provided by clinics, doctors, healthcare providers, and insurance establishments. Once the necessary medical datasets have been located, the next phases are to investigate and apply appropriate modeling algorithms to mine substantial information for prediction. Imbalanced data is a significant challenge in machine learning: the distribution of data elements in a dataset is uneven, with one class considerably outnumbering the others. This imbalance leads to biased models and reduced performance, which affects the quality and reliability of machine learning algorithms. This paper presents a detailed review of the causes of imbalanced data, its impact, and the algorithmic procedures for handling unevenly distributed data. We explore the techniques and algorithms that address the problem, their advantages and drawbacks, and the evaluation metrics used to assess the performance of procedures for handling imbalanced datasets.
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.