BDT: A Novel Approach to Handle Imbalanced Data in Machine Learning Models

Authors

  • Sunil Kumar, Ph.D. Research Scholar, Amity Institute of Information Technology, Amity University Uttar Pradesh, Lucknow Campus, India.
  • S. K. Singh, Professor, Amity Institute of Information Technology, Amity University Uttar Pradesh, Lucknow Campus, India.
  • Vishal Nagar, Professor, Department of Computer Science and Engineering, Pranveer Singh Institute of Technology, Kanpur, Uttar Pradesh, India.

Keywords:

Data Imbalance, Machine Learning, Under-Sampling, Over-Sampling, Model Performance, Algorithm Adjustment, Imbalanced Data Correction Technique

Abstract

Imbalanced datasets pose a significant challenge in machine learning and data science, often leading to biased models and inaccurate predictions. This research introduces a novel technique for mitigating the effects of data imbalance and thereby improving model performance across several metrics. After examining existing imbalance correction methods and identifying key gaps, the study proposes the Balanced Data Technique (BDT), which combines under-sampling, over-sampling, and algorithmic adjustment methods in a single framework. In experiments across multiple imbalanced datasets, the technique outperforms established methods, as evidenced by improved accuracy, precision, and recall scores. This paper details the development of the technique, from its theoretical underpinnings through practical implementation and testing. The results have implications for fields where imbalanced data is prevalent. By addressing this fundamental issue, the proposed technique contributes to more equitable and effective machine learning models.
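
The abstract does not say how BDT combines its three ingredients, so the following is only a generic illustration of what under-sampling, over-sampling, and an algorithm-level adjustment can look like together; it is not the authors' algorithm. The function names `rebalance` and `balanced_class_weights`, the `target_per_class` parameter, and the toy data are all assumptions made for this sketch, which uses plain random resampling and inverse-frequency class weights.

```python
import numpy as np

def rebalance(X, y, target_per_class, seed=0):
    """Resample every class to exactly target_per_class rows.

    Classes larger than the target are randomly under-sampled (without
    replacement); classes smaller than it are randomly over-sampled
    (with replacement). Hypothetical helper, not part of BDT.
    """
    rng = np.random.default_rng(seed)
    picked = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        # Sample with replacement only when the class is too small.
        replace = idx.size < target_per_class
        picked.append(rng.choice(idx, size=target_per_class, replace=replace))
    picked = np.concatenate(picked)
    return X[picked], y[picked]

def balanced_class_weights(y):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    labels, counts = np.unique(y, return_counts=True)
    weights = y.size / (labels.size * counts)
    return dict(zip(labels.tolist(), weights.tolist()))

# Toy data (assumed): 90 majority rows (label 0), 10 minority rows (label 1).
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# Algorithm-level adjustment: weights a cost-sensitive learner could use.
weights_before = balanced_class_weights(y)

# Data-level adjustment: shrink the majority and grow the minority to 30 each.
X_res, y_res = rebalance(X, y, target_per_class=30)
weights_after = balanced_class_weights(y_res)
```

On the original labels the minority class receives a weight nine times that of the majority class, while after resampling both classes carry equal weight. In practice, resampling should be applied to the training split only, so that evaluation data keeps its original class distribution.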

Published

24.03.2024

How to Cite

Kumar, S., Singh, S. K., & Nagar, V. (2024). BDT: A Novel Approach to Handle Imbalanced Data in Machine Learning Models. International Journal of Intelligent Systems and Applications in Engineering, 12(20s), 691–703. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/5199

Section

Research Article