An effective approach for determining sample size that optimizes the performance of classifier

determining sample size that optimizes the performance of classifier

Authors

Keywords:

: Learning curve, model optimization, random forest, effective sample size

Abstract

The goal of machine learning is to create a model that performs well and gives accurate prediction outcome in a particular set of
classification tasks. In order to achieve higher performance, machine-learning model has to be optimized. Literature shows that parameter
and hyper-parameter tuning is most widely used for model optimization. In most classification tasks, the dataset is divided into training
and testing set with 70% and 30% for training and test respectively. However, the 70% training and 30% testing set division does not
grantee better predictive outcome for all classification. Thus, this study proposes learning curve for analysis of the effect of data size on
the performance of classification model using real world heart disease dataset employing random forest model. The experimental result
shows that data sample size has significant effect on the performance of random forest model. Learning curve is the best approach for
determining the sample size for classification task using machine-learning model.

Downloads

Download data is not yet available.

References

[1] T.A. Assegie, and P.S Nair, “Handwritten digits recognition with
decision tree classification: a machine learning approach,”
International Journal of Electrical and Computer Engineering., Vol. 9,
No. 5, October 2019, Art. no. pp. 4446~4451.
[2] T.A Assegie, R.L Tulasi, N.K Kumar, “Breast cancer prediction model
with decision tree and adaptive boosting,” IAES International Journal
of Artificial Intelligence, Vol. 10, No. 1, March 2021, Art. no. pp.
184~190.
[3] X. Chen, “Coronary Artery Disease Detection by Machine Learning with
Coronary Bifurcation Features, “Appl. Sci. 2020, 10, 7656; doi:
10.3390/app10217656.
[4] P. Sujatha and K. Mahalakshmi, “Performance Evaluation of
Supervised Machine Learning Algorithms in Prediction of Heart
disease,” 2020 IEEE International Conference for Innovation in
Technology, Nov 6-8, 2020.
[5] T.N Nguyen, “Multi-class Support Vector Machine Algorithm for Heart
Disease Classification”, International Conference on Green
Technology and Sustainable Development, 2020.
[6] M. Benllarch, “Improve Extremely Fast Decision Tree Performance
through Training Dataset Size for Early Prediction of Heart Diseases”,
IEEE, 2019.
[7] F. Tasnim, S.U Habiba, “A Comparative Study on Heart Disease
Prediction Using Data Mining Techniques and Feature Selection”,
International Conference on Robotics, Electrical and Signal
Processing Techniques, IEEE, 2021.
[8] Ni. Gupta, “Intelligent heart disease prediction in cloud environment
through ensembling”, Expert Systems,
https://doi.org/10.1111/exsy.12207, 2017.
[9] R.G. Franklin, “Survey of Heart Disease Prediction and Identification
using Machine Learning Approaches”, Proceedings of the Third
International Conference on Intelligent Sustainable Systems, 2020.
[10]R.L Fueroa, Q.Z Treitler, S. Kndula and L.H Ngo, “Predicting sample
size required for classification performance”, BMC Medical
Informatics and Decision Making 2012, 12:8.
[11]T.A Assegie, R. L Tulasi, N.K Kumar, “Breast cancer prediction model
with decision tree and adaptive boosting”, IAES International Journal
of Artificial Intelligence, Vol. 10, No. 1, March 2021, pp. 184-190.
[12] K.M Almustafa, “Prediction of heart disease and classifiers’ sensitivity
analysis”, BMC, Bioinformatics, 2020, 21:278
https://doi.org/10.1186/s12859-020-03626-y
Performance of random forest on heart disease dataset.

Downloads

Published

27.05.2022

How to Cite

Assegie, T. A. (2022). An effective approach for determining sample size that optimizes the performance of classifier: determining sample size that optimizes the performance of classifier. International Journal of Intelligent Systems and Applications in Engineering, 10(2), 222–225. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/1485

Issue

Section

Research Article