Data Mining-Based K-Nearest Neighbor Technique for Multiclass Dataset Feature Selection and Classification

Authors

  • R. Senthamil Selvi, K. Fathima Bibi

Keywords:

Machine learning, Hybrid, training and testing, Dataset, features, K-fold cross-validation

Abstract

Data analysis extracts useful information from small or large datasets and yields insights for future recommendations and decision-making. Predictive analytics applies data mining and machine learning techniques to make predictions. Previous algorithms, however, leave room for improvement: they may fail to find an optimal solution to the finite problem, and selecting parameters for a dataset is complicated. The previous paper on Hybrid Feature Selection-based Binary ACO (HFSBACO) [2] achieved 98.6%, but it had several difficulties: the dataset passes through complex stages, prediction is inefficient because the data demand considerable time and resources, and relevant information is challenging to extract.

To overcome these issues, we propose a machine learning technique based on K-Nearest Neighbor (KNN) classification for predicting multiple datasets from their features. The input datasets are taken from the UCI repository. First, each dataset is pre-processed to remove irrelevant, missing, and noisy data. Before the model is built, Feature Correlation Coefficients (FCC) between the dependent and independent features are analyzed to determine the strength of the relationship between each dependent and independent feature. The pre-processed data are then split into 70% training and 30% testing data for feature selection. The second stage extracts the relevant data using the Enhanced Binary Cuckoo Search with Ant Colony Optimization algorithm (EBCS-ACO), which selects features based on their nearest feature threshold weights or values; the ACO component estimates the feature weights and maintains their sequence order. Before classification, K-fold cross-validation is applied, because training and testing metrics vary and some approaches call for iterative validation. For each sample, the quality measures are determined by Receiver Operating Characteristic (ROC) curve analysis, which serves to assess and compare classification models objectively. The last step classifies the dataset with the KNN algorithm and evaluates the result on the training and testing data. The classification model is assessed on precision, recall, accuracy, F1-score, ROC, and time complexity, and it achieves better accuracy and prediction rate than previous methods.
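
The stages named in the abstract (correlation analysis, a 70/30 split, feature selection, K-fold cross-validation, and KNN classification) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify EBCS-ACO, so a plain correlation-threshold filter stands in for it, synthetic three-class data stands in for the UCI datasets, and every function name and parameter value (threshold 0.5, k = 3, 5 folds) is an assumption chosen for illustration.

```python
import math
import random
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def select_features(X, y, threshold=0.5):
    """Indices of features whose |correlation| with the label meets the threshold.

    Assumption: a simple filter standing in for EBCS-ACO, which is not specified here.
    """
    return [j for j in range(len(X[0]))
            if abs(pearson([row[j] for row in X], y)) >= threshold]


def project(X, cols):
    """Keep only the selected feature columns."""
    return [[row[j] for j in cols] for row in X]


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points nearest to query (Euclidean)."""
    pairs = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    return Counter(label for _, label in pairs[:k]).most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, q, k) == t for q, t in zip(test_X, test_y))
    return hits / len(test_y)


def k_fold_accuracy(X, y, folds=5, k=3):
    """Mean KNN accuracy when each of `folds` interleaved subsets is held out once."""
    scores = []
    for f in range(folds):
        held = set(range(f, len(X), folds))
        tr = [i for i in range(len(X)) if i not in held]
        te = sorted(held)
        scores.append(accuracy([X[i] for i in tr], [y[i] for i in tr],
                               [X[i] for i in te], [y[i] for i in te], k))
    return sum(scores) / folds


# Synthetic three-class data: two informative features and one noise feature.
rng = random.Random(0)
X, y = [], []
for label in range(3):
    for _ in range(50):
        X.append([5 * label + rng.gauss(0, 1), 5 * label + rng.gauss(0, 1), rng.uniform(0, 1)])
        y.append(label)

# Shuffle, then split 70% training / 30% testing.
order = list(range(len(X)))
rng.shuffle(order)
cut = len(order) * 7 // 10
train_X, train_y = [X[i] for i in order[:cut]], [y[i] for i in order[:cut]]
test_X, test_y = [X[i] for i in order[cut:]], [y[i] for i in order[cut:]]

# Select features on the training data only, then validate and test.
selected = select_features(train_X, train_y)
cv_acc = k_fold_accuracy(project(train_X, selected), train_y)
test_acc = accuracy(project(train_X, selected), train_y, project(test_X, selected), test_y)
print("selected features:", selected)
print("cross-validated accuracy:", cv_acc)
print("held-out accuracy:", test_acc)
```

Selecting features on the training portion alone keeps the held-out 30% unseen until the final evaluation. Correlating features with a numeric class label is a crude proxy that happens to suit this synthetic data, and a real multiclass study would also report precision, recall, F1-score, and ROC analysis as the abstract describes.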

References

Ruchika Singh Rajput, Dr. Jitendra Agrawal, Dr. Sanjeev Sharma, "Binary Cuckoo Search based Hybrid Classification Techniques", IJCST, Vol. 8, Issue 1, Jan - March 2017.

R. Senthamil Selvi, K. Fathima Bibi, "A Machine Learning-Based Hybrid Approach to Subset Selection Using Binary Ant Colony Optimization Functions", SN Computer Science, vol. 4, no. 853, pp. 1-7, 8 November 2023, doi: https://doi.org/10.1007/s42979-023-02251-9.

Q. Lou, Z. Deng, K. -S. Choi, H. Shen, J. Wang and S. Wang, "Robust Multilabel Relief Feature Selection Based on Fuzzy Margin Co-Optimization," in IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 6, no. 2, pp. 387-398, April 2022, doi: 10.1109/TETCI.2020.3044679.

K. Yu, L. Liu, J. Li, W. Ding and T. D. Le, "Multi-Source Causal Feature Selection," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 9, pp. 2240-2256, 1 Sept. 2020, doi: 10.1109/TPAMI.2019.2908373.

Z. Xiong, Y. Yuan and Q. Wang, "RGB-D Scene Recognition via Spatial-Related Multimodal Feature Learning," in IEEE Access, vol. 7, pp. 106739-106747, 2019, doi: 10.1109/ACCESS.2019.2932080.

M. Usman, U. K. Yusof and S. Naim, "Filter-Based Multiobjective Feature Selection Using NSGA III and Cuckoo Optimization Algorithm," in IEEE Access, vol. 8, pp. 76333-76356, 2020, doi: 10.1109/ACCESS.2020.2987057.

Y. Zhang, D. -w. Gong and J. Cheng, "Multiobjective Particle Swarm Optimization Approach for Cost-Based Feature Selection in Classification," in IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 14, no. 1, pp. 64-75, 1 Jan.-Feb. 2017, doi: 10.1109/TCBB.2015.2476796.

T. Xu and L. Zhao, "A Structure-Induced Framework for Multilabel Feature Selection With Highly Incomplete Labels," in IEEE Access, vol. 8, pp. 71219-71230, 2020, doi: 10.1109/ACCESS.2020.2987922.

X. Zhu, S. Zhang, Y. Zhu, P. Zhu and Y. Gao, "Unsupervised Spectral Feature Selection With Dynamic Hyper-Graph Learning," in IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 6, pp. 3016-3028, 1 June 2022, doi: 10.1109/TKDE.2020.3017250.

D. R. Wijaya and F. Afianti, "Information-Theoretic Ensemble Feature Selection With Multi-Stage Aggregation for Sensor Array Optimization," in IEEE Sensors Journal, vol. 21, no. 1, pp. 476-489, 1 Jan. 2021, doi: 10.1109/JSEN.2020.3000756.

L. Y. Yab, N. Wahid and R. A. Hamid, "A Meta-Analysis Survey on the Usage of Meta-Heuristic Algorithms for Feature Selection on High-Dimensional Datasets," in IEEE Access, vol. 10, pp. 122832-122856, 2022, doi: 10.1109/ACCESS.2022.3221194.

N. L. S. Albashah and H. M. Rais, "Population Initialization Factor in Binary Multiobjective Grey Wolf Optimization for Features Selection," in IEEE Access, vol. 10, pp. 114942-114958, 2022, doi: 10.1109/ACCESS.2022.3218056.

Q. Al-Tashi et al., "Binary Multiobjective Grey Wolf Optimizer for Feature Selection in Classification," in IEEE Access, vol. 8, pp. 106247-106263, 2020, doi: 10.1109/ACCESS.2020.3000040.

G. Sharifai and Z. B. Zainol, "Multiple Filter-Based Rankers to Guide Hybrid Grasshopper Optimization Algorithm and Simulated Annealing for Feature Selection With High Dimensional Multiclass Imbalanced Datasets," in IEEE Access, vol. 9, pp. 74127-74142, 2021, doi: 10.1109/ACCESS.2021.3081366.

M. Ramona, G. Richard and B. David, "Multiclass Feature Selection With Kernel Gram-Matrix-Based Criteria," in IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 10, pp. 1611-1623, Oct. 2012, doi: 10.1109/TNNLS.2012.2201748.

S. D. A. Bujang et al., "Multiclass Prediction Model for Student Grade Prediction Using Machine Learning," in IEEE Access, vol. 9, pp. 95608-95621, 2021, doi: 10.1109/ACCESS.2021.3093563.

J. Wu, P. Guo, Y. Cheng, H. Zhu, X. -B. Wang and X. Shao, "Ensemble Generalized Multiclass Support-Vector-Machine-Based Health Evaluation of Complex Degradation Systems," in IEEE/ASME Transactions on Mechatronics, vol. 25, no. 5, pp. 2230-2240, Oct. 2020, doi: 10.1109/TMECH.2020.3009449.

M. K. Keleş and Ü. Kılıç, "Artificial Bee Colony Algorithm for Feature Selection on SCADI Dataset," 2018 3rd International Conference on Computer Science and Engineering (UBMK), Sarajevo, Bosnia and Herzegovina, 2018, pp. 463-466, doi: 10.1109/UBMK.2018.8566287.

Z. Wang, X. Xiao and S. Rajasekaran, "Novel and efficient randomized algorithms for feature selection," in Big Data Mining and Analytics, vol. 3, no. 3, pp. 208-224, Sept. 2020, doi: 10.26599/BDMA.2020.9020005.

L. Gong, S. Xie, Y. Zhang, M. Wang and X. Wang, "Hybrid Feature Selection Method Based on Feature Subset and Factor Analysis," in IEEE Access, vol. 10, pp. 120792-120803, 2022, doi: 10.1109/ACCESS.2022.3222812.

S. Li and D. Wei, "Extremely High-Dimensional Feature Selection via Feature Generating Samplings," in IEEE Transactions on Cybernetics, vol. 44, no. 6, pp. 737-747, June 2014, doi: 10.1109/TCYB.2013.2269765.

C. Chen, Y. Wan, A. Ma, L. Zhang and Y. Zhong, "A Decomposition-Based Multiobjective Clonal Selection Algorithm for Hyperspectral Image Feature Selection," in IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1-16, 2022, Art no. 5541516, doi: 10.1109/TGRS.2022.3216685.

S. Wang et al., "Research and Experiment of Radar Signal Support Vector Clustering Sorting Based on Feature Extraction and Feature Selection," in IEEE Access, vol. 8, pp. 93322-93334, 2020, doi: 10.1109/ACCESS.2020.2993270.

Q. Yu, J. Qian, S. Jiang, Z. Wu and G. Zhang, "An Empirical Study on the Effectiveness of Feature Selection for Cross-Project Defect Prediction," in IEEE Access, vol. 7, pp. 35710-35718, 2019, doi: 10.1109/ACCESS.2019.2895614.

Spolaôr, Newton; Cherman, Everton Alvares; Monard, Maria Carolina; Lee, Huei Diana (2013). A Comparison of Multilabel Feature Selection Methods using the Problem Transformation Approach. Electronic Notes in Theoretical Computer Science, 292, 135–151. doi:10.1016/j.entcs.2013.02.010.

N. Laopracha, K. Sunat and S. Chiewchanwattana, "A Novel Feature Selection in Vehicle Detection Through the Selection of Dominant Patterns of Histograms of Oriented Gradients (DPHOG)," in IEEE Access, vol. 7, pp. 20894-20919, 2019, doi: 10.1109/ACCESS.2019.2893320.

L. Sun, T. Yin, W. Ding and J. Xu, "Hybrid Multilabel Feature Selection Using BPSO and Neighborhood Rough Sets for Multilabel Neighborhood Decision Systems," in IEEE Access, vol. 7, pp. 175793-175815, 2019, doi: 10.1109/ACCESS.2019.2957662.

X. -T. Wang and X. -Z. Luan, "Bayesian Penalized Method for Streaming Feature Selection," in IEEE Access, vol. 7, pp. 103815-103822, 2019, doi: 10.1109/ACCESS.2019.2930346.

F. Nie, S. Yang, R. Zhang and X. Li, "A General Framework for Auto-Weighted Feature Selection via Global Redundancy Minimization," in IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2428-2438, May 2019, doi: 10.1109/TIP.2018.2886761.

Y. Tian, J. Zhang, L. Li and Z. Liu, "A Novel Sensor-Based Human Activity Recognition Method Based on Hybrid Feature Selection and Combinational Optimization," in IEEE Access, vol. 9, pp. 107235-107249, 2021, doi: 10.1109/ACCESS.2021.3100580.

Wiharto, E. Suryani, S. Setyawan and B. P. Putra, "The Cost-Based Feature Selection Model for Coronary Heart Disease Diagnosis System Using Deep Neural Network," in IEEE Access, vol. 10, pp. 29687-29697, 2022, doi: 10.1109/ACCESS.2022.3158752.

H. C. S. C. Lima, F. E. B. Otero, L. H. C. Merschmann and M. J. F. Souza, "A Novel Hybrid Feature Selection Algorithm for Hierarchical Classification," in IEEE Access, vol. 9, pp. 127278-127292, 2021, doi: 10.1109/ACCESS.2021.3112396.

Malek Alzaqebah; Khaoula Briki; Nashat Alrefai; Sami Brini; Sana Jawarneh; Mutasem K. Alsmadi; Rami Mustafa A. Mohammad; Ibrahim ALmarashdeh; Fahad A. Alghamdi; Nahier Aldhafferi; Abdullah Alqahtani (2021). Memory based Binary Cuckoo Search algorithm for feature selection of gene expression dataset. Informatics in Medicine Unlocked. doi:10.1016/j.imu.2021.100572.

R. Devi Priya, R. Sivaraj, N. Anitha, V. Devisurya, Tri-staged feature selection in multiclass heterogeneous datasets using memetic algorithm and Binary Cuckoo Search optimization, Expert Systems with Applications, Volume 209, 2022, 118286, ISSN 0957-4174, https://doi.org/10.1016/j.eswa.2022.118286.

Kashef, Shima; Nezamabadi-pour, Hossein (2015). An advanced ACO algorithm for feature subset selection. Neurocomputing, 147, 271–279. doi:10.1016/j.neucom.2014.06.067.

Esra Saraç, Selma Ayşe Özel, "An Ant Colony Optimization Based Feature Selection for Web Page Classification", The Scientific World Journal, vol. 2014, Article ID 649260, pp. 1-14, 2014.

Gite, S.; Patil, S.; Dharrao, D.; Yadav, M.; Basak, S.; Rajendran, A.; Kotecha, K. Textual Feature Extraction Using Ant Colony Optimization for Hate Speech Classification. Big Data Cogn. Comput. 2023, 7, 45.

NK. Sreeja, A. Sankar, Pattern Matching based Classification using Ant Colony Optimization based Feature Selection, Applied Soft Computing, Volume 31, 2015, Pages 91-102.

Chakravarty, Sujata & Mohapatra, Puspanjali (2015). Multiclass Classification using Binary Cuckoo Search-based hybrid network. pp. 953-960. doi: 10.1109/PCITC.2015.7438134.

R. Devi Priya, R. Sivaraj, N. Anitha, V. Devisurya, Tri-staged feature selection in multiclass heterogeneous datasets using memetic algorithm and Binary Cuckoo Search optimization, Expert Systems with Applications, Volume 209, 2022.

Qinwei Fan, Tongke Fan, "A Hybrid Model of Extreme Learning Machine Based on Bat and Binary Cuckoo Search Algorithm for Regression and Multiclass Classification", Journal of Mathematics, vol. 2021, Article ID 4404088, pp. 11, 2021.

Published

26.03.2024

How to Cite

R. Senthamil Selvi. (2024). Data Mining-Based K-Nearest Neighbor Technique for Multiclass Dataset Feature Selection and Classification. International Journal of Intelligent Systems and Applications in Engineering, 12(21s), 2469–2488. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/5850

Section

Research Article