Adaptive Dragonfly Optimization (ADO) Feature Selection Model and Distributed Bayesian Matrix Decomposition for Big Data Analytics
Keywords:
Distributed algorithm, Bayesian matrix decomposition, clustering, data mining, feature selection, Adaptive Dragonfly Optimization (ADO), big data
Abstract
Matrix decompositions are fundamental methods for extracting knowledge from the large data sets produced by contemporary applications. Processing extremely large amounts of data on a single machine is still inefficient or impractical. Distributed matrix decompositions are therefore necessary and practical tools for big data analytics, where the high dimensionality and complexity of large data sets hinder data mining. Current approaches incur long execution times, which makes it imperative to reduce the number of features that must be processed. This work presents a novel wrapper feature selection method that uses the Adaptive Dragonfly Optimization (ADO) algorithm to make the search space more appropriate for feature selection. ADO is used to transform continuous vector search spaces into binary representations. A Distributed Bayesian Matrix Decomposition (DBMD) model is then presented for clustering and mining voluminous data. Specifically, this work uses three strategies to model distributed computing: 1) accelerated gradient descent, 2) the alternating direction method of multipliers (ADMM), and 3) statistical inference. The theoretical convergence behaviours of these algorithms are examined, and experiments show that the proposed algorithms perform on par with or better than two common distributed approaches. The methods also scale effectively to large data sets. Clustering performance is assessed using precision, recall, F-measure, and Rand Index (RI), metrics that are better suited to imbalanced classes.
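The four clustering metrics named in the abstract can be illustrated with a small sketch. The abstract does not state how the metrics are computed, so the pair-counting definitions below are an assumption: a pair of samples counts as a true positive when both samples share a reference class and a predicted cluster. The function names `pair_counts` and `clustering_scores` are hypothetical and do not come from the paper.

```python
from itertools import combinations


def pair_counts(true_labels, pred_labels):
    """Count sample pairs by agreement between reference classes and clusters.

    Assumption: pair-counting definitions, which may differ from the
    definitions used in the paper.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = pred_labels[i] == pred_labels[j]
        if same_class and same_cluster:
            tp += 1  # pair correctly placed together
        elif same_cluster:
            fp += 1  # pair wrongly placed together
        elif same_class:
            fn += 1  # pair wrongly separated
        else:
            tn += 1  # pair correctly separated
    return tp, fp, fn, tn


def clustering_scores(true_labels, pred_labels):
    """Return pairwise precision, recall, F-measure, and Rand Index."""
    tp, fp, fn, tn = pair_counts(true_labels, pred_labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    total = tp + fp + fn + tn
    rand_index = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "rand_index": rand_index,
    }


if __name__ == "__main__":
    # Toy labels, made up for illustration only.
    reference = [0, 0, 0, 1, 1, 1]
    predicted = [0, 0, 1, 1, 2, 2]
    print(clustering_scores(reference, predicted))
```

Because the scores are computed over pairs of samples, they do not depend on how cluster identifiers are numbered, so a clustering that matches the reference classes up to a relabelling scores 1.0 on all four metrics.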
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Under this license, users must give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.