Utilization of Genetic Algorithm and Significance Scores for Feature Selection in the Interest of Increasing Accuracy of Fault Detection in Hard Disk Drives for HDFS
Keywords:
Genetic Algorithm, Significance Scores, Fault Detection in HDD, SMART
Abstract
Hard disk drives (HDDs) are the primary storage devices in computers and servers, and a sudden drive failure can cause permanent data loss. Most HDDs implement Self-Monitoring, Analysis and Reporting Technology (SMART), which tracks a range of performance metrics and reports on the drive's health. However, not every SMART attribute is a reliable indicator of an impending failure. In this paper, we propose a two-stage process for selecting the most informative failure indicators. In the first stage, a genetic algorithm (GA) reduces the SMART attributes to a compact subset whose feature vectors are easy to separate and cluster naturally; the GA chooses this subset based solely on the fitness of a set of SMART attribute pairs. In the second stage, a significance score that measures each feature's statistical impact on disk failures further refines the GA's selection. The resulting set of SMART attributes is used to train a naive Bayes classifier, a generative classifier. Extensive testing on a SMART dataset from a commercial datacentre shows that the proposed approach outperforms state-of-the-art alternatives in both failure detection rate and false alarm rate. The approach requires no fine-tuning of parameters or thresholds, and the classifier needs to be trained only on a reduced set of SMART attributes.
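The two selection stages can be illustrated with a minimal sketch. The abstract does not give the fitness function, the significance score, or the GA settings, so everything below is an assumption made for illustration: the pair fitness (squared class-centroid gap over summed variance), the Welch-style z statistic used as the significance score, the cut-off `z_min`, the GA hyperparameters, and the synthetic data standing in for SMART records. The sketch stops before classifier training; the surviving feature indices would then be passed to a naive Bayes classifier.

```python
import math
import random
from itertools import combinations
from statistics import mean, pvariance

def make_toy_data(n=200, n_features=6, seed=0):
    """Synthetic stand-in for SMART records: only features 0 and 1 shift on failure."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # 1 = failed drive, 0 = healthy drive
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        if label:
            row[0] += 3.0
            row[1] += 3.0
        X.append(row)
        y.append(label)
    return X, y

def class_stats(X, y):
    """Per-feature (mean, variance, count) for failed and for healthy drives."""
    stats = []
    for j in range(len(X[0])):
        bad = [r[j] for r, t in zip(X, y) if t == 1]
        good = [r[j] for r, t in zip(X, y) if t == 0]
        stats.append(((mean(bad), pvariance(bad), len(bad)),
                      (mean(good), pvariance(good), len(good))))
    return stats

def pair_score(stats, i, j):
    """Assumed pair fitness: squared centroid gap in the (i, j) plane over summed variance."""
    gap = sum((stats[k][0][0] - stats[k][1][0]) ** 2 for k in (i, j))
    spread = sum(stats[k][0][1] + stats[k][1][1] for k in (i, j))
    return gap / (spread + 1e-9)

def fitness(mask, stats):
    """Mean pair score over all pairs of selected features (0 if fewer than two)."""
    chosen = [j for j, bit in enumerate(mask) if bit]
    pairs = list(combinations(chosen, 2))
    return mean(pair_score(stats, i, j) for i, j in pairs) if pairs else 0.0

def run_ga(stats, pop_size=30, generations=60, p_mut=0.15, seed=1):
    """Plain GA over bit masks: tournament selection, one-point crossover, bit flips, elitism."""
    rng = random.Random(seed)
    n = len(stats)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    score = lambda m: fitness(m, stats)
    for _ in range(generations):
        nxt = [max(pop, key=score)]  # elitism: carry the best mask forward
        while len(nxt) < pop_size:
            p1 = max(rng.sample(pop, 3), key=score)  # tournament of size 3
            p2 = max(rng.sample(pop, 3), key=score)
            cut = rng.randrange(1, n)
            child = p1[:cut] + p2[cut:]
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=score)

def significance(stats, j):
    """Assumed significance score: Welch-style z statistic of the class mean gap."""
    (m1, v1, n1), (m0, v0, n0) = stats[j]
    return abs(m1 - m0) / math.sqrt(v1 / n1 + v0 / n0 + 1e-12)

def select_features(X, y, z_min=3.0):
    """Stage 1: GA picks a subset. Stage 2: keep only statistically significant features."""
    stats = class_stats(X, y)
    mask = run_ga(stats)
    ga_pick = [j for j, bit in enumerate(mask) if bit]
    return [j for j in ga_pick if significance(stats, j) >= z_min]

X, y = make_toy_data()
selected = select_features(X, y)
print("selected feature indices:", selected)
```

The cut-off `z_min` is a hypothetical convenience for this sketch and is not prescribed by the abstract. Scoring a bit mask by its mean pair score is one simple way to reward small subsets, since adding an uninformative attribute lowers the average across pairs.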
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Under this license, readers must give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.