Utilization of Genetic Algorithm and Significance Scores for Feature Selection in the Interest of Increasing Accuracy of Fault Detection in Hard Disk Drives for HDFS

Authors

  • B. K. Prasad Banavathu, Research Scholar, Computer Science and Engineering, Jawaharlal Nehru Technological University Anantapur (JNTUA), Ananthapuramu, Andhra Pradesh 515002, India
  • A. Ananda Rao, Professor, Computer Science and Engineering, Rayalaseema University, Kurnool, Andhra Pradesh 518001, India

Keywords:

Genetic Algorithm, Significance Scores, Fault Detection in HDD, SMART

Abstract

A hard disk drive (HDD) is a storage device used in computers and servers. If a drive fails suddenly, vital information can be lost permanently. Most HDDs include Self-Monitoring, Analysis and Reporting Technology (SMART), which tracks a variety of performance metrics and reports on the drive's health. However, not every SMART attribute is a reliable indicator of an impending failure. In this research, we propose a two-stage process for selecting the best HDD failure indicators. In the first stage, a genetic algorithm (GA) narrows the SMART attributes down to a manageable subset whose feature vectors are easy to separate and cluster naturally. The GA determines the best subset based solely on the fitness of a set of SMART attribute pairs. In the second stage, a significance score that measures each feature's statistical impact on disk failures refines the GA's selection further. The resulting set of SMART attributes is used to train a naive Bayes classifier, which is a generative classifier. In extensive testing on a SMART dataset obtained from a commercial datacentre, the proposed approach outperforms state-of-the-art alternatives in both failure detection rate and false alarm rate. No parameters or thresholds need to be fine-tuned, and the classifier needs to be trained only on a reduced set of SMART attributes.
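
The two-stage selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the GA fitness function, the significance score formula, or the dataset, so everything below is an assumption made for illustration. The fitness is a Fisher-style separability sum with a per-feature penalty, the significance score is a Welch-style z statistic with a hypothetical cutoff `z_cut`, and `make_toy_data` generates synthetic records in place of real SMART attributes. All function and parameter names are hypothetical.

```python
import math
import random
from statistics import mean, pvariance

def make_toy_data(n=200, seed=0):
    """Synthetic stand-in for SMART records: features 0 and 1 shift on failure, 2-5 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        failed = i % 2
        row = [rng.gauss(3.0 * failed, 1.0), rng.gauss(-2.0 * failed, 1.0)]
        row += [rng.gauss(0.0, 1.0) for _ in range(4)]
        X.append(row)
        y.append(failed)
    return X, y

def class_stats(X, y, j):
    """Per-class (mean, variance, count) of feature j for healthy (0) and failed (1) drives."""
    out = []
    for c in (0, 1):
        vals = [r[j] for r, lab in zip(X, y) if lab == c]
        out.append((mean(vals), pvariance(vals), len(vals)))
    return out

def fisher(X, y, j):
    """Assumed separability measure: squared mean gap over summed class variances."""
    (m0, v0, _), (m1, v1, _) = class_stats(X, y, j)
    return (m0 - m1) ** 2 / (v0 + v1 + 1e-12)

def significance(X, y, j):
    """Assumed significance score: absolute Welch-style z statistic of feature j."""
    (m0, v0, n0), (m1, v1, n1) = class_stats(X, y, j)
    return abs(m0 - m1) / math.sqrt(v0 / n0 + v1 / n1 + 1e-12)

def run_ga(n_feats, fit, pop_size=20, generations=30, p_mut=0.1, seed=1):
    """Tiny GA over bit masks: truncation selection, one-point crossover, bit-flip mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_feats)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fit, reverse=True)
        parents = pop[: pop_size // 2]  # the better half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_feats)
            child = a[:cut] + b[cut:]
            children.append([bit ^ (rng.random() < p_mut) for bit in child])
        pop = parents + children
    return max(pop, key=fit)

def train_nb(X, y, feats):
    """Gaussian naive Bayes: a log prior and per-feature (mean, variance) for each class."""
    model = {}
    for c in set(y):
        rows = [r for r, lab in zip(X, y) if lab == c]
        stats = [(mean([r[j] for r in rows]), pvariance([r[j] for r in rows]) + 1e-9)
                 for j in feats]
        model[c] = (math.log(len(rows) / len(X)), stats)
    return model

def predict_nb(model, feats, row):
    """Return the class with the highest log posterior for one record."""
    def log_post(c):
        prior, stats = model[c]
        return prior + sum(-0.5 * math.log(2 * math.pi * v) - (row[j] - m) ** 2 / (2 * v)
                           for j, (m, v) in zip(feats, stats))
    return max(model, key=log_post)

def select_and_train(X, y, penalty=0.5, z_cut=1.96):
    """Stage 1: GA picks a subset. Stage 2: keep only features with a high significance score."""
    n_feats = len(X[0])
    scores = [fisher(X, y, j) for j in range(n_feats)]

    def fit(mask):
        picked = [s - penalty for s, bit in zip(scores, mask) if bit]
        return sum(picked) if picked else float("-inf")

    mask = run_ga(n_feats, fit)
    stage1 = [j for j, bit in enumerate(mask) if bit]
    stage2 = [j for j in stage1 if significance(X, y, j) > z_cut]
    return stage2, train_nb(X, y, stage2)

X, y = make_toy_data()
features, model = select_and_train(X, y)
accuracy = sum(predict_nb(model, features, r) == lab for r, lab in zip(X, y)) / len(X)
print("selected features:", features, "training accuracy:", round(accuracy, 3))
```

The `penalty` and `z_cut` values exist only to make the toy example behave sensibly; the abstract states that the proposed method itself requires no parameter or threshold tuning, so these should not be read as part of the published approach. Accuracy here is measured on the training data, which is enough to show the pipeline running but is not a substitute for a held-out evaluation.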

Published

16.08.2023

How to Cite

Banavathu, B. K. P., & Rao, A. A. (2023). Utilization of Genetic Algorithm and Significance Scores for Feature Selection in the Interest of Increasing Accuracy of Fault Detection in Hard Disk Drives for HDFS. International Journal of Intelligent Systems and Applications in Engineering, 11(10s), 397–406. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/3294

Section

Research Article