Predictive Failure Detection in AI Datacenters Using BMC Telemetry Analytics
Keywords:
Predictive Failure Detection, BMC Telemetry, AI Datacentres, Machine Learning, Datacentre Reliability, Server Failure Prediction
Abstract
AI datacentres support services such as machine learning training, cloud computing, and large-scale data processing. Because these systems comprise thousands of servers and hardware components, a failure can occur at any moment, disrupting service, wasting resources, and raising operating costs. Potential failures should therefore be identified early to keep datacentre operations running reliably. This research proposes a predictive failure detection technique for AI datacentres that combines Baseboard Management Controller (BMC) telemetry analytics with machine learning to support predictive maintenance. Servers are monitored to collect telemetry on CPU temperature, GPU temperature, power consumption, fan speed, and voltage level, and the data are analyzed to detect abnormal behavior ahead of failure. After preprocessing and feature extraction, three machine learning models (Logistic Regression, Support Vector Machine, and Random Forest) are trained on the resulting features. Experiments indicate that the Random Forest model is the most effective, with an accuracy of 89.7%, precision of 86.3%, recall of 88.1%, and F1-score of 87.2%. The results also show that telemetry features such as GPU temperature and power consumption are good indicators of system instability. The approach demonstrates that BMC telemetry analytics can substantially improve predictive monitoring and reliability in contemporary AI datacentres.
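As a reading aid for the metrics quoted in the abstract, the following minimal sketch shows how accuracy, precision, recall, and F1-score are conventionally derived from confusion-matrix counts. It is not the authors' evaluation code: the function names `f1_from` and `classification_metrics` and the example counts are hypothetical, chosen only for illustration. The final lines check that the reported F1-score (87.2) is consistent with the harmonic mean of the reported precision (86.3) and recall (88.1).

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp: failures correctly predicted     fp: healthy servers flagged as failing
    fn: failures that were missed        tn: healthy servers correctly passed
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Hypothetical counts, for illustration only (not taken from the paper).
example = classification_metrics(tp=80, fp=10, fn=20, tn=890)
print(example)

# Consistency check on the values reported in the abstract.
reported_f1 = f1_from(86.3, 88.1)
print(round(reported_f1, 1))  # 87.2
```

In a failure-prediction setting, recall reflects how many real failures are caught and precision reflects how many alerts are genuine, so the two are usually read together rather than relying on accuracy alone.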
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Under this license, users must give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.


