Predictive Failure Detection in AI Datacenters Using BMC Telemetry Analytics

Authors

  • Seshadri Ravikiran Vedula

Keywords:

Predictive Failure Detection, BMC Telemetry, AI Datacentres, Machine Learning, Datacentre Reliability, Server Failure Prediction.

Abstract

AI datacentres support services such as machine learning training, cloud computing, and large-scale data processing. Because these systems comprise thousands of servers and hardware components, failures can occur at any moment, disrupting services, wasting resources, and raising operating costs. Potential failures should therefore be identified early to keep datacentre operations running smoothly. This research proposes a predictive failure detection technique for AI datacentres that combines Baseboard Management Controller (BMC) telemetry analytics with machine learning to support predictive maintenance. Servers are monitored to collect telemetry data on CPU temperature, GPU temperature, power consumption, fan speed, and voltage level, which are analyzed to detect abnormal behavior ahead of failure. After preprocessing and feature extraction, Logistic Regression, Support Vector Machine, and Random Forest models are trained on the resulting features. The experiments indicate that the Random Forest model is the most effective, with an accuracy of 89.7%, precision of 86.3%, recall of 88.1%, and F1-score of 87.2%. The results also show that telemetry features such as GPU temperature and power consumption are good indicators of system instability. The approach demonstrates that BMC telemetry analytics can substantially improve predictive monitoring and reliability in contemporary AI datacentres.
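As a quick consistency check on the metrics reported above, the F1-score can be recomputed from the stated precision and recall, since F1 is their harmonic mean. The snippet below is a minimal sketch, not the author's evaluation code: the function name `f1_score` is chosen here for illustration, and the reported values are assumed to be percentages.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for the Random Forest model,
# read as percentages (an assumption; the abstract gives no unit).
reported_precision = 86.3
reported_recall = 88.1
reported_f1 = 87.2

computed_f1 = f1_score(reported_precision, reported_recall)
print(f"computed F1 = {computed_f1:.1f}, reported F1 = {reported_f1}")
# prints: computed F1 = 87.2, reported F1 = 87.2
```

The recomputed value agrees with the reported F1-score to one decimal place, so the three figures are mutually consistent.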

Published

25.09.2023

How to Cite

Seshadri Ravikiran Vedula. (2023). Predictive Failure Detection in AI Datacenters Using BMC Telemetry Analytics. International Journal of Intelligent Systems and Applications in Engineering, 11(4), 1123–1132. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/8297

Section

Research Article