Optimizing Fault-Tolerance in Distributed Systems with AI-Augmented Replica Management

Authors

  • Kalesha Khan Pattan

Keywords:

Fault Tolerance, Distributed Systems, Reliability, Replication, Machine Learning, Prediction, Recovery, Scalability, Resilience, Optimization, Performance, Autonomy.

Abstract

Fault tolerance is a fundamental requirement in distributed systems: it ensures reliability, consistency, and continuous service availability despite hardware or software failures. Traditional replica management techniques, such as static replication and consensus-based recovery, provide basic fault resilience but are limited by fixed thresholds, redundant overhead, and slow adaptation to dynamic workloads and network variations. This research proposes an AI-augmented replica management framework that integrates machine learning and predictive analytics to make fault tolerance adaptive. The framework continuously monitors system metrics (node health, latency, throughput, and communication reliability) and uses learned models to predict node failures or performance degradation before they occur. Based on these predictions, it dynamically adjusts the replication factor, placement, and synchronization frequency of replicas to maintain service continuity while minimizing resource overhead. Reinforcement learning algorithms guide replica redistribution decisions by balancing fault coverage against cost in real time. Anomaly detection models identify early warning signs of hardware instability, resource contention, and network congestion, and deep learning models capture long-term behavioral patterns of nodes, enabling proactive fault prevention rather than reactive recovery. As a result, replica placement and recovery strategies evolve continuously as the system environment changes. A multi-objective optimization function balances reliability, latency, and energy efficiency. Simulation results on large-scale distributed clusters show that the AI-based model significantly improves recovery time, prediction accuracy, and system throughput compared with conventional replica management methods.
This research contributes an intelligent, self-healing replication framework that enhances resilience, scalability, and autonomy in distributed architectures, enabling proactive fault management and optimal resource utilization across modern cloud, edge, and IoT ecosystems.
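
The abstract does not give the form of the multi-objective function, so the following is only an illustrative sketch of the general idea, not the author's implementation. It assumes that replicas fail independently with a predicted probability p_fail, that latency and energy grow linearly with the number of replicas, and that the three objectives are combined as a weighted sum. All names (Weights, replica_cost, choose_replication_factor) and all numeric weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    # Hypothetical trade-off weights; the source gives no values.
    reliability: float = 1000.0
    latency: float = 1.0
    energy: float = 1.0


def replica_cost(p_fail: float, replicas: int, w: Weights) -> float:
    """Weighted-sum cost of keeping `replicas` copies of a data item.

    Assumptions (not from the source): replica failures are independent,
    synchronization latency grows with each extra replica, and energy
    use is proportional to the replica count.
    """
    loss_probability = p_fail ** replicas   # all replicas fail together
    latency_units = replicas - 1            # extra copies to synchronize
    energy_units = replicas                 # one unit per stored copy
    return (w.reliability * loss_probability
            + w.latency * latency_units
            + w.energy * energy_units)


def choose_replication_factor(p_fail: float,
                              w: Weights = Weights(),
                              min_replicas: int = 2,
                              max_replicas: int = 7) -> int:
    """Return the replica count in [min_replicas, max_replicas] with lowest cost."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be a probability in [0, 1]")
    candidates = range(min_replicas, max_replicas + 1)
    return min(candidates, key=lambda r: replica_cost(p_fail, r, w))


if __name__ == "__main__":
    # A failure predictor would supply p_fail; fixed values are used here.
    for p in (0.01, 0.1, 0.3):
        print(f"predicted failure probability {p}: "
              f"{choose_replication_factor(p)} replicas")
```

Under these assumptions, a higher predicted failure probability pushes the chosen replication factor up, while the latency and energy terms keep it from growing without bound. A reinforcement learning controller, as described in the abstract, would replace this one-shot search with a policy learned from observed outcomes.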

Published

28.02.2021

How to Cite

Kalesha Khan Pattan. (2021). Optimizing Fault-Tolerance in Distributed Systems with AI-Augmented Replica Management. International Journal of Intelligent Systems and Applications in Engineering, 9(1), 139–160. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/7941

Section

Research Article