Optimizing Fault-Tolerance in Distributed Systems with AI-Augmented Replica Management
Keywords:
Fault, Tolerance, Distributed, Systems, Reliability, Replication, Machine, Learning, Prediction, Recovery, Scalability, Resilience, Optimization, Performance, Autonomy.
Abstract
Fault tolerance is a fundamental requirement in distributed systems to ensure reliability, consistency, and continuous service availability despite hardware or software failures. Traditional replica management techniques, such as static replication and consensus-based recovery, provide basic fault resilience but are limited by fixed thresholds, redundant overhead, and slow adaptation to dynamic workloads or network variations. This research proposes an AI-augmented replica management framework that integrates machine learning and predictive analytics to enhance fault tolerance adaptively. The proposed approach continuously monitors system metrics—such as node health, latency, throughput, and communication reliability—and uses learning models to predict potential node failures or performance degradation before they occur. Based on these predictions, the system dynamically adjusts the replication factor, placement, and synchronization frequency of replicas to maintain service continuity while minimizing resource overhead. Reinforcement learning algorithms guide replica redistribution decisions by balancing fault coverage and cost efficiency in real time. In addition, the framework leverages anomaly detection models to identify early warning signs of hardware instability, resource contention, and network congestion. By employing deep learning techniques, the system learns long-term behavioral patterns of nodes, enabling proactive fault prevention rather than reactive recovery. The integration of AI ensures that replica placement and recovery strategies evolve continuously as the system environment changes. A multi-objective optimization function is used to strike a balance between reliability, latency, and energy efficiency. Simulation results on large-scale distributed clusters demonstrate that the AI-based model significantly improves recovery time, prediction accuracy, and system throughput when compared to conventional replica management methods. 
This research contributes an intelligent, self-healing replication framework that enhances resilience, scalability, and autonomy in distributed architectures, enabling proactive fault management and optimal resource utilization across modern cloud, edge, and IoT ecosystems.
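As a rough illustration of how predicted failure probabilities might drive the replication factor and how a multi-objective function might weigh reliability, latency, and energy, consider the minimal sketch below. It is not the framework's implementation: the function names, the independence assumption between node failures, the weights, and the normalization scales are all hypothetical and chosen only for illustration.

```python
def plan_replicas(failure_probs, target_loss, max_replicas):
    """Pick replica hosts, healthiest first, until the chance that all
    chosen hosts fail together drops to target_loss or below.

    Assumption (not from the paper): node failures are independent, so
    the joint failure probability is the product of per-node values.
    """
    ranked = sorted(failure_probs.items(), key=lambda kv: kv[1])
    chosen = []
    joint = 1.0
    for node, p_fail in ranked[:max_replicas]:
        chosen.append(node)
        joint *= p_fail
        if joint <= target_loss:
            break
    # If the target is unreachable within max_replicas, the best effort
    # placement is returned together with its joint failure probability.
    return chosen, joint


def weighted_cost(loss_prob, latency_ms, energy_w,
                  weights=(0.6, 0.3, 0.1),
                  scales=(1e-3, 100.0, 500.0)):
    """Scalarize three objectives into one cost (lower is better).

    The weights and scales are hypothetical; each term is divided by
    its scale so the objectives are roughly comparable before weighting.
    """
    terms = (loss_prob / scales[0],
             latency_ms / scales[1],
             energy_w / scales[2])
    return sum(w * t for w, t in zip(weights, terms))


if __name__ == "__main__":
    # Hypothetical per-node failure probabilities from a predictor.
    predicted = {"a": 0.01, "b": 0.2, "c": 0.05, "d": 0.5}
    hosts, joint = plan_replicas(predicted, target_loss=1e-3, max_replicas=3)
    print(hosts, joint)
    print(weighted_cost(joint, latency_ms=50.0, energy_w=250.0))
```

In this sketch the replication factor is simply the number of hosts returned, so it grows when the predictor reports less healthy nodes and shrinks when nodes look reliable. A weighted sum is only one way to combine the objectives; the abstract does not specify which form the framework uses.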
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Under this license, anyone who shares or adapts the material must give appropriate credit, provide a link to the license, and indicate if changes were made; anyone who remixes, transforms, or builds upon the material must distribute their contributions under the same license as the original.


