Resilient GenAI Infrastructures: Patterns for Fail-Safe and Cost-Efficient Design
Keywords: Generative AI, LLM serving, resilience engineering, cost optimization, serverless, autoscaling, observability, microservices.

Abstract
Generative AI systems have moved from experimental deployments to business-critical services, but their infrastructure remains unusually fragile and expensive. Compared with conventional web applications, GenAI platforms face bursty demand, large GPU memory footprints, model-serving state such as KV caches and conversation context, heterogeneous hardware, and tight latency expectations. These properties make resilience and cost optimization inseparable design goals. This paper examines resilient GenAI infrastructure as a systems problem spanning orchestration, model serving, autoscaling, observability, and governance. Drawing on recent work from distributed ML, serverless systems, microservice reliability, LLM serving, and cloud cost optimization, it argues that the strongest designs are hybrid rather than monolithic: they combine fail-safe patterns such as graceful degradation, redundant placement, state externalization, admission control, and causal observability with cost levers such as heterogeneous model portfolios, memory-aware serving, predictive autoscaling, layered deployment artifacts, and burst absorption through serverless or elastic tiers. The paper proposes a reference architecture for production GenAI that prioritizes bounded failure, predictable latency, and measurable unit economics. It concludes that resilient GenAI infrastructure is best designed as a closed-loop control system where reliability mechanisms are chosen not merely to prevent outages, but to preserve acceptable service under degraded, cost-constrained conditions.
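Two of the fail-safe patterns named above, admission control and graceful degradation, compose naturally in a request router. The sketch below is illustrative only: the tier names, capacity limits, and class shapes are assumptions for the example, not artifacts from the paper. It routes each request to the largest model tier with spare in-flight capacity, degrades to a cheaper tier when the large one saturates, and sheds load once every tier is full, so failure stays bounded rather than cascading.

```python
from dataclasses import dataclass


@dataclass
class ModelTier:
    # Hypothetical tier definition; names and limits are illustrative.
    name: str
    max_inflight: int  # admission-control bound for this tier
    inflight: int = 0


class DegradingRouter:
    """Route requests to the best model tier that still has capacity.

    Combines admission control (a hard in-flight bound per tier) with
    graceful degradation (falling back to cheaper tiers instead of
    queuing unboundedly). When all tiers are saturated the request is
    shed, trading a fast explicit rejection for unbounded latency.
    """

    def __init__(self, tiers):
        self.tiers = tiers  # ordered best-quality first

    def admit(self):
        for tier in self.tiers:
            if tier.inflight < tier.max_inflight:
                tier.inflight += 1
                return tier.name
        return None  # load shedding: bounded failure, not collapse

    def release(self, name):
        for tier in self.tiers:
            if tier.name == name:
                tier.inflight -= 1
                return


router = DegradingRouter([
    ModelTier("llm-70b", max_inflight=2),
    ModelTier("llm-8b", max_inflight=4),
])

# Saturate the large tier; later requests degrade, then one is shed.
placements = [router.admit() for _ in range(7)]
print(placements)
# first two land on llm-70b, the next four on llm-8b, the last is shed
```

In a production system the static `max_inflight` bound would instead be driven by live signals such as GPU queue depth or KV-cache occupancy, closing the control loop the abstract describes.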
References
Xing, E. P., Ho, Q., Xie, P., & Wei, D. Y. Strategies and Principles of Distributed Machine Learning on Big Data. Engineering, 2(2), 179-195, 2016. DOI: 10.1016/j.eng.2016.02.008
Li, Z., Cheng, Y., et al. The Serverless Computing Survey: A Technical Primer for Design Architecture. ACM Computing Surveys, 54(10s), 2022. DOI: 10.1145/3508360
Hassan, H. B., Barakat, S. A., & Sarhan, Q. I. Survey on serverless computing. Journal of Cloud Computing, 10, 39, 2021. DOI: 10.1186/s13677-021-00253-7
Kirti, M., Maurya, A. K., & Yadav, R. S. Fault-tolerance approaches for distributed and cloud computing environments: A systematic review, taxonomy and future directions. Concurrency and Computation: Practice and Experience, 2024. DOI: 10.1002/cpe.8081
Sreekanti, V., et al. A fault-tolerance shim for serverless computing. In Proceedings of EuroSys 2020. DOI: 10.1145/3342195.3387535
Taherizadeh, S., & Stankovski, V. Dynamic Multi-level Auto-scaling Rules for Containerized Applications. The Computer Journal, 62(12), 2019. DOI: 10.1093/comjnl/bxy043
Toka, L., Dobreff, G., Fodor, B., & Sonkoly, B. Adaptive AI-based auto-scaling for Kubernetes. In CCGrid 2020. DOI: 10.1109/CCGrid49817.2020.00033
Gan, Y., et al. Sage: Practical and Scalable ML-Driven Performance Debugging in Microservices. In ASPLOS 2021. DOI: 10.1145/3445814.3446700
Li, M., et al. Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition. In KDD 2022. DOI: 10.1145/3534678.3539041
Wu, Y., Lentz, M., Zhuo, D., & Lu, Y. Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures. Proceedings of the VLDB Endowment, 16(2), 2022. DOI: 10.14778/3570690.3570692
Yang, Y., et al. INFless: A Native Serverless System for Low-Latency, High-Throughput Inference. In ASPLOS 2022. DOI: 10.1145/3503222.3507709
Ahmad, S., et al. Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling. In ASPLOS 2024. DOI: 10.1145/3617232.3624849
Yu, L., Lin, J., & Li, J. Stateful Large Language Model Serving with Pensieve. In EuroSys 2025. DOI: 10.1145/3689031.3696086
Gim, I., et al. Pie: A Programmable Serving System for Emerging LLM Applications. In SOSP 2025. DOI: 10.1145/3731569.3764814
Kim, M., et al. Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization. 2025. DOI: 10.1145/3695053.3731019
Song, Y., et al. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. In SOSP 2024. DOI: 10.1145/3694715.3695964
Prabhu, R., et al. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention. In ASPLOS 2025. DOI: 10.1145/3669940.3707256
Gu, L., Zeng, D., Hu, J., Jin, H., Guo, S., & Zomaya, A. Y. Exploring Layered Container Structure for Cost Efficient Microservice Deployment. In INFOCOM 2021. DOI: 10.1109/INFOCOM42981.2021.9488918
Fu, K., Zhang, W., Chen, Q., Zeng, D., & Guo, M. Adaptive Resource Efficient Microservice Deployment in Cloud-Edge Continuum. IEEE Transactions on Parallel and Distributed Systems, 33(8), 2022. DOI: 10.1109/TPDS.2021.3128037
Zhao, H., Deng, S., Liu, Z., Yin, J., & Dustdar, S. Distributed Redundant Placement for Microservice-based Applications at the Edge. IEEE Transactions on Services Computing, 15(4), 2022. DOI: 10.1109/TSC.2020.3013600
Pallewatta, S., Kostakos, V., & Buyya, R. Reliability-Aware Proactive Placement of Microservices-Based IoT Applications in Fog Computing Environments. IEEE Transactions on Mobile Computing, 2024. DOI: 10.1109/TMC.2024.3394486
Barrak, A., Petrillo, F., & Jaafar, F. Serverless on Machine Learning: A Systematic Mapping Study. IEEE Access, 10, 2022. DOI: 10.1109/ACCESS.2022.3206366
Chahal, D., Mishra, M., Palepu, S. C., & Singhal, R. Performance and Cost Comparison of Cloud Services for Deep Learning Workload. In ICPE Companion 2021. DOI: 10.1145/3447545.3451184
Chahal, D., Mishra, M., Palepu, S., & Singhal, R. Pay-as-you-Train: Efficient Ways of Serverless Training. In IC2E 2022. DOI: 10.1109/IC2E55432.2022.00020
Jayaram, K. R., et al. FfDL: A Flexible Multi-tenant Deep Learning Platform. In Middleware 2019. DOI: 10.1145/3361525.3361538
Tirumalasetty, P. A data-driven modular framework for predicting single-cell DNA methylation landscapes. Membrane Technology, 2024(4), 123–134, 2024.
Tirumalasetty, P. Prediction of car prices: Linear regression with multiple variables. University of Michigan–Dearborn Canvas, 2021. https://canvas.umd.umich.edu/courses/524480/discussion_topics/201956
Tirumalasetty, P. The greener, the better. University of Michigan–Dearborn Canvas, 2022. https://canvas.umd.umich.edu/courses/524480/discussion_topics/2019561
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Under this license, readers may share and adapt the material provided they give appropriate credit, link to the license, and indicate if changes were made; any remixed, transformed, or built-upon material must be distributed under the same license as the original.


