Real-Time Performance Monitoring for Deep Learning Models in Production
Keywords: Real-Time Monitoring, Deep Learning Inference, Performance Metrics, Production AI Systems, Model Observability

Abstract
As deep learning models continue to move into production, ensuring that they operate reliably in real-time settings has become essential. Inference latency, throughput, and system resource utilisation directly affect user experience, service reliability, and operating cost. Production environments introduce inconsistencies in input data, fluctuating workloads, and environmental variation, so real-time monitoring is needed to track performance and preserve system efficiency, quality, and integrity. Real-time monitoring tools provide continuous insight into the behaviour of deployed deep learning models across a wide variety of production settings: cloud-hosted services, on-premises data centres, and edge devices. These tools measure critical parameters such as CPU and GPU utilisation, memory usage, inference latency, and model response times. They also support anomaly detection, flagging deviations from expected behaviour that may indicate model drift, resource contention, or system bottlenecks. Contemporary technologies such as Prometheus, Grafana, NVIDIA Triton Inference Server, and proprietary observability platforms are increasingly integrated into ML pipelines to visualise performance metrics and raise alerts on demand. Beyond easing diagnosis, these tools guide automated scaling, load balancing, and model retraining. Real-time monitoring also helps guarantee that services meet their service-level agreements (SLAs), particularly in mission-critical applications such as medical diagnosis, securities trading, and autonomous systems. Integrating real-time performance monitoring into the deep learning deployment life cycle enables continuous optimisation, greater operational visibility, and proactive fault mitigation.
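The latency tracking, SLA checking, and anomaly detection described above can be sketched with a minimal, self-contained monitor. The window size, the 95th-percentile SLO target, and the 3-sigma anomaly rule are illustrative assumptions for this sketch, not the API of any particular tool; a production system would typically export such metrics to Prometheus and visualise them in Grafana rather than compute them in-process.

```python
import random
import statistics
from collections import deque


class LatencyMonitor:
    """Hypothetical sketch of a rolling inference-latency monitor.

    Window size, SLO target, and the 3-sigma rule are illustrative
    choices, not the behaviour of any specific monitoring product.
    """

    def __init__(self, window: int = 1000, slo_ms: float = 100.0):
        self.samples = deque(maxlen=window)  # rolling window of latencies (ms)
        self.slo_ms = slo_ms                 # p95 latency target from the SLA

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def p95(self) -> float:
        # 95th-percentile latency over the rolling window
        ordered = sorted(self.samples)
        return ordered[int(0.95 * (len(ordered) - 1))]

    def slo_breached(self) -> bool:
        return self.p95() > self.slo_ms

    def is_anomaly(self, latency_ms: float) -> bool:
        # Simple 3-sigma rule against the rolling window
        if len(self.samples) < 30:
            return False  # not enough data to estimate the distribution
        mu = statistics.fmean(self.samples)
        sigma = statistics.stdev(self.samples)
        return sigma > 0 and abs(latency_ms - mu) > 3 * sigma


random.seed(0)
mon = LatencyMonitor(slo_ms=100.0)
for _ in range(500):
    mon.record(random.gauss(40.0, 5.0))  # simulated normal traffic, ~40 ms

print(f"p95={mon.p95():.1f} ms, SLO breached={mon.slo_breached()}")
print("200 ms spike anomalous:", mon.is_anomaly(200.0))
```

The same signals (p95 over SLO, repeated anomalies) are what would drive the alerting and automated-scaling decisions mentioned above.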
This ultimately supports the delivery of scalable, reliable, and cost-efficient AI services. The paper therefore discusses how integrating workload modelling and bottleneck feedback loops into hardware/software co-design helps manage design uncertainty and improve system adaptability.
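The feedback-loop idea, in which a measured bottleneck signal drives an automated scaling decision, can be illustrated with one hypothetical controller step. The thresholds, step size, and replica bounds here are assumptions for the sketch; real controllers such as the Kubernetes Horizontal Pod Autoscaler use a target-ratio formula rather than fixed increments.

```python
def autoscale(replicas: int, gpu_util: float,
              high: float = 0.80, low: float = 0.30,
              min_r: int = 1, max_r: int = 8) -> int:
    """One step of a utilisation-driven scaling feedback loop.

    Hypothetical thresholds and unit step size, chosen for
    illustration only.
    """
    if gpu_util > high and replicas < max_r:
        return replicas + 1  # scale out to relieve the bottleneck
    if gpu_util < low and replicas > min_r:
        return replicas - 1  # scale in to reduce cost
    return replicas


# Simulated loop: utilisation falls as load spreads over more replicas.
load = 3.0      # total offered work, in "replica-equivalents"
replicas = 1
for _ in range(10):
    util = min(1.0, load / replicas)  # per-replica utilisation, capped at 100%
    replicas = autoscale(replicas, util)

print("replicas settle at", replicas)
```

With these illustrative numbers the loop converges once per-replica utilisation drops inside the (low, high) band, which is the stabilising behaviour a monitoring-driven co-design feedback loop aims for.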
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.