Monitoring and scaling GPU workloads in production with Nvidia DCGM and Prometheus
DOI: https://doi.org/10.63397/ISCSITR-IJSRAIML_05_02_002

Keywords: GPU monitoring, Nvidia DCGM, Prometheus, Kubernetes, workload scaling, performance optimization, real-time telemetry, cluster management, alerting systems, high-performance computing

Abstract
The paper introduces a comprehensive framework for monitoring and dynamically scaling GPU workloads within production-grade computing environments, leveraging the capabilities of the Nvidia Data Center GPU Manager (DCGM) in conjunction with the Prometheus monitoring and alerting toolkit. The proposed architecture is designed to provide granular visibility into GPU performance by collecting a diverse range of telemetry data, including core utilization rates, memory allocation and consumption patterns, power draw, and thermal metrics. This fine-grained data capture enables detailed operational insight, allowing for both reactive and proactive system management.
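As a concrete illustration of the telemetry layer described above, the following Python sketch pulls the default dcgm-exporter fields (GPU utilization, framebuffer memory use, power draw, and temperature) out of a Prometheus server over its HTTP query API. The Prometheus address and label names are assumptions about a particular deployment, not details taken from the paper.

```python
# Minimal sketch (assumed deployment details): read DCGM telemetry that the
# dcgm-exporter publishes to Prometheus, using the Prometheus HTTP query API.
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # assumed in-cluster address

# Default metric names exposed by NVIDIA's dcgm-exporter.
DCGM_METRICS = [
    "DCGM_FI_DEV_GPU_UTIL",     # core utilization (%)
    "DCGM_FI_DEV_FB_USED",      # framebuffer memory used (MiB)
    "DCGM_FI_DEV_POWER_USAGE",  # power draw (W)
    "DCGM_FI_DEV_GPU_TEMP",     # GPU temperature (°C)
]


def query_instant(metric: str) -> list[dict]:
    """Run an instant PromQL query and return the per-GPU result vector."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": metric},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]


if __name__ == "__main__":
    for metric in DCGM_METRICS:
        for series in query_instant(metric):
            labels = series["metric"]          # label names assume a default dcgm-exporter setup
            _, value = series["value"]
            print(f"{metric} gpu={labels.get('gpu')} node={labels.get('Hostname')} -> {value}")
```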
The collected telemetry is systematically processed and stored in a high-performance time-series database, facilitating continuous, low-latency analysis of workload behavior over time. This design choice ensures that monitoring remains scalable and responsive under production workloads, while supporting sophisticated querying and visualization for operational decision-making. Integration with modern orchestration frameworks—particularly Kubernetes—allows the system to feed live telemetry into automated scaling logic, thereby enabling the cluster to adjust GPU resource allocation in near real time based on evolving workload demands. This closed-loop feedback mechanism ensures that the system can maintain optimal utilization levels, mitigate performance bottlenecks, and uphold service-level objectives without manual intervention.
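A minimal sketch of such a closed-loop, threshold-based scaler is shown below: it averages DCGM GPU utilization from Prometheus and adjusts the replica count of a GPU-backed Kubernetes Deployment through the official Python client. The deployment name, namespace, thresholds, and polling interval are illustrative assumptions rather than values reported in the paper, and the controller is assumed to run in-cluster with permission to patch the Deployment's scale subresource.

```python
# Minimal sketch of a reactive, threshold-based scaling loop (assumed names
# and thresholds): read cluster-wide GPU utilization from Prometheus and
# nudge the replica count of a GPU-backed Deployment accordingly.
import time

import requests
from kubernetes import client, config

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"   # assumed
DEPLOYMENT, NAMESPACE = "gpu-inference", "default"          # assumed workload
SCALE_UP_UTIL, SCALE_DOWN_UTIL = 80.0, 30.0                 # illustrative thresholds (%)
MIN_REPLICAS, MAX_REPLICAS = 1, 8


def mean_gpu_util() -> float:
    """Average DCGM_FI_DEV_GPU_UTIL across all GPUs reporting to Prometheus."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": "avg(DCGM_FI_DEV_GPU_UTIL)"},
        timeout=5,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def main() -> None:
    config.load_incluster_config()          # assumes the controller runs inside the cluster
    apps = client.AppsV1Api()
    while True:
        util = mean_gpu_util()
        scale = apps.read_namespaced_deployment_scale(DEPLOYMENT, NAMESPACE)
        replicas = scale.spec.replicas
        if util > SCALE_UP_UTIL and replicas < MAX_REPLICAS:
            replicas += 1                   # sustained pressure: add a replica
        elif util < SCALE_DOWN_UTIL and replicas > MIN_REPLICAS:
            replicas -= 1                   # idle capacity: release a replica
        apps.patch_namespaced_deployment_scale(
            DEPLOYMENT, NAMESPACE, {"spec": {"replicas": replicas}}
        )
        time.sleep(30)                      # polling granularity is itself a tuning knob


if __name__ == "__main__":
    main()
```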
The experimental evaluation, conducted within a Kubernetes-managed GPU cluster, demonstrates that the proposed monitoring and scaling strategy consistently sustains high GPU utilization while effectively responding to operational anomalies and threshold-based alerts. The results highlight its ability to detect and address potential performance degradations before they impact end users. The study also provides valuable implementation-level insights for practitioners, outlining practical considerations such as network overhead, monitoring granularity trade-offs, and the tuning of alerting thresholds to balance responsiveness with stability.
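The responsiveness-versus-stability trade-off in alert tuning can be made concrete with a small sketch: a threshold alert that fires only when the breach persists for a sustained window, so transient utilization spikes do not trigger pages or scaling actions. The threshold and window length below are illustrative assumptions, not figures from the evaluation.

```python
# Minimal sketch (assumed threshold and window): a threshold alert that only
# fires after the breach has been continuous for a configured hold period.
from collections import deque
from dataclasses import dataclass, field
import time


@dataclass
class SustainedThresholdAlert:
    threshold: float                         # e.g. 90.0 (% GPU utilization)
    hold_seconds: float                      # how long the breach must persist before firing
    samples: deque = field(default_factory=deque)

    def observe(self, value: float, now: float | None = None) -> bool:
        """Record a sample; return True if the alert should fire."""
        now = time.time() if now is None else now
        if value <= self.threshold:
            self.samples.clear()             # breach interrupted: reset the window
            return False
        self.samples.append(now)
        # Fire only once the breach has lasted at least hold_seconds.
        return now - self.samples[0] >= self.hold_seconds


if __name__ == "__main__":
    alert = SustainedThresholdAlert(threshold=90.0, hold_seconds=120.0)
    for t, util in [(0, 95.0), (60, 96.0), (120, 97.0)]:
        print(t, alert.observe(util, now=float(t)))   # fires at t=120
```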
The authors acknowledge certain limitations inherent in the current system, including constraints in predictive capability and hardware compatibility. The system’s scaling logic is presently reactive, relying on predefined thresholds, and does not yet incorporate predictive models capable of anticipating future demand surges. Furthermore, its hardware coverage is largely limited to Nvidia GPUs, reducing applicability in heterogeneous computing environments. The paper concludes by outlining future work aimed at addressing these gaps, such as incorporating predictive scaling algorithms, extending hardware support beyond Nvidia platforms, and exploring integration with advanced scheduling and workload placement strategies to further optimize GPU utilization in large-scale deployments.