AI-Driven Incident Management in Microservices: A Scalable and Cost-Effective Framework for Proactive Site Reliability

Authors

  • Sunil Agarwal, Software Engineering Technical Lead, USA

DOI:

https://doi.org/10.63397/ISCSITR-IJCSE_2025_06_04_002

Keywords:

Microservices, AI, Cost, Scalability

Abstract

This paper develops a scalable, AI-backed framework for incident management in microservices-based architectures, with the aim of promoting proactive site reliability. By combining anomaly detection, intelligent alert correlation, and autonomous remediation, the system reduces Mean Time to Resolution (MTTR) as well as the recurrence of incidents. Results from real-world deployments showed higher anomaly detection rates, lower operational overhead, and improved service availability. The framework uses multi-source telemetry, reinforcement learning, bots, and explainable AI models for decision support. It operates seamlessly across hybrid and multi-cloud environments, offering a cost-effective and intelligent approach to self-healing in modern enterprise IT systems.
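As a rough illustration of the three ideas named in the abstract (anomaly detection, alert correlation, and MTTR), the following Python sketch may help. It is not the paper's implementation: the rolling z-score detector, the time-gap grouping rule, and all function names, parameters, and sample values are assumptions chosen for brevity.

```python
from statistics import mean, stdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` standard
    deviations (a simple rolling z-score; assumed, not from the paper)."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


def correlate_alerts(alerts, gap_seconds=60):
    """Group (timestamp, service) alerts into incidents. An alert joins
    the current incident if it fires within `gap_seconds` of the
    previous alert; otherwise it starts a new incident."""
    incidents = []
    for ts, service in sorted(alerts):
        if incidents and ts - incidents[-1][-1][0] <= gap_seconds:
            incidents[-1].append((ts, service))
        else:
            incidents.append([(ts, service)])
    return incidents


def mttr(resolutions):
    """Mean Time to Resolution, in seconds, from
    (detected_at, resolved_at) timestamp pairs."""
    return mean(resolved - detected for detected, resolved in resolutions)


# Hypothetical latency samples (ms) with one obvious spike at index 6.
latencies = [100, 102, 98, 101, 99, 100, 450, 101]
spikes = detect_anomalies(latencies)

# Hypothetical alerts: two fire close together, one fires much later.
groups = correlate_alerts([(0, "cart"), (30, "payments"), (200, "search")])

# Two hypothetical incidents resolved after 300 s and 600 s.
average_resolution = mttr([(0, 300), (100, 700)])
```

A production system would likely replace each piece with something richer (learned detectors over multi-source telemetry, topology-aware correlation), but the sketch shows how the quantities mentioned in the abstract relate: fewer, better-grouped incidents and faster resolution both lower the MTTR figure.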

Published

2025-07-18

How to Cite

AI-Driven Incident Management in Microservices: A Scalable and Cost-Effective Framework for Proactive Site Reliability. (2025). ISCSITR- INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND ENGINEERING (ISCSITR-IJCSE) - ISSN: 3067-7394, 6(4), 15-28. https://doi.org/10.63397/ISCSITR-IJCSE_2025_06_04_002