Assessing the Role of Scalable Machine Learning Architectures in Enhancing Predictive Accuracy and Decision Latency in Real-Time Big Data Analytics Environments

Authors

  • Anna Svensson, BI Performance Analyst, Sweden

Keywords:

Scalable Machine Learning, Predictive Accuracy, Decision Latency, Big Data Analytics, Real-Time Processing, Stream Processing, Distributed ML Systems

Abstract

In the era of high-velocity, high-volume data streams, real-time analytics plays a crucial role in enabling timely and accurate decision-making. This paper examines how scalable machine learning (ML) architectures affect predictive accuracy and decision latency in big data environments. Through a comparative analysis of distributed ML systems such as Apache Spark MLlib, TensorFlow Extended (TFX), and FlinkML, it explores how architectural decisions affect performance metrics across a range of real-time analytical applications. The study combines findings from prior work with experimental benchmarks on stream-based datasets. The results suggest that architectural scalability significantly improves predictive accuracy while keeping latency low as data loads increase, which makes scalable architectures a strong fit for next-generation intelligent analytics platforms.

Published

2023-07-14