Architecting Scalable Feature Engineering Pipelines through Automated Machine Learning and Data Mining Techniques in Heterogeneous Data Ecosystems
Keywords:
AutoML, Feature Engineering, Data Mining, Scalable Pipelines, Heterogeneous Data, High-Dimensional Data, ML Automation, Feature Transformation
Abstract
Feature engineering remains a critical and resource-intensive phase of the machine learning (ML) lifecycle, especially within large-scale, heterogeneous data ecosystems. This paper investigates how automated machine learning (AutoML) and data mining techniques can be systematically orchestrated to build scalable, adaptive feature engineering pipelines. We synthesize the existing literature and introduce architectural strategies that ensure both computational scalability and semantic alignment across disparate data sources. Flowcharts and tabular summaries illustrate the challenges and solutions involved in constructing robust, automated feature transformation pipelines. Our findings suggest that integrating AutoML with knowledge-driven feature selection improves model performance and generalization across diverse domains.
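The pipeline pattern the abstract describes — automated feature generation followed by knowledge-driven pruning before model fitting — can be sketched in miniature with scikit-learn. This is an illustrative toy, not the paper's architecture: the generated polynomial features, the mutual-information filter, and the synthetic dataset are all stand-in choices.

```python
# Minimal sketch of an automated feature engineering pipeline:
# (1) mechanically expand the candidate feature space, then
# (2) prune it with an information-theoretic filter before modeling.
# All component choices here are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for a heterogeneous tabular source.
X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=3, random_state=0)

pipeline = Pipeline([
    # Automated candidate generation: degree-2 interaction terms.
    ("generate", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    # Pruning step: keep the 10 features most informative about y.
    ("select", SelectKBest(mutual_info_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Wrapping generation and selection inside one `Pipeline` matters for scalability claims: the selection step is re-fit within each cross-validation fold, so the evaluation does not leak information from held-out data.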
License
Copyright (c) 2023 Jinwoo Park (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.