Designing Intelligent Data Engineering Architectures for Automated Data Cleansing and Enrichment

Authors

  • Camus Jean-Paul, Data Quality & Automation Engineer, USA

Keywords:

Intelligent Data Engineering, Automated Data Cleansing, Data Enrichment, Machine Learning Pipelines, Data Quality Architecture

Abstract

The exponential growth of data volume, variety, and velocity has made manual data preparation a critical bottleneck in analytics and machine learning pipelines. This paper explores the design of intelligent data engineering architectures that leverage machine learning (ML) and automation to perform scalable and accurate data cleansing and enrichment. We propose a layered, modular architecture that integrates rule-based systems with ML models for tasks such as anomaly detection, entity resolution, and semantic enrichment. The discussion includes a review of existing methodologies, a detailed blueprint for system design, and an analysis of key implementation challenges and performance metrics. The proposed framework aims to significantly reduce time-to-insight and improve data quality for downstream applications.
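As a rough illustration of how a rule-based layer and a statistical layer might be chained, the following minimal sketch flags records in two passes. It is not the architecture proposed in the paper: the field names (`email`, `amount`), the function names, and the use of a modified z-score based on the median absolute deviation as a stand-in for a trained anomaly detection model are all assumptions made for this example.

```python
from statistics import median


def rule_layer(records):
    """Rule-based layer: attach a list of hard-constraint violations to each record."""
    checked = []
    for r in records:
        issues = []
        # Hypothetical rule: an email must be present and contain "@".
        if not r.get("email") or "@" not in r["email"]:
            issues.append("invalid_email")
        # Hypothetical rule: an amount must be present and non-negative.
        if r.get("amount") is None or r["amount"] < 0:
            issues.append("negative_or_missing_amount")
        checked.append({**r, "issues": issues})
    return checked


def anomaly_layer(records, field="amount", threshold=3.5):
    """Statistical layer: flag values whose modified z-score exceeds the threshold.

    The modified z-score (0.6745 * |x - median| / MAD) is a simple stand-in
    for an ML anomaly detector, chosen here only to keep the sketch dependency-free.
    """
    values = [r[field] for r in records if r.get(field) is not None]
    if not values:
        return records
    med = median(values)
    mad = median(abs(v - med) for v in values)
    flagged = []
    for r in records:
        issues = list(r.get("issues", []))
        v = r.get(field)
        if v is not None and mad > 0:
            score = 0.6745 * abs(v - med) / mad
            if score > threshold:
                issues.append("anomalous_" + field)
        flagged.append({**r, "issues": issues})
    return flagged


def run_pipeline(records):
    """Chain the layers: rules first, then the statistical check."""
    return anomaly_layer(rule_layer(records))


records = [
    {"email": "a@x.org", "amount": 10},
    {"email": "bad", "amount": 12},
    {"email": "c@x.org", "amount": 11},
    {"email": "d@x.org", "amount": 13},
    {"email": "e@x.org", "amount": 9},
    {"email": "f@x.org", "amount": 500},
]
for row in run_pipeline(records):
    print(row)
```

Keeping each layer as a function that takes and returns a list of records is one way to get the modularity the abstract describes: a layer can be replaced, for instance by swapping the statistical check for a learned model, without changing the others.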

Published

2026-02-10