Data Cataloging and Lineage Tracking in Data Engineering Ecosystems

Authors

  • REDDY AKSHARA

Keywords:

Data Cataloging, Data Lineage, Data Engineering, Metadata Management, Data Governance, Data Discovery, Automated Metadata Extraction, Data Quality, Data Compliance, Cloud Data Management

Abstract

As organizations increasingly rely on data-driven decision-making, effective data cataloging and lineage tracking have become critical to modern data engineering ecosystems. Data cataloging provides a centralized inventory of data assets, which improves discoverability and promotes data reuse, while lineage tracking provides transparency by documenting the data's journey through complex workflows. This paper explores the architecture and implementation of data cataloging and lineage tracking systems, highlighting the role of metadata management, governance frameworks, and automated tools. We examine the challenges of scalability, integration with diverse data sources, and keeping lineage records accurate, especially in distributed and cloud-based environments. The paper also discusses the application of AI and machine learning techniques to automated metadata extraction and lineage detection. Case studies demonstrate the effectiveness of these systems in improving data quality, regulatory compliance, and operational efficiency, underscoring their value in enabling more robust and trustworthy data engineering processes.
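
To make the two core ideas in the abstract concrete, the following is a minimal sketch of a catalog as a centralized inventory of assets and of lineage as directed edges between those assets. It is not the system described in the paper: the class names (DataAsset, Catalog), the method names, and the sample asset names are all hypothetical, and a production catalog would persist metadata and typically track lineage at a finer grain than whole datasets.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class DataAsset:
    """One catalog entry: a named data asset plus descriptive metadata."""
    name: str
    owner: str
    tags: set = field(default_factory=set)


class Catalog:
    """Hypothetical in-memory catalog with dataset-level lineage edges."""

    def __init__(self):
        self.assets = {}                 # asset name -> DataAsset
        self.parents = defaultdict(set)  # asset name -> direct upstream names

    def register(self, asset):
        """Add an asset to the central inventory."""
        self.assets[asset.name] = asset

    def add_lineage(self, source, target):
        """Record that `target` is derived from `source`."""
        for name in (source, target):
            if name not in self.assets:
                raise KeyError(f"unregistered asset: {name}")
        self.parents[target].add(source)

    def search(self, tag):
        """Discovery: sorted names of all assets carrying `tag`."""
        return sorted(a.name for a in self.assets.values() if tag in a.tags)

    def upstream(self, name):
        """Lineage: every asset that `name` depends on, directly or indirectly."""
        seen, queue = set(), deque([name])
        while queue:
            for parent in self.parents[queue.popleft()]:
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return seen


# Illustrative data only; these asset names are invented for the example.
catalog = Catalog()
catalog.register(DataAsset("raw_orders", "ingest-team", {"sales"}))
catalog.register(DataAsset("clean_orders", "data-eng", {"sales", "curated"}))
catalog.register(DataAsset("revenue_report", "analytics", {"curated"}))
catalog.add_lineage("raw_orders", "clean_orders")
catalog.add_lineage("clean_orders", "revenue_report")

print(catalog.search("curated"))
print(catalog.upstream("revenue_report"))
```

Here search stands in for catalog-driven discovery, and upstream answers the basic lineage question of which sources a given report ultimately depends on. The breadth-first traversal was chosen for simplicity; it assumes the lineage graph fits in memory.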

Published

2024-08-18