Automating Extract, Transform, Load (ETL) Pipelines using Machine Learning Triggered Workflow Optimization

Authors

  • Samyukta Rongala, Godavari Modalavalasa

Keywords:

Data Integration, Data Engineering Solutions, Data Processing, Extract, Transform, Load (ETL) Pipeline Automation, Machine Learning, Workflow Optimization

Abstract

The growing data processing demands of contemporary firms underline the need for better methods of automating ETL processes. This paper presents a machine learning framework that automates most of the ETL process, thereby reducing the number of steps performed manually. The framework applies machine learning techniques to improve the efficiency of data extraction, data transformation rules, and data loading across heterogeneous systems. For data quality, it uses anomaly detection models that reach a 95% detection rate, and it uses probabilistic imputation to limit data loss to 1%, an 80% improvement over traditional methods. Dynamic algorithms raise the component recognition rate to about 98%, enabling the harmonization of dissimilar datasets. In the performance evaluation, the proposed approach saved an average of 36.49% of total ETL time and 40% of overall transformation time. Scalability tests confirm a consistent 37%-40% reduction in processing time for datasets of between 1 million and 10 million records. These results demonstrate the value of the proposed framework for shortening development cycles, reducing development costs, and scaling efficiently for data-intensive applications. The research aims to capture the transformative role of machine learning in ETL operations and to present solutions to the complexities currently encountered in data engineering.
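
The abstract does not name the specific models used, so the following is only a minimal sketch of how an anomaly detection step and a probabilistic imputation step might sit inside a transform stage. It makes several assumptions that are not taken from the paper: a median/MAD modified z-score stands in for the anomaly detection model, sampling from the empirical distribution of trusted values stands in for probabilistic imputation, and the function names `flag_anomalies`, `impute_missing`, and `transform` are hypothetical.

```python
import random
import statistics


def flag_anomalies(values, threshold=3.5):
    """Return indices whose modified z-score (median/MAD based) exceeds threshold.

    Assumption: a robust z-score is used here as a simple stand-in for the
    paper's unspecified anomaly detection models.
    """
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    mad = statistics.median([abs(v - median) for v in observed])
    if mad == 0:
        # All observed values are (nearly) identical; nothing can be scored.
        return []
    return [
        i
        for i, v in enumerate(values)
        if v is not None and 0.6745 * abs(v - median) / mad > threshold
    ]


def impute_missing(values, seed=0):
    """Fill None entries by sampling from the observed (non-missing) values.

    Assumption: drawing from the empirical distribution is one simple form of
    probabilistic imputation; the paper may use a different model.
    """
    rng = random.Random(seed)
    pool = [v for v in values if v is not None]
    return [rng.choice(pool) if v is None else v for v in values]


def transform(values):
    """Hypothetical transform step: mask anomalies, then impute all gaps."""
    anomalies = flag_anomalies(values)
    masked = [None if i in anomalies else v for i, v in enumerate(values)]
    return impute_missing(masked), anomalies


# Example column with one missing reading (None) and one implausible reading.
readings = [10.0, 11.0, None, 9.5, 10.5, 250.0, 10.2]
filled, anomalies = transform(readings)
```

In this sketch the anomalous value is masked before imputation so that it cannot be drawn as a replacement, which keeps the two steps from undermining each other. A production pipeline would more likely use a trained detector and a model-based imputer.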

Published

05.04.2024

How to Cite

Samyukta Rongala. (2024). Automating Extract, Transform, Load (ETL) Pipelines using Machine Learning Triggered Workflow Optimization. International Journal of Intelligent Systems and Applications in Engineering, 12(3), 4427–4434. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/7193

Section

Research Article