Generative AI for Data Engineering: A Seven-Stage Orchestration Framework for LLM-Powered Code Generation
Keywords:
Generative AI, Large Language Models, Code Generation, Data Engineering, Orchestration Framework, Enterprise Architecture
Abstract
Data engineering organizations face persistent productivity challenges: platform complexity and the need to work across multiple technologies, programming languages, and frameworks increase the effort required to develop data pipelines. Maintaining existing pipelines is difficult and costly as requirements change, platforms are modernized, and documentation lags behind implementation, complicating knowledge transfer and debugging. We propose a seven-stage orchestration architecture for applying LLMs to enterprise data engineering workflows, closing the gap between LLMs' theoretical code generation capabilities and their practical deployment in strictly regulated environments. The architecture governs the LLMs through a process of specification ingestion, retrieval-augmented generation (RAG), multi-stage code generation with semantic validation, automated documentation generation, multi-layer security scanning, confidence-gated human-in-the-loop review, CI/CD deployment, and feedback-driven continuous learning. We adopt enterprise guardrails such as data classification, metadata-only retrieval, generation scope limits, and immutable audit trails to ensure security, regulatory compliance, and assurance. Recent results in the code generation literature show that multi-turn synthesis, bidirectional context modeling, and human feedback can substantially improve generation effectiveness, and these findings inform our design choices. We present a comprehensive architecture for responsibly deploying LLMs in enterprise data engineering and plan to validate the approach through production deployment on banking data platforms.
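The staged flow described above can be sketched as a single orchestration function. This is a minimal illustration, not the paper's implementation: every function name, the 0.85 confidence threshold, the token-blocklist "security scan," and the toy SQL generator are assumptions introduced here to show how confidence gating, metadata-only retrieval, and an append-only audit trail fit together.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AuditTrail:
    """Append-only log standing in for the immutable-audit-trail guardrail."""
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, stage: str, detail: str) -> None:
        self.entries.append((stage, detail))

BANNED_TOKENS = ("DROP TABLE", "os.system")  # toy stand-in for multi-layer scanning
CONFIDENCE_GATE = 0.85                       # assumed human-review threshold

def ingest_specification(spec: dict) -> dict:
    # Stage 1: normalize the incoming pipeline specification.
    return {"task": spec["task"].strip().lower(), "source": spec["source"]}

def retrieve_context(spec: dict, catalog: dict) -> dict:
    # Stage 2: RAG restricted to metadata (schemas only, never row-level data).
    return {"schema": catalog.get(spec["source"], [])}

def generate_code(spec: dict, context: dict) -> Tuple[str, float]:
    # Stage 3: stand-in for the LLM call; returns code plus a confidence score.
    cols = ", ".join(context["schema"]) or "*"
    return f"SELECT {cols} FROM {spec['source']}", 0.9

def generate_docs(code: str) -> str:
    # Stage 4: automated documentation generation.
    return f"-- Auto-generated pipeline step:\n-- {code}"

def security_scan(code: str) -> bool:
    # Stage 5: multi-layer security scanning, reduced here to a blocklist.
    return not any(tok in code for tok in BANNED_TOKENS)

def needs_human_review(confidence: float) -> bool:
    # Stage 6: confidence-gated human-in-the-loop review.
    return confidence < CONFIDENCE_GATE

def run_pipeline(raw_spec: dict, catalog: dict, audit: AuditTrail) -> dict:
    spec = ingest_specification(raw_spec)
    audit.record("ingest", spec["task"])
    context = retrieve_context(spec, catalog)
    audit.record("retrieve", str(context["schema"]))
    code, confidence = generate_code(spec, context)
    audit.record("generate", code)
    docs = generate_docs(code)
    if not security_scan(code):
        audit.record("security", "blocked")
        return {"status": "blocked", "code": None, "docs": None}
    if needs_human_review(confidence):
        audit.record("review", "queued for human review")
        return {"status": "pending_review", "code": code, "docs": docs}
    audit.record("deploy", "released via CI/CD")  # Stage 7: deployment
    return {"status": "deployed", "code": code, "docs": docs}
```

In this sketch, low-confidence generations stop at the review gate rather than reaching deployment, and the final feedback stage would consume the accumulated audit entries (generation, review, and deployment outcomes) as its learning signal.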
References
Erik Nijkamp et al., "CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis," arXiv, 2023. DOI: https://doi.org/10.48550/arXiv.2203.13474 [Online]. Available: https://arxiv.org/abs/2203.13474
Jagadeesan Srinivasan et al., "Detection and analysis of prompt injection in Indian multilingual large language models," Sci Rep (2026). https://doi.org/10.1038/s41598-026-43883-0 [Online]. Available: https://www.nature.com/articles/s41598-026-43883-0_reference.pdf
Xing Xu et al., "Structure-Aware Lightweight Document-Level Event Extraction via Code-Based Large Language Models," Electronics 2026, 15(6), 1187; https://doi.org/10.3390/electronics15061187 [Online]. Available: https://www.mdpi.com/2079-9292/15/6/1187
Leonardo Ranaldi et al., "Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Question Answering Task," Findings of the Association for Computational Linguistics: EACL 2026, pages 697–716, March 24-29, 2026. [Online]. Available: https://aclanthology.org/2026.findings-eacl.35.pdf
Haodong Chen et al., "Beyond GeneGPT: A Multi-Agent Architecture with Open-Source LLMs for Enhanced Genomic Question Answering," Proceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (December 2025) https://doi.org/10.1145/3767695.3769488 [Online]. Available: https://dl.acm.org/doi/pdf/10.1145/3767695.3769488
Shunyu Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models," 37th Conference on Neural Information Processing Systems (NeurIPS 2023). [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2023/file/271db9922b8d1f4dd7aaef84ed5ac703-Paper-Conference.pdf
Federico Cassano et al., "MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation," IEEE Transactions on Software Engineering 49.7 (2023): 3675-3691. [Online]. Available: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10103177
Ning Miao et al., "Self-check: Using LLMs to zero-shot check their own step-by-step reasoning." arXiv preprint arXiv:2308.00436 (2023). [Online]. Available: https://arxiv.org/pdf/2308.00436
Long Ouyang et al., "Training language models to follow instructions with human feedback," Advances in neural information processing systems 35 (2022): 27730-27744. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf
Shengyu Zhang et al., "Instruction tuning for large language models: A survey." ACM Computing Surveys 58.7 (2026): 1-36. [Online]. Available: https://dl.acm.org/doi/pdf/10.1145/3777411
Nicholas Carlini et al., "Extracting training data from large language models," 30th USENIX security symposium (USENIX Security 21). 2021. [Online]. Available: https://www.usenix.org/system/files/sec21-carlini-extracting.pdf
Daniel Fried et al., "Incoder: A generative model for code infilling and synthesis," arXiv preprint arXiv:2204.05999 (2022). [Online]. Available: https://arxiv.org/pdf/2204.05999
Ahmed Soliman et al., "Leveraging pre-trained language models for code generation," Complex & Intelligent Systems 10.3 (2024): 3955-3980. [Online]. Available: https://link.springer.com/content/pdf/10.1007/s40747-024-01373-8.pdf
Sebastian Eggers, "Automating Data Lineage and Pipeline Extraction," Proceedings of the VLDB Endowment PhD Workshop (2024). ISSN 2150-8097. [Online]. Available: https://www.vldb.org/2024/files/phd-workshop-papers/vldb_phd_workshop_paper_id_11.pdf
Yucheng Hu and Yuxing Lu, "RAG and RAU: A Survey on Retrieval-Augmented Language Models in Natural Language Processing," arXiv preprint arXiv:2404.19543 (2024). [Online]. Available: https://arxiv.org/pdf/2404.19543
B. Roziere et al., "A Systematic Evaluation of Large Language Models of Code," Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. 2022. [Online]. Available: https://dl.acm.org/doi/pdf/10.1145/3520312.3534862
Jason Wei et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in neural information processing systems 35 (2022): 24824-24837. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.