Generative AI for Data Engineering: A Seven-Stage Orchestration Framework for LLM-Powered Code Generation

Authors

  • Mosaic Basha Syed

Keywords:

Generative AI, Large Language Models, Code Generation, Data Engineering, Orchestration Framework, Enterprise Architecture

Abstract

Data engineering organizations face persistent productivity challenges: platform complexity and the need to work across multiple technologies, programming languages, and frameworks increase the effort required to develop data pipelines. Maintaining existing pipelines is equally difficult and costly, as changing requirements, modernization pressure, and documentation that lags behind implementation complicate knowledge transfer and debugging. We propose a seven-stage orchestration architecture for applying LLMs to enterprise data engineering workflows, closing the gap between LLMs' theoretical code generation capabilities and their practical deployment in strictly regulated environments. The architecture governs LLMs through a process of specification ingestion, retrieval-augmented generation (RAG), multi-stage code generation with semantic validation, automated documentation writing, multi-layer security scanning, confidence-gated human-in-the-loop review, CI/CD deployment, and reinforcement feedback-based continuous learning. We adopt enterprise guardrails such as data classification, metadata-only retrieval, generation scope limits, and immutable audit trails to ensure security, regulatory compliance, and auditable assurance. Recent results in the code generation literature show that multi-turn synthesis, bidirectional context modeling, and human feedback can substantially improve generation effectiveness, and these findings inform our design choices. We present a comprehensive architecture for responsibly deploying LLMs in enterprise data engineering and plan to validate the approach through production deployment on banking data platforms.
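The staged flow outlined in the abstract can be sketched as a minimal Python orchestrator. This is an illustrative sketch only: all function names, the placeholder SQL and retrieval results, the banned-keyword list, and the 0.8 confidence threshold are assumptions for demonstration, not details taken from the paper.

```python
# Hypothetical sketch of a seven-stage LLM orchestration pipeline:
# ingest -> RAG -> generate -> document -> security scan -> gated review -> deploy.
# The continuous-learning feedback loop from the abstract is omitted for brevity.
from dataclasses import dataclass, field

@dataclass
class PipelineArtifact:
    spec: str
    context: list = field(default_factory=list)       # metadata-only RAG snippets
    code: str = ""
    docs: str = ""
    security_findings: list = field(default_factory=list)
    confidence: float = 0.0
    audit_trail: list = field(default_factory=list)   # append-only record

def log(a, stage, detail):
    # Append-only audit trail: every stage leaves an immutable-style record.
    a.audit_trail.append((stage, detail))

def ingest_spec(a):
    log(a, "1-ingest", f"spec ingested ({len(a.spec)} chars)")

def retrieve_context(a):
    a.context = ["table: customers (metadata only)"]  # placeholder retrieval
    log(a, "2-rag", f"{len(a.context)} metadata snippets retrieved")

def generate_code(a):
    a.code = "SELECT id, name FROM customers"         # placeholder LLM output
    a.confidence = 0.92                               # placeholder model score
    log(a, "3-generate", "code generated with semantic validation")

def write_docs(a):
    a.docs = f"-- Generated from spec: {a.spec[:40]}"
    log(a, "4-docs", "documentation drafted")

def scan_security(a):
    banned = ("DROP", "DELETE", "TRUNCATE")           # generation-scope limits
    a.security_findings = [kw for kw in banned if kw in a.code.upper()]
    log(a, "5-security", f"{len(a.security_findings)} findings")

def human_review(a, threshold=0.8):
    # Confidence-gated review: low confidence or any finding routes to a human.
    needs_review = a.confidence < threshold or bool(a.security_findings)
    log(a, "6-review", "routed to human" if needs_review else "auto-approved")
    return not needs_review

def deploy(a):
    log(a, "7-deploy", "promoted via CI/CD")

def run_pipeline(spec):
    a = PipelineArtifact(spec=spec)
    for stage in (ingest_spec, retrieve_context, generate_code,
                  write_docs, scan_security):
        stage(a)
    if human_review(a):
        deploy(a)
    return a
```

The gating logic makes the governance point concrete: deployment only happens when the generator's self-reported confidence clears a threshold and the security scan is clean; otherwise the artifact stops at the human-review stage with a complete audit trail of what each stage did.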


References

Erik Nijkamp et al., "CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis," arXiv, 2023. DOI: https://doi.org/10.48550/arXiv.2203.13474 [Online]. Available: https://arxiv.org/abs/2203.13474

Jagadeesan Srinivasan et al., "Detection and analysis of prompt injection in Indian multilingual large language models," Sci Rep (2026). https://doi.org/10.1038/s41598-026-43883-0 [Online]. Available: https://www.nature.com/articles/s41598-026-43883-0_reference.pdf

Xing Xu et al., "Structure-Aware Lightweight Document-Level Event Extraction via Code-Based Large Language Models," Electronics 2026, 15(6), 1187; https://doi.org/10.3390/electronics15061187 [Online]. Available: https://www.mdpi.com/2079-9292/15/6/1187

Leonardo Ranaldi et al., "Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Question Answering Task," Findings of the Association for Computational Linguistics: EACL 2026, pages 697–716, March 24-29, 2026. [Online]. Available: https://aclanthology.org/2026.findings-eacl.35.pdf

Haodong Chen et al., "Beyond GeneGPT: A Multi-Agent Architecture with Open-Source LLMs for Enhanced Genomic Question Answering," Proceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (December 2025) https://doi.org/10.1145/3767695.3769488 [Online]. Available: https://dl.acm.org/doi/pdf/10.1145/3767695.3769488

Shunyu Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models," 37th Conference on Neural Information Processing Systems (NeurIPS 2023). [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2023/file/271db9922b8d1f4dd7aaef84ed5ac703-Paper-Conference.pdf

Federico Cassano et al., "MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation," IEEE Transactions on Software Engineering 49.7 (2023): 3675-3691. [Online]. Available: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10103177

Ning Miao et al., "SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning," arXiv preprint arXiv:2308.00436 (2023). [Online]. Available: https://arxiv.org/pdf/2308.00436

Long Ouyang et al., "Training language models to follow instructions with human feedback," Advances in neural information processing systems 35 (2022): 27730-27744. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf

Shengyu Zhang et al., "Instruction tuning for large language models: A survey." ACM Computing Surveys 58.7 (2026): 1-36. [Online]. Available: https://dl.acm.org/doi/pdf/10.1145/3777411

Nicholas Carlini et al., "Extracting training data from large language models," 30th USENIX security symposium (USENIX Security 21). 2021. [Online]. Available: https://www.usenix.org/system/files/sec21-carlini-extracting.pdf

Daniel Fried et al., "InCoder: A Generative Model for Code Infilling and Synthesis," arXiv preprint arXiv:2204.05999 (2022). [Online]. Available: https://arxiv.org/pdf/2204.05999

Ahmed Soliman et al., "Leveraging pre-trained language models for code generation," Complex & Intelligent Systems 10.3 (2024): 3955-3980. [Online]. Available: https://link.springer.com/content/pdf/10.1007/s40747-024-01373-8.pdf

Sebastian Eggers, "Automating Data Lineage and Pipeline Extraction," Proceedings of the VLDB Endowment (ISSN 2150-8097), 2024. [Online]. Available: https://www.vldb.org/2024/files/phd-workshop-papers/vldb_phd_workshop_paper_id_11.pdf

Yucheng Hu and Yuxing Lu, "RAG and RAU: A Survey on Retrieval-Augmented Language Models in Natural Language Processing," arXiv preprint arXiv:2404.19543 (2024). [Online]. Available: https://arxiv.org/pdf/2404.19543

B. Roziere et al., "A Systematic Evaluation of Large Language Models of Code," Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, 2022. [Online]. Available: https://dl.acm.org/doi/pdf/10.1145/3520312.3534862

Jason Wei et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in neural information processing systems 35 (2022): 24824-24837. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf


Published

15.04.2026

How to Cite

Mosaic Basha Syed. (2026). Generative AI for Data Engineering: A Seven-Stage Orchestration Framework for LLM-Powered Code Generation. International Journal of Intelligent Systems and Applications in Engineering, 14(1s), 491–501. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/8204

Issue

Section

Research Article