Architecting Scalable Data Pipelines for Big Data: A Data Engineering Perspective

Authors

  • Swathi Chundru, Praveen Kumar Maroju

Keywords:

approaches, architecting, frameworks, environments

Abstract

The exponential growth of data across industries has made robust, scalable data pipelines necessary for managing, processing, and analyzing large volumes of data efficiently. Traditional data processing frameworks often struggle with the volume, variety, and velocity of modern data streams, which leads to bottlenecks and inefficiencies. This paper explores the key architectural principles and design patterns for building scalable data pipelines, focusing on batch and real-time streaming pipelines. We examine challenges associated with big data, such as data integration, fault tolerance, and scalability, and discuss how modern data engineering tools and frameworks can be used to overcome them. Through case studies and industry examples, the paper highlights practical approaches to architecting scalable data pipelines that meet the demands of big data environments.

Published

06.08.2024

How to Cite

Swathi Chundru. (2024). Architecting Scalable Data Pipelines for Big Data: A Data Engineering Perspective. International Journal of Intelligent Systems and Applications in Engineering, 12(23s), 1855–1870. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/7137

Section

Research Article