Architectural Optimization Techniques for High-Volume Batch Processing in Hadoop Ecosystems

Hariprasad Pandian

Keywords:

Hadoop Ecosystem, Batch Processing Optimization, MapReduce Performance, Distributed Computing Architecture, YARN Resource Management.

Abstract

The rapid growth of big data has created a need for robust, scalable architectures that can process very large data volumes efficiently. This paper examines architectural optimization techniques for high-volume batch processing in Hadoop ecosystems, targeting the key performance bottlenecks that limit throughput and resource utilization. We analyze advanced approaches, including dynamic resource allocation through YARN tuning, data locality enforcement, speculative execution tuning, and intelligent partitioning strategies, that improve MapReduce and Apache Spark job execution. The paper also discusses compression codec selection, columnar storage formats such as ORC and Parquet, and data pipeline orchestration with Apache Oozie and Apache Airflow to reduce latency and maximize cluster efficiency. Experimental results indicate that adaptive scheduling algorithms combined with optimized I/O configurations reduce job completion time by 40-65% across heterogeneous workloads. The proposed architectural framework gives practitioners and system architects practical guidelines for deploying production-grade Hadoop systems that support large-scale enterprise batch workloads. The results highlight the importance of holistic, multi-layered optimization, spanning hardware configuration, software tuning, and workflow design, for achieving peak performance in modern distributed data processing infrastructures.
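As a rough illustration of the tuning levers named in the abstract, the sketch below collects a handful of standard Apache Spark properties (dynamic allocation, speculative execution, locality wait, shuffle partitioning, Parquet compression) and renders them as `spark-submit` arguments. It is not the paper's configuration: the helper names `batch_tuning_conf` and `to_spark_submit_args` are hypothetical, and every value is a placeholder assumption that would need tuning for a real cluster and workload.

```python
from typing import Dict, List


def batch_tuning_conf(min_executors: int = 2,
                      max_executors: int = 50,
                      shuffle_partitions: int = 400) -> Dict[str, str]:
    """Return illustrative Spark settings for a batch job (values are assumptions)."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    if shuffle_partitions <= 0:
        raise ValueError("shuffle_partitions must be positive")
    return {
        # Dynamic resource allocation: let the executor count follow the load.
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # On YARN, dynamic allocation is commonly paired with the external shuffle service.
        "spark.shuffle.service.enabled": "true",
        # Speculative execution: re-launch tasks that run much slower than their peers.
        "spark.speculation": "true",
        # Data locality: how long the scheduler waits for a local slot before relaxing.
        "spark.locality.wait": "3s",
        # Partitioning: number of partitions used for SQL shuffles.
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
        # Columnar storage: compression codec used when writing Parquet.
        "spark.sql.parquet.compression.codec": "snappy",
    }


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a settings dict as a flat list of --conf key=value arguments."""
    args: List[str] = []
    for key in sorted(conf):  # sorted so the output is stable across runs
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_spark_submit_args(batch_tuning_conf())))
```

The resulting argument list can be spliced into a `spark-submit` command line. Keeping the settings in one function makes it easier to vary a single lever at a time when measuring its effect on job completion time.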


