Architectural Optimization Techniques for High-Volume Batch Processing in Hadoop Ecosystems
Keywords:
Hadoop Ecosystem, Batch Processing Optimization, MapReduce Performance, Distributed Computing Architecture, YARN Resource Management
Abstract
The rapid growth of big data has created a need for robust, scalable architectures that can process very large data volumes efficiently. This paper examines architectural optimization techniques for high-volume batch processing in Hadoop ecosystems, targeting the key performance bottlenecks that limit throughput and resource utilization. We analyze advanced approaches, including dynamic resource allocation through YARN tuning, data locality enforcement, speculative execution tuning, and intelligent partitioning strategies, that improve MapReduce and Apache Spark job execution. The paper also discusses compression codec selection, columnar storage formats such as ORC and Parquet, and data pipeline orchestration with Apache Oozie and Apache Airflow as means of reducing latency and maximizing cluster efficiency. Experimental results indicate that adaptive scheduling algorithms combined with optimized input/output configurations reduce job completion time by 40-65% across heterogeneous workloads. The proposed architectural framework gives practitioners and system architects practical guidelines for deploying production-grade Hadoop systems that support large-scale enterprise batch workloads. The results highlight the importance of holistic, multi-layered optimization spanning hardware configuration, software tuning, and workflow design for achieving peak performance in contemporary distributed data processing infrastructures.
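As a rough illustration of how several of the levers named in the abstract (dynamic allocation, locality wait, speculative execution, shuffle partitioning, columnar compression) might be expressed for a Spark batch job, the sketch below collects them as standard Spark configuration properties and renders them as `spark-submit` arguments. The property names are standard Spark settings, but every value, the `BATCH_TUNING` and `to_submit_args` names, and the `job.py` script are assumptions chosen for illustration; the abstract does not report the settings actually used.

```python
from typing import Dict, List

# Illustrative values only (assumptions); the paper's abstract does not
# state which settings produced its reported results.
BATCH_TUNING: Dict[str, str] = {
    # Let Spark grow and shrink the executor pool on YARN.
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "50",
    # External shuffle service, commonly enabled with dynamic allocation.
    "spark.shuffle.service.enabled": "true",
    # How long to wait for a data-local slot before relaxing locality.
    "spark.locality.wait": "3s",
    # Re-launch slow tasks speculatively.
    "spark.speculation": "true",
    # Number of partitions used for Spark SQL shuffles.
    "spark.sql.shuffle.partitions": "400",
    # Compression codec for Parquet output.
    "spark.sql.parquet.compression.codec": "snappy",
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a property map as a flat list of --conf arguments."""
    args: List[str] = []
    for key in sorted(conf):  # sorted for a stable, reviewable order
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


if __name__ == "__main__":
    # job.py is a hypothetical application script.
    print(" ".join(["spark-submit", *to_submit_args(BATCH_TUNING), "job.py"]))
```

Keeping the settings in one reviewable map, rather than scattered across job scripts, is one way to make per-workload tuning explicit; suitable values depend on cluster size and workload and would need to be measured.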
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This license requires users to give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.


