High Performance Computing (HPC) in the Cloud: A Proactive Fault Tolerance (PFT) Strategy
Keywords:
High Performance Computing, Cloud computing, computation-intensive, Proactive Fault Tolerance
Abstract
High Performance Computing (HPC) applications benefit from the new models of computing, capacity, and elastic response that cloud computing provides. For instance, the Hardware as a Service (HaaS) paradigm lets users provision multiple Virtual Machines (VMs) for computation-intensive applications. Because an HPC system in the cloud uses a large number of VMs and electrical components, any execution failure forces applications to be re-run, which wastes time, money, and energy. In this research, we propose a Proactive Fault Tolerance (PFT) strategy for HPC systems in the cloud that targets the wall-clock execution time and the cost incurred when faults occur. We also developed an enhanced PFT technique for cloud-based HPC systems; this approach does not depend on a spare node being available before a failure is predicted. In addition, we built a cost model for running computation-intensive applications on HPC systems in the cloud. To evaluate the effectiveness of our strategy, we analyzed the monetary costs associated with provisioning spare nodes and with checkpointing. Our experimental results from a real cloud execution environment show that the cost and execution time of computation-intensive applications in the cloud may be lowered by up to 30%. Compared with existing PFT approaches, our PFT technique for HPC in the cloud may reduce how often computation-intensive applications are checkpointed by up to 50%.
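As an illustration of the kind of trade-off the abstract describes, the following minimal sketch compares the bill for a run that keeps a spare VM provisioned throughout against a run that provisions no spare and instead pays a small time overhead when a failure is predicted. This is not the authors' cost model: the linear pricing formula, the class and function names, and every number (VM count, hourly price, overhead and checkpoint durations) are assumptions chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of one HPC run (all fields are assumptions)."""
    n_vms: int                # VMs the application itself needs
    price_per_vm_hour: float  # assumed flat on-demand price per VM-hour
    compute_hours: float      # fault-free wall-clock time of the run


def run_cost(job: Job, extra_vms: int = 0, overhead_hours: float = 0.0) -> float:
    """Assumed linear model: billed VMs x hourly price x wall-clock hours."""
    hours = job.compute_hours + overhead_hours
    return (job.n_vms + extra_vms) * job.price_per_vm_hour * hours


def checkpoint_overhead(n_checkpoints: int, hours_per_checkpoint: float) -> float:
    """Wall-clock time spent writing checkpoints during the run."""
    return n_checkpoints * hours_per_checkpoint


job = Job(n_vms=16, price_per_vm_hour=0.5, compute_hours=10.0)

# Strategy A: one spare VM is billed for the whole run.
cost_with_spare = run_cost(job, extra_vms=1)

# Strategy B: no spare; an assumed 0.25 h is lost reacting to a predicted failure.
cost_without_spare = run_cost(job, overhead_hours=0.25)

# Effect of halving the number of checkpoints (0.125 h each, assumed).
overhead_full = checkpoint_overhead(20, 0.125)
overhead_half = checkpoint_overhead(10, 0.125)

print(cost_with_spare, cost_without_spare, overhead_full, overhead_half)
```

With these made-up inputs the spare-free run is cheaper, but the comparison depends entirely on the assumed overhead and prices; a realistic model would also need failure probabilities and prediction accuracy, which the abstract does not give.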
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This license lets the audience share and adapt the material, provided they give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.