High Performance Computing (HPC) in the Cloud: A Proactive Fault Tolerance (PFT) Strategy

Sunil Sharma; Garima Jain; Preethi D.; Shambhu Bhardwaj

Authors

Sunil Sharma, Assistant Professor, Department of Computer Science & Engineering, Vivekananda Global University, Jaipur, India
Garima Jain, Assistant Professor, Department of Computer Science and Business Systems (CSBS), Noida Institute of Engineering and Technology, Greater Noida, Uttar Pradesh, India
Preethi D., Assistant Professor, Department of Computer Science and IT, Jain (Deemed-to-be University), Bangalore-27, India
Shambhu Bhardwaj, Associate Professor, College of Computing Science and Information Technology, Teerthanker Mahaveer University, Moradabad, Uttar Pradesh, India

Keywords:

High Performance Computing, Cloud Computing, Computation-Intensive, Proactive Fault Tolerance

Abstract

High Performance Computing (HPC) applications benefit from the new paradigms for computation, capacity, and flexible solutions that cloud computing provides. For instance, the Hardware as a Service (HaaS) paradigm enables users to provision many Virtual Machines (VMs) for computation-intensive applications. Because an HPC system in the cloud uses many VMs and electronic components, any failure during execution forces applications to be re-run, which wastes time, money, and energy. In this research, we present a Proactive Fault Tolerance (PFT) strategy for HPC systems in the cloud that aims to reduce wall-clock execution time and cost when faults occur. We also develop an enhanced PFT technique for cloud-based HPC systems that does not depend on a spare node being available before a failure is predicted. We further develop a cost model for running computation-intensive applications on HPC systems in the cloud. To evaluate the effectiveness of our strategy, we examine the monetary costs of provisioning spare nodes and of checkpointing in PFT. Experimental results from a real cloud execution environment show that the cost and execution time of running computation-intensive applications in the cloud can be reduced by up to 30%. Compared with existing PFT approaches, our PFT technique for HPC in the cloud can reduce the frequency of checkpointing for computation-intensive applications by up to 50%.
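The trade-off between renting standby spare nodes and acquiring a replacement only after a failure is predicted can be illustrated with a toy calculation. The sketch below is not the authors' cost model, which the abstract does not specify. The names `Job`, `cost_with_standby_spares`, `cost_with_on_demand_spare`, and `overlap_hours`, as well as the uniform per-VM hourly price and the sample figures, are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of one HPC run on rented VMs."""
    num_vms: int               # VMs doing useful work
    hours: float               # fault-free wall-clock run time
    price_per_vm_hour: float   # assumed uniform on-demand price


def cost_with_standby_spares(job: Job, num_spares: int) -> float:
    """Spares are rented for the whole run, whether or not a fault occurs."""
    return (job.num_vms + num_spares) * job.hours * job.price_per_vm_hour


def cost_with_on_demand_spare(job: Job, predicted_failures: int,
                              overlap_hours: float) -> float:
    """A replacement VM is rented only once a failure is predicted.

    overlap_hours is the assumed time during which both the failing VM
    and its replacement are billed while work is migrated.
    """
    base = job.num_vms * job.hours * job.price_per_vm_hour
    extra = predicted_failures * overlap_hours * job.price_per_vm_hour
    return base + extra


if __name__ == "__main__":
    job = Job(num_vms=16, hours=10.0, price_per_vm_hour=0.5)
    print("standby spares:", cost_with_standby_spares(job, num_spares=2))
    print("on-demand spare:", cost_with_on_demand_spare(
        job, predicted_failures=1, overlap_hours=2.0))
```

Under these assumed numbers the standby configuration pays for two idle VMs over the full run, while the on-demand configuration pays only for the short overlap around each predicted failure. A realistic model would also need to account for checkpointing overhead and for failures that are not predicted.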
