An Evaluation of Major Fault Tolerance Techniques Used on High Performance Computing (HPC) Applications

Authors

Mirza Mohammed Akram Baig

Keywords:

management, multiprocessor, checkpoints, software rejuvenation, terminated, tolerance

Abstract

High performance computing (HPC) systems have a large number of constituent components used to facilitate data movement. Key characteristics of these systems include parallel processing, large memory, multiprocessor or multimode communication, and parallel file systems. Though they can turn around computations in scenarios that need maximum processing power, HPC systems face many challenges, key among them being fault tolerance. Today, most applications deal with faults by taking checkpoints frequently. Whenever a fault occurs, all the processes are terminated, and the task is reloaded from the last checkpoint. This paper evaluates key fault tolerance techniques used on HPC applications, both reactive and proactive. Reactive protocols discussed include checkpointing/restarting, replication, retry, and SGuard, while proactive techniques include preemptive migration, software rejuvenation, and self-healing strategy. As seen from the discussion on the drawbacks of each approach, efficient management of faults can best be achieved by using a hybrid system applying proactive and reactive measures simultaneously.
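The checkpoint/restart cycle described in the abstract can be illustrated with a minimal single-process sketch. It is not taken from the paper: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` fault injector, and the pickle-based file format are all assumptions made for illustration. Real HPC applications would typically coordinate checkpoints across many processes.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    The rename is atomic, so a crash mid-write leaves the last good
    checkpoint untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path, "rb") as handle:
        return pickle.load(handle)


def run(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical fault injector: the run raises when it
    reaches that step, which stands in for a terminated process.
    """
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

In this sketch, work done after the last checkpoint is lost when a fault occurs and is recomputed on restart, which is the cost that a shorter checkpoint interval trades against higher checkpointing overhead.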

Author Biography

Mirza Mohammed Akram Baig

Senior Member of Technical Staff, Illumio Inc
