Principles of Fault Tolerance in IT Systems: A Review

International Journal of Computer Science and Engineering
© 2024 by SSRG - IJCSE Journal
Volume 11 Issue 4
Year of Publication: 2024
Authors: Iehab Abduljabbar Kamil, Mohanad A. Al-Askari

How to Cite?

Iehab Abduljabbar Kamil, Mohanad A. Al-Askari, "Principles of Fault Tolerance in IT Systems: A Review," SSRG International Journal of Computer Science and Engineering, vol. 11, no. 4, pp. 1-9, 2024. Crossref, https://doi.org/10.14445/23488387/IJCSE-V11I4P101

Abstract:

This study reviews the fundamentals of fault tolerance in IT systems and aims to give a general understanding of the concepts involved. It examines fault tolerance in the context of IT systems, with a focus on fault isolation and online repair, and goes over the strategies used for each. A variety of fault-tolerant strategies are applied to increase fault tolerance. Among these, fault isolation is essential to creating a highly available, fault-tolerant system. Load balancing and failover techniques are also employed alongside fault isolation approaches to achieve high availability.

Keywords:

Fault Tolerance, Availability, Reliability, Scalability, IT Systems.
