E-book

Author Bauer, Eric.

Title Beyond redundancy : how geographic redundancy can improve service availability and reliability of computer-based systems / Eric Bauer, Randee Adams, Daniel Eustace

Published Hoboken, NJ : Wiley-IEEE Press, [2012]

©2012

Description 1 online resource (xxvi, 304 pages) : illustrations

Contents pt. 1. Basics -- Service, Risk, and Business Continuity -- Service Availability and Service Reliability -- pt. 2. Modeling and Analysis of Redundancy -- Understanding Redundancy -- Overview of External Redundancy -- External Redundancy Strategy Options -- Modeling Service Availability with External System Redundancy -- Understanding Recovery Timing Parameters -- Case Study of Client-Initiated Recovery -- Solution and Cluster Recovery -- pt. 3. Recommendations -- Georedundancy Strategy -- Maximizing Service Availability via Georedundancy -- Georedundancy Requirements -- Georedundancy Testing -- Solution Georedundancy Case Study -- Summary -- Appendix: Markov Modeling of Service Availability -- Acronyms -- References -- About the Authors -- Index

Machine generated contents note: pt. 1 BASICS -- 1. Service, Risk, and Business Continuity -- 1.1. Service Criticality and Availability Expectations -- 1.2. Eight-Ingredient Model -- 1.3. Catastrophic Failures and Geographic Redundancy -- 1.4. Geographically Separated Recovery Site -- 1.5. Managing Risk -- 1.5.1. Risk Identification -- 1.5.2. Risk Treatments -- 1.6. Business Continuity Planning -- 1.7. Disaster Recovery Planning -- 1.8. Human Factors -- 1.9. Recovery Objectives -- 1.10. Disaster Recovery Strategies -- 2. Service Availability and Service Reliability -- 2.1. Availability and Reliability -- 2.1.1. Service Availability -- 2.1.2. Service Reliability -- 2.1.3. Reliability, Availability, and Failures -- 2.2. Measuring Service Availability -- 2.2.1. Total and Partial Outages -- 2.2.2. Minimum Chargeable Disruption Duration -- 2.2.3. Outage Attributability -- 2.2.4. Systems and Network Elements -- 2.2.5. Service Impact and Element Impact Outages -- 2.2.6. Treatment of Planned Events -- 2.3. Measuring Service Reliability -- pt. 2 MODELING AND ANALYSIS OF REDUNDANCY -- 3. Understanding Redundancy -- 3.1. Types of Redundancy -- 3.1.1. Simplex Configuration -- 3.1.2. Redundancy -- 3.1.3. Single Point of Failure -- 3.2. Modeling Availability of Internal Redundancy -- 3.2.1. Modeling Active-Active Redundancy -- 3.2.2. Modeling Active-Standby Redundancy -- 3.2.3. Service Availability Comparison -- 3.3. Evaluating High-Availability Mechanisms -- 3.3.1. Recovery Time Objective (or Nominal Outage Duration) -- 3.3.2. Recovery Point Objective -- 3.3.3. Nominal Success Probability -- 3.3.4. Capital Expense -- 3.3.5. Operating Expense -- 3.3.6. Discussion -- 4. Overview of External Redundancy -- 4.1. Generic External Redundancy Model -- 4.1.1. Failure Detection -- 4.1.2. Triggering Recovery Action -- 4.1.3. Traffic Redirection -- 4.1.4. Service Context Preservation -- 4.1.5. Graceful Service Migration --
4.2. Technical Distinctions between Georedundancy and Co-Located Redundancy -- 4.3. Manual Graceful Switchover and Switchback -- 5. External Redundancy Strategy Options -- 5.1. Redundancy Strategies -- 5.2. Data Recovery Strategies -- 5.3. External Recovery Strategies -- 5.4. Manually Controlled Recovery -- 5.4.1. Manually Controlled Example: Provisioning System for a Database -- 5.4.2. Manually Controlled Example: Performance Management Systems -- 5.5. System-Driven Recovery -- 5.5.1. System-Driven Recovery Examples -- 5.6. Client-Initiated Recovery -- 5.6.1. Client-Initiated Recovery Overview -- 5.6.2. Failure Detection by Client -- 5.6.3. Client-Initiated Recovery Example: Automatic Teller Machine (ATM) -- 5.6.4. Client-Initiated Recovery Example: A Web Browser Querying a Web Server -- 5.6.5. Client-Initiated Recovery Example: A Pool of DNS Servers -- 6. Modeling Service Availability with External System Redundancy -- 6.1. Simplistic Answer -- 6.2. Framing Service Availability of Standalone Systems -- 6.3. Generic Markov Availability Model of Georedundant Recovery -- 6.3.1. Simplifying Assumptions -- 6.3.2. Standalone High-Availability Model -- 6.3.3. Manually Controlled Georedundant Recovery -- 6.3.4. System-Driven Georedundant Recovery -- 6.3.5. Client-Initiated Georedundancy Recovery -- 6.3.6. Complex Georedundant Recovery -- 6.3.7. Comparing the Generic Georedundancy Model to the Simplistic Model -- 6.4. Solving the Generic Georedundancy Model -- 6.4.1. Manually Controlled Georedundant Recovery Model -- 6.4.2. System-Driven Georedundant Recovery Model -- 6.4.3. Client-Initiated Georedundant Recovery Model -- 6.4.4. Conclusion -- 6.5. Practical Modeling of Georedundancy -- 6.5.1. Practical Modeling of Manually Controlled External System Recovery -- 6.5.2. Practical Modeling of System-Driven Georedundant Recovery -- 6.5.3. Practical Modeling of Client-Initiated Recovery -- 6.6. Estimating Availability Benefit for Planned Activities --
6.7. Estimating Availability Benefit for Disasters -- 7. Understanding Recovery Timing Parameters -- 7.1. Detecting Implicit Failures -- 7.1.1. Understanding and Optimizing Ttimeout -- 7.1.2. Understanding and Optimizing Tkeepalive -- 7.1.3. Understanding and Optimizing Tclient -- 7.1.4. Timer Impact on Service Reliability -- 7.2. Understanding and Optimizing RTO -- 7.2.1. RTO for Manually Controlled Recovery -- 7.2.2. RTO for System-Driven Recovery -- 7.2.3. RTO for Client-Initiated Recovery -- 7.2.4. Comparing External Redundancy Strategies -- 8. Case Study of Client-Initiated Recovery -- 8.1. Overview of DNS -- 8.2. Mapping DNS onto Practical Client-Initiated Recovery Model -- 8.2.1. Modeling Normal Operation -- 8.2.2. Modeling Server Failure -- 8.2.3. Modeling Timeout Failure -- 8.2.4. Modeling Abnormal Server Failure -- 8.2.5. Modeling Multiple Server Failure -- 8.3. Estimating Input Parameters -- 8.3.1. Server Failure Rate -- 8.3.2. Fexplicit Parameter -- 8.3.3. μclient Parameter -- 8.3.4. μtimeout Parameter -- 8.3.5. μclientsfd Parameter -- 8.3.6. μclient Parameter -- 8.3.7. Acluster-1 Parameter -- 8.3.8. μclient Parameter -- 8.3.9. μgrecover and μmigration Parameters -- 8.3.10. μduplex Parameter -- 8.3.11. Parameter Summary -- 8.4. Predicted Results -- 8.4.1. Sensitivity Analysis -- 8.5. Discussion of Predicted Results -- 9. Solution and Cluster Recovery -- 9.1. Understanding Solutions -- 9.1.1. Solution Users -- 9.1.2. Solution Architecture -- 9.2. Estimating Solution Availability -- 9.3. Cluster versus Element Recovery -- 9.4. Element Failure and Cluster Recovery Case Study -- 9.5. Comparing Element and Cluster Recovery -- 9.5.1. Failure Detection -- 9.5.2. Triggering Recovery Action -- 9.5.3. Traffic Redirection -- 9.5.4. Service Context Preservation -- 9.5.5. Graceful Migration -- 9.6. Modeling Cluster Recovery -- 9.6.1. Cluster Recovery Modeling Parameters -- 9.6.2. Estimating λsuperelement -- 9.6.3. Example of Super Element Recovery Modeling --
pt. 3 RECOMMENDATIONS -- 10. Georedundancy Strategy -- 10.1. Why Support Multiple Sites? -- 10.2. Recovery Realms -- 10.2.1. Choosing Site Locations -- 10.3. Recovery Strategies -- 10.4. Limp-Along Architectures -- 10.5. Site Redundancy Options -- 10.5.1. Standby Sites -- 10.5.2. N + K Load Sharing -- 10.5.3. Discussion -- 10.6. Virtualization, Cloud Computing, and Standby Sites -- 10.7. Recommended Design Methodology -- 11. Maximizing Service Availability via Georedundancy -- 11.1. Theoretically Optimal External Redundancy -- 11.2. Practically Optimal Recovery Strategies -- 11.2.1. Internal versus External Redundancy -- 11.2.2. Client-Initiated Recovery as Optimal External Recovery Strategy -- 11.2.3. Multi-Site Strategy -- 11.2.4. Active-Active Server Operation -- 11.2.5. Optimizing Timeout and Retry Parameters -- 11.2.6. Rapid Relogon -- 11.2.7. Rapid Context Restoration -- 11.2.8. Automatic Switchback -- 11.2.9. Overload Control -- 11.2.10. Network Element versus Cluster-Level Recovery -- 11.3. Other Considerations -- 11.3.1. Architecting to Facilitate Planned Maintenance Activities -- 11.3.2. Procedural Considerations -- 12. Georedundancy Requirements -- 12.1. Internal Redundancy Requirements -- 12.1.1. Standalone Network Element Redundancy Requirements -- 12.1.2. Basic Solution Redundancy Requirements -- 12.2. External Redundancy Requirements -- 12.3. Manually Controlled Redundancy Requirements -- 12.3.1. Manual Failover Requirements -- 12.3.2. Graceful Switchover Requirements -- 12.3.3. Switchback Requirements -- 12.4. Automatic External Recovery Requirements -- 12.4.1. System-Driven Recovery -- 12.4.2. Client-Initiated Recovery -- 12.5. Operational Requirements -- 13. Georedundancy Testing -- 13.1. Georedundancy Testing Strategy -- 13.1.1. Network Element Level Testing -- 13.1.2. End-to-End Testing -- 13.1.3. Deployment Testing -- 13.1.4. Operational Testing -- 13.2. Test Cases for External Redundancy -- 13.3. Verifying Georedundancy Requirements --
13.3.1. Test Cases for Standalone Elements -- 13.3.2. Test Cases for Manually Controlled Recovery -- 13.3.3. Test Cases for System-Driven Recovery -- 13.3.4. Test Cases for Client-Initiated Recovery -- 13.3.5. Test Cases at the Solution Level -- 13.3.6. Test Cases for Operational Testing -- 13.4. Summary -- 14. Solution Georedundancy Case Study -- 14.1. Hypothetical Solution -- 14.1.1. Key Quality Indicators -- 14.2. Standalone Solution Analysis -- 14.2.1. Construct Reliability Block Diagrams -- 14.2.2. Network Element Configuration in Standalone Solution -- 14.2.3. Service Availability Offered by Standalone Solution -- 14.2.4. Discussion of Standalone Solution Analysis -- 14.3. Georedundant Solution Analysis -- 14.3.1. Identify Factors Constraining Recovery Realm Design -- 14.3.2. Define Recovery Realms -- 14.3.3. Define Recovery Strategies -- 14.3.4. Set Recovery Objectives -- 14.3.5. Architecting Site Redundancy -- 14.4. Availability of the Georedundant Solution -- 14.5. Requirements of Hypothetical Solution -- 14.5.1. Internal Redundancy Requirements -- 14.5.2. External Redundancy Requirements -- 14.5.3. Manual Failover Requirements -- 14.5.4. Automatic External Recovery Requirements -- 14.5.5. Operational Requirements -- 14.6. Testing of Hypothetical Solution -- 14.6.1. Testing Strategy -- 14.6.2. Standalone Network Element Testing

Summary "While geographic redundancy can obviously be a huge benefit for disaster recovery, it is far less obvious what benefit is feasible and likely for more typical non-catastrophic hardware, software, and human failures. Georedundancy and Service Availability provides both a theoretical and practical treatment of the feasible and likely benefits of geographic redundancy for both service availability and service reliability. The text provides network/system planners, IS/IT operations folks, system architects, system engineers, developers, testers, and other industry practitioners with a general discussion about the capital expense/operating expense tradeoff that frames system redundancy and georedundancy"-- Provided by publisher

"This book provides both a theoretical and practical treatment of the feasible and likely benefits of geographic redundancy for both service availability and service reliability"-- Provided by publisher

Notes Title from home page (viewed Dec. 16, 2011)

Bibliography Includes bibliographical references (pages 298-299) and index

Notes Print version record

Subject Computer input-output equipment -- Reliability

Computer networks -- Reliability

Redundancy (Engineering)

COMPUTERS -- Software Development & Engineering -- General.

COMPUTERS -- Data Transmission Systems -- General.

COMPUTERS -- Networking -- Vendor Specific.

Form Electronic book

Author Adams, Randee.

Eustace, Dan.

ISBN 9781118104910

1118104919

9781118104927

1118104927

9781118104934

1118104935

1283282747

9781283282741
