You have made significant investments in availability and disaster recovery – but your ability to recover hasn’t been tested in years. Testing will:
- Improve your DR capabilities.
- Identify required changes to planning documentation and procedures.
- Validate DR capabilities for interested customers and auditors.
Our Advice
Critical Insight
- If you treat testing as a pass/fail exercise, you aren’t meeting the end goal of improving organizational resilience.
- Focus on identifying gaps and risks, and addressing them, before a real disaster hits.
- Take a realistic, iterative approach to resilience testing that starts with small, low-risk tests and builds on lessons learned.
Impact and Result
- Identify testing scenarios and scope that can deliver value to your organization.
- Create practical test plans with Info-Tech’s template.
- Demonstrate value from testing to gain buy-in for additional tests.
Member Testimonials
After each Info-Tech experience, we ask our members to quantify the real-time savings, monetary impact, and project improvements our research helped them achieve. See our top member experiences for this blueprint and what our clients have to say.
Overall Impact: 9.0/10
Average $ Saved: $32,499
Average Days Saved: 5
Client | Experience | Impact | $ Saved | Days Saved |
Boston Dynamics | Guided Implementation | 9/10 | $32,499 | 5 |
Take a Realistic Approach to Disaster Recovery Testing
Reduce costly downtime with a right-sized testing program that improves IT resilience.
Analyst Perspective
Most businesses make significant investments in disaster recovery and technology resilience. Redundant sites and systems, monitoring, intrusion prevention, backups, training, documentation: it all costs time and money.
But does this investment deliver expected value? Specifically, can you deliver service continuity in a way that meets business requirements?
You can’t know the answer without regularly testing recovery processes and systems. And more than just validation, testing helps you deliver service continuity by finding and addressing gaps in your plans and training your staff on recovery procedures.
Use the insights, tools, and templates in this research to create a streamlined and effective resilience testing program that helps validate recovery capabilities and enhance service reliability, availability, and continuity.
Andrew Sharp
Research Director, Infrastructure & Operations
Info-Tech Research Group
Executive Summary
Your Challenge
You have made significant investments in availability and disaster recovery (DR) – but your ability to recover hasn’t been tested in years. Testing will:
- Improve your DR capabilities.
- Identify required changes to planning documentation and procedures.
- Validate DR capabilities for interested customers and auditors.
Common Obstacles
Despite the value testing can offer, actually executing on DR tests is difficult because:
Info-Tech's Approach
Take a realistic approach to resilience testing by starting with small, low-risk tests, then iterating with the lessons you’ve learned:
Info-Tech Insight
If you treat testing as a pass/fail exercise, you aren’t meeting the end goal of improving organizational resilience. Focus on identifying gaps and risks so you can address them before a real disaster hits.
Process and Outputs
This research is accompanied by templates to help you achieve your goals faster.
1 - Establish the business rationale for DR testing.
2 - Review a range of options for testing.
3 - Prioritize tests that are most valuable to your business.
4 - Create a disaster recovery test plan.
5 - Establish a Test Program to support a regular testing cycle.
Outputs:
- DR Test Plan
- DR Testing Program Summary
Orange activity slides like the one on the left provide directions to help you make key decisions.
Key Deliverable:
Disaster Recovery Test Plan Template
Build a plan for your first disaster recovery test.
This document provides a complete example you can use to quickly build your own plan, including goals, milestones, participants, the test-day schedule, and findings from the after-action review.
Why test?
Testing helps you avoid costly downtime
- In a disaster scenario, speed matters. Immediately after an outage, the impact on the organization is small, but impact increases rapidly the longer the outage continues.
- A quick and reliable response and recovery can protect the organization from significant losses.
- A DRP testing and maintenance program helps ensure you’re ready to recover when you need to, rather than figuring it out as you go.
“Routine testing is vital to survive a disaster… that’s when muscle memory sets in. If you don’t test your DR plan it falls [in importance], and you never see how routine changes impact it.”
– Jennifer Goshorn
Chief Administrative Officer
Gunderson Dettmer LLP
Info-Tech members estimated even one day of system downtime could lead to significant revenue losses.
Average estimated potential loss* in thousands of USD due to a 24-hour outage (N=41)
*Data aggregated from 41 business impact analyses (BIAs) conducted with Info-Tech advisory assistance. BIAs evaluate potential revenue loss due to a full day of system downtime, at the worst possible time.
Run tests to enhance disaster recovery plans
Testing improves organizational resilience
- Identify and address gaps in your plans before a real disaster strikes.
- Cross-train staff on systems recovery.
- Go beyond testing technology to test recovery processes.
- Establish a culture that centers resilience in everyday decision-making.
Testing keeps DR documentation ready for action
- Update documentation ahead of tests to prepare for the testing exercise.
- Update documentation after testing to incorporate any lessons learned.
Testing validates that investments in resilience deliver value
- Confirm your organization can meet defined recovery time objectives (RTOs) and recovery point objectives (RPOs).
- Provide proof of testing for auditors, prospective customers, and insurance applications.
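As a rough illustration of confirming RTOs and RPOs, the sketch below compares measured results from a recovery test against the stated objectives and reports any gaps. The `RecoveryResult` structure, the system names, and all figures are hypothetical and are not part of Info-Tech's templates. The output is a list of gaps to address, not a pass/fail grade.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryResult:
    """Outcome of one recovery test for a single system (hypothetical structure)."""
    system: str
    rto: timedelta                 # recovery time objective agreed with the business
    rpo: timedelta                 # recovery point objective agreed with the business
    measured_recovery: timedelta   # time taken to restore service during the test
    measured_data_loss: timedelta  # age of the most recent recoverable data


def find_gaps(results):
    """Return readable descriptions of every objective that was missed."""
    gaps = []
    for r in results:
        if r.measured_recovery > r.rto:
            gaps.append(f"{r.system}: recovery took {r.measured_recovery}, RTO is {r.rto}")
        if r.measured_data_loss > r.rpo:
            gaps.append(f"{r.system}: data loss window {r.measured_data_loss}, RPO is {r.rpo}")
    return gaps


# Illustrative figures only.
results = [
    RecoveryResult("email", timedelta(hours=4), timedelta(hours=1),
                   timedelta(hours=3), timedelta(minutes=30)),
    RecoveryResult("erp", timedelta(hours=8), timedelta(hours=1),
                   timedelta(hours=10), timedelta(hours=2)),
]

for gap in find_gaps(results):
    print(gap)
```

In this made-up example, the email system meets both objectives while the ERP system misses both, which gives the team two concrete items for the after-action review.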
Overcome testing challenges
Despite the value of effective recovery testing, most IT organizations struggle to test recovery plans.
Common challenges
- Key resources don’t have time for testing exercises.
- You don’t have the technology to support live recovery testing.
- Tests are done ad hoc and lessons learned are lost.
- The business doesn’t support test exercises because the value isn’t understood.
- Tests are always artificially simple because RTOs and RPOs must be met to satisfy customer or auditor inquiries.
Overcome challenges with a realistic approach:
- Start small with tabletop and recovery tests for specific systems.
- Include recovery tests in operational tasks (e.g. restore systems when you have a maintenance window).
- Create testing plans for larger testing exercises.
- Build on successful tests to streamline testing exercises in the future.
- Don’t make testing a pass-fail exercise. Focus on identifying gaps and risks so you can address them before a real disaster hits.
Go beyond traditional testing
Different test techniques help validate recovery against different threats
- There are many threats to service continuity, including ransomware, severe weather events, geopolitical conflict, legacy systems, staff turnover, and day-to-day outages caused by human error, software updates, hardware failures, or network outages.
- At its core, disaster recovery planning is about recovery. A plan for service recovery will help you mitigate against many threats at once. The testing approaches on the right will help you validate different aspects of that recovery process.
- This research will provide an overview of the approaches outlined on the right and help you prioritize tests that are most valuable to your organization.
00 Identify a working group
30 minutes
Identify a group of participants who can fill the following roles and inform the discussions around testing in this research. A single person could fill multiple roles and some roles could be filled by multiple people. Many participants will be drawn from the larger DRP team.
Start by updating your disaster recovery plan (DRP)
Use Info-Tech’s Create a Right-Sized Disaster Recovery Plan research to identify recovery objectives based on business impact and outline recovery processes. Both are tremendously valuable inputs to your test plans.
Overall Business Continuity Plan
IT Disaster Recovery Plan
A plan to restore IT services (e.g. applications and infrastructure) following a disruption. A DRP:
BCP for Each Business Unit
A set of plans to resume business processes for each business unit. A business continuity plan (BCP) is also sometimes called a continuity of operations plan (COOP). BCPs are created and owned by each business unit, and creating a BCP requires deep involvement from the leadership of each business unit. Info-Tech’s Develop a Business Continuity Plan blueprint provides a methodology for creating business unit BCPs as part of an overall BCP for the organization.
Crisis Management Plan
A plan to manage a wide range of crises, from health and safety incidents to business disruptions to reputational damage. Info-Tech’s Implement Crisis Management Best Practices blueprint provides a framework for planning a response to any crisis, from health and safety incidents to reputational damage.
01 Confirm: why test at all?
15-30 minutes
Identify the value of recovery testing for your organization. Use language appropriate for a nontechnical audience. Start with the list below and add, modify, or delete bullet points to reflect your own organization.
Drivers for testing – Examples:
- Improve service continuity.
- Identify and address gaps in recovery plans before a real disaster strikes.
- Cross-train staff on systems recovery to minimize single points of failure.
- Identify how we coordinate across teams during a major systems outage.
- Exercise both recovery processes and technology.
- Support a culture that centers system resilience in everyday decision-making.
- Keep recovery documentation up-to-date and ready for action.
- Confirm that our stated recovery objectives can be met.
- Provide proof of testing for auditors, prospective customers, and insurance applications.
- We require proof of testing to pass audits and renew cybersecurity insurance.
Info-Tech Insight
Time-strapped technical staff will sometimes push back on planning and testing, objecting that the team will “figure it out” in a disaster. But the question isn’t whether recovery is possible – it’s whether the recovery aligns with business needs. If your plan is to “MacGyver” a solution on the fly, you can’t know if it’s the right solution for your organization.
Think about what and how you test
Find gaps and risks with tabletop testing
In a tabletop planning exercise, the team walks through a disaster scenario to outline the recovery workflow, and risks or gaps that could disrupt that workflow.
Tabletops are particularly effective because:
- They let you play out a wider range of scenarios than technology-based testing (e.g. full-scale, parallel), which is limited by cost and complexity.
- They are non-intrusive, so they can be executed more easily than other testing methodologies.
- The exercise translates into recovery documentation: you create a workflow as you go.
- A major site or service recovery scenario will review all aspects of the recovery process and create the backbone of your recovery plan.
02 Run a tabletop exercise
2 hours
Tabletop testing is part of our core DRP methodology, Create a Right-Sized Disaster Recovery Plan. This exercise can be run using cue cards, sticky notes, or on a whiteboard; many of our facilitators find building the workflow directly in flowchart software to be very effective.
Use our Recovery Workflow Template as a starting point.
Some tips for running your first tabletop exercise:
- Ahead of the exercise, decide on a scenario, identify participants, and book a meeting time.
  - For your first walkthrough of a DR scenario, we often recommend a scenario that considers a site failure requiring failover to a DR site.
  - For the first exercise, focus on technical aspects of recovery before bringing in members of the business. The technical team may need space to discuss the appropriate steps in the recovery process before you bring in business liaisons to discuss user acceptance testing (UAT).
  - A complete failover considers all systems and the viability of your second site, and can help identify parts of the process that require additional exercises.
- Review the scenario with participants. Then, discuss and document the recovery process, starting with initial notification of an event.
  - Record steps in the process on white cards or boxes.
  - On yellow and red cards, document gaps and risks in people, process, and technology requirements.
- Once you’ve walked through the process, return to the start.
  - Record the time required to complete each step. Consider identifying who is responsible for key steps. Identify any additional gaps and risks.
  - Clean up and record the results of the workflow. Save a copy with your DRP documentation.
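Once the workflow has been recorded, the step timings and the gaps from the yellow and red cards can be rolled up into a short summary. The sketch below is one possible way to do that. The steps, owners, durations, and gaps are invented for illustration, and it assumes the steps run one after another. If some steps can run in parallel, the total would overstate recovery time.

```python
from datetime import timedelta

# Each entry: (step, owner, estimated minutes, gaps or risks noted on cards).
# All values are hypothetical examples.
workflow = [
    ("Notify on-call lead of the outage", "Service desk", 15, []),
    ("Assess impact and declare a disaster", "IT manager", 30,
     ["No documented declaration criteria"]),
    ("Fail over core network to DR site", "Network admin", 60, []),
    ("Restore database from replica", "DBA", 120,
     ["Only one DBA knows the procedure"]),
    ("Run user acceptance tests", "Business liaison", 45, []),
]


def summarize(steps):
    """Return the end-to-end time estimate and a flat list of (step, gap) pairs."""
    total = timedelta(minutes=sum(minutes for _, _, minutes, _ in steps))
    gaps = [(step, gap) for step, _, _, found in steps for gap in found]
    return total, gaps


total, gaps = summarize(workflow)
print(f"Estimated end-to-end recovery time: {total}")
for step, gap in gaps:
    print(f"Gap in '{step}': {gap}")
```

Comparing the estimated total against the RTO for the affected services shows whether the documented process is likely to meet business requirements before any live test is run.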
Move from tabletop testing to functional exercises
See how your plans fare in the real world
In live exercises, some portion of your recovery plans is executed in a way that mimics a real recovery scenario. Some advantages of live testing:
- See how standby systems behave: A tabletop exercise can miss small issues that can make or break the recovery process. For example, connectivity or integration issues on a new subnet might be difficult to predict prior to actually running services in that environment.
- Hands-on practice: Familiarize the team with the steps, commands, and interfaces of your recovery toolset.
- Manage the pressure of the DR scenario: Nothing’s quite like the real thing, but a live exercise may be the closest your team can get to a disaster situation without experiencing it firsthand.
Examples of live exercises
Boot and smoke test | Turn on a standby system and confirm it boots up correctly. |
Restore and validate data | Restore data or servers from backup. Confirm data integrity. |
Parallel testing | Send familiar transactions to production and standby systems. Confirm both systems produce the same result. |
Failover systems | Shut down the production system and use the standby system in production. |
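The parallel testing row above can be sketched in a few lines: send the same transactions to both systems and flag any result that differs. The `production` and `standby` functions below are stand-ins invented for this example. In a real test they would be replaced by calls to your actual systems, which this sketch does not attempt to model.

```python
def run_parallel_test(transactions, send_to_production, send_to_standby):
    """Send each transaction to both systems and collect mismatched results."""
    mismatches = []
    for tx in transactions:
        prod_result = send_to_production(tx)
        standby_result = send_to_standby(tx)
        if prod_result != standby_result:
            mismatches.append((tx, prod_result, standby_result))
    return mismatches


# Hypothetical stand-in systems: the standby holds a stale price for one item.
PRODUCTION_PRICES = {"A100": 25.0, "B200": 40.0}
STANDBY_PRICES = {"A100": 25.0, "B200": 38.0}


def production(tx):
    return PRODUCTION_PRICES[tx["sku"]] * tx["qty"]


def standby(tx):
    return STANDBY_PRICES[tx["sku"]] * tx["qty"]


transactions = [{"sku": "A100", "qty": 2}, {"sku": "B200", "qty": 1}]
mismatches = run_parallel_test(transactions, production, standby)
for tx, prod_result, standby_result in mismatches:
    print(f"{tx}: production={prod_result}, standby={standby_result}")
```

A mismatch like the one in this example usually points to stale data or configuration drift on the standby system, which is exactly the kind of gap a parallel test is meant to surface.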
Run local tests ahead of releases
Think small
Most unacceptable downtime is caused by localized issues, such as hardware or software failures, rather than widespread destructive events. Regular local testing can help validate the recovery plan for local issues and improve overall service continuity.
Make local testing a standard step in maintenance work and new deployments to embed resilience considerations in day-to-day activities. Run the same tests in both your primary and your DR environment.
Some examples of localized tests:
- Review backup logs and check for errors.
- Restore files or whole systems from backup.
- Run application-based tests as part of release management, including unit, regression, and performance tests.
  - Ensure application tests are run for both the primary and DR environment.
  - For a deep dive on application testing, see Info-Tech’s research Automate Testing to Get More Done.
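Reviewing backup logs for errors is a good candidate for light automation. The sketch below scans log text and groups error messages by backup job. The log format, job names, and messages are assumptions made for this example. Real backup products each have their own log layout, so the pattern would need to be adapted.

```python
import re

# Assumed log format (hypothetical): "<timestamp> <LEVEL> <job>: <message>"
LINE = re.compile(r"^(\S+) (INFO|WARNING|ERROR) (\S+): (.*)$")


def failed_jobs(log_text):
    """Return {job: [messages]} for every ERROR line in a backup log."""
    failures = {}
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match and match.group(2) == "ERROR":
            failures.setdefault(match.group(3), []).append(match.group(4))
    return failures


sample_log = """\
2024-03-01T01:00:05Z INFO fileserver-nightly: backup started
2024-03-01T01:42:10Z INFO fileserver-nightly: backup completed
2024-03-01T02:00:03Z INFO erp-db-nightly: backup started
2024-03-01T02:05:47Z ERROR erp-db-nightly: snapshot failed, target volume full
"""

for job, messages in failed_jobs(sample_log).items():
    print(job, "->", "; ".join(messages))
```

A clean log only shows that the backup job ran without reporting errors. It does not replace restoring files or whole systems to confirm the backup is usable.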
Info-Tech Insight
Local tests will vary between different services, and local test design is usually best left to the system SMEs. At the same time, centralize reporting to understand where tests are being done.
Investigate whether your IT Service Management or ticketing system can create recurring tasks or work orders to schedule, document, and track test exercises. Tasks can be pre-populated with checklists and documentation to support the test and provide a record of completed tests to support oversight and reporting.
Have the business validate recovery
If your business doesn’t think a system’s recovered, it’s not recovered.
User acceptance testing (UAT) after system recovery is a key step in the recovery process. Like any step in the process, there’s value in testing it before it actually needs to be done. Assign responsibility for building UATs to the person who will be responsible for executing them.
An acceptance test script might look something like the checklist below.
- Does the application open?
- Does the interface look right?
- Do you see any unusual notifications or warnings?
- Can you conduct a key transaction with dummy data?
- Can you run key reports?
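If acceptance results are recorded in a consistent structure, they are easier to report on and reuse in later tests. The sketch below is one hypothetical way to capture the checklist above. Each response records whether the check passed (True) or failed (False), rather than a literal yes or no, and the system is treated as accepted only when every check has passed.

```python
UAT_CHECKLIST = [
    "Does the application open?",
    "Does the interface look right?",
    "Do you see any unusual notifications or warnings?",
    "Can you conduct a key transaction with dummy data?",
    "Can you run key reports?",
]


def evaluate(responses):
    """Evaluate recorded checks.

    responses maps a checklist question to True (check passed) or
    False (check failed). Returns (accepted, failed, unanswered).
    """
    unanswered = [q for q in UAT_CHECKLIST if q not in responses]
    failed = [q for q in UAT_CHECKLIST if responses.get(q) is False]
    accepted = not unanswered and not failed
    return accepted, failed, unanswered


# Example: the business tester could not run key reports.
responses = {question: True for question in UAT_CHECKLIST}
responses["Can you run key reports?"] = False

accepted, failed, unanswered = evaluate(responses)
print("Recovered from the business's point of view:", accepted)
for question in failed:
    print("Failed check:", question)
```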
“I cannot stress how important it is to assign ownership of responsibilities in a test; this is the only way to truly mitigate against issues in a test.”
– Robert Nardella
IT Service Management
Certified z/OS Mainframe Professional
Info-Tech Insight
Build test scripts and test transactions ahead of time to minimize the amount of new work required during a recovery scenario.
Beyond the Basics: Full Failover Testing
- A failover test – a full failover of your production environment to a secondary environment – is what many IT and businesspeople think about when they think of disaster recovery testing.
- A full test can validate previous local or tabletop tests, identify additional gaps and risks, and provide hands-on training experience with recovery processes and technologies.
- Setting a date for failover testing can also inject some urgency into otherwise low-priority (but high-importance) disaster recovery planning and documentation exercises, which need to be completed prior to the test.
- Despite these benefits, full failover tests carry significant risk and require a great deal of effort and cost. Typically, only businesses that already have an active-active environment capable of supporting in-scope production systems are able to run a full environment failover.
- This is especially true the first time you test. While in theory a DR plan should be ready to go at any time, there will be documents to update, gaps to address, and risks to mitigate before you go ahead with the test.
Full Failover Testing
What you get:
- Provide hands-on experience with recovery processes and technology.
- Confirm that site failover works in practice as you assumed in tabletop or local testing exercises.
- Identify critical gaps you might have missed without a full failover test.
What you need:
- An active-active secondary site, with sufficient standby equipment, data, and licensed standby software to support production.
- A completed tabletop exercise and documented recovery workflow.
- A documented test plan, backout plan, and formal sign-off.
- An off-hours downtime window.
- Time from technical SMEs and business resources, both for creating the plan and executing the test.
Beyond the Basics: Site Reliability Engineering
- Site reliability engineering (SRE) is an application of skills and approaches from software engineering to improve system resilience.
- SRE is focused on “availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning” across a set portfolio of services (Sloss, 2017).
- In many organizations, SRE is implemented as a team that supports separate applications teams.
- Applications must have defined and granular resilience requirements, translated into service objectives. The SRE team and applications teams will work together to meet these objectives.
- Site reliability engineers (the people who do SRE, also abbreviated as SREs) are expected to build solutions and processes to ensure services remain stable and performant, not just respond when they fail. For example, Google caps the time its SREs spend on operational work such as incident response at about half their time, with the rest focused on development and automation tasks.
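One common way to make a resilience requirement granular is to state it as an availability target and work out how much downtime that target allows, often called an error budget. The sketch below shows the arithmetic. The function name, the 30-day period, and the example targets are assumptions for illustration, not figures from this research.

```python
def allowed_downtime_minutes(availability_target, period_days=30):
    """Minutes of downtime permitted in the period while still meeting the target."""
    if not 0 < availability_target <= 1:
        raise ValueError("availability_target must be a fraction, e.g. 0.999")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_target)


for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} over 30 days -> {minutes:.1f} minutes of downtime")
```

For example, a 99.9% target over 30 days leaves about 43 minutes of downtime. Framing objectives this way gives business and product owners a concrete number to weigh against the cost of achieving it.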
Site Reliability Testing
What you get:
- Improved reliability and reduced frequency and impact of downtime.
- Increased use of automation to address problems before they cause an incident.
- Granular resilience objectives.
What you need:
- Systems running on software-defined infrastructure.
- Specialized skills in programming and infrastructure-as-code.
- Business & product owners able to define and fund acceptable and appropriate resilience objectives.
- Technical experts able to translate product requirements into technical design requirements.