By Chris Williams
- “What are the questions I should be asking my organization to enable cloud disaster recovery?”
- “What is the process?”
- “How is it different from my on-premises to co-location disaster recovery strategy?”
- “Who is this Doctor Cloud and why are you writing so much about him?” (OK, that one is from my dad and I thought it was funny 😊 #DadJokes)
As a result of these lines of inquiry, I figured it would be a great time to write an article on the process that I go through with my customers and highlight some of the best practices, gotchas and pitfalls that I’ve come across along the way.
“What do we consider a disaster?” In order to tackle a project like this, you must first go through a few exercises (get it? Bootcamp? Exercise?). We need to define what it is your organization is trying to do. What is a disaster to your organization?
It’s not just the proverbial meteor strike. I’ve seen multiple answers to this question even within the same vertical! Here are a few examples:
- The website must NEVER GO DOWN
- Payroll must always go out every other Thursday no matter what
- Full data center outage
- Loss of email/ERP/CRM application suite
Before you do anything else, it’s important to define which things matter most to your organization in terms of “running the business.” If you don’t know this, you can’t solve for a disaster.
Benefits of the cloud
“Why should we want to move to the cloud?” There are a *TON* of benefits to moving your disaster recovery strategy to the cloud, but there can also be a few cons.
- Flexibility/Elasticity: The cloud allows you to create resources on the fly. If your on-premises data center grows or shrinks over the course of time, you can quickly adjust your cloud disaster recovery environment to reflect those changes. If you have a physical disaster recovery site, you could potentially have to purchase 2x equipment during a growth phase and have 2x too much hardware during a shrink phase.
- Cost Optimization: I can’t say this loud enough: When your cloud compute resources aren’t turned on, you aren’t paying for them to run (you will typically still pay for the storage that holds your replicated data). If you have an on-premises disaster recovery strategy, you’ve sunk CapEx into that site in addition to your main data center.
Now that we’ve gotten the big questions out of the way, we can get into the actual process. The first step is gathering the requirements from your business and technical stakeholders. We already know what is important to the business; now we need to know how that ties into the actual systems that we want to back up in the event of a disaster. The stakeholders will be:
- C-level executives
- Application owners
- Operational owners
Sometimes this group has been formalized in an org as the Cloud Center of Excellence. If one doesn’t formally exist, I always encourage my customers to create it. Having a CCoE to help your organization’s cloud strategy is extremely important! Now that I think about it … I should write another article on that.
When you have this group gathered together, you can answer the questions needed to create your disaster recovery strategy:
- What is important to the business?
- How does this relate to the existing infrastructure?
- What RPO/RTO/SLAs are currently in place and do they reflect what is actually important to the business?
Now that we’ve collected our requirements, we can prioritize our resources. By knowing what’s important to the business, we can create tiers for the different systems in the org. These tiers are defined by how quickly a resource needs to be running in the disaster recovery site once a disaster event has happened:
Tier 1 = Critical business applications that have the shortest RPO/RTO times based upon SLA.
Tier 2 = Important business applications that must come up secondary to Tier 1 applications.
Tier 3 = Business applications that come up last, after the Tier 1 and Tier 2 applications.
Each organization will have a different tiering structure based upon its needs.
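To make the tiering idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the system names, the RTO/RPO numbers, and the `TIER_LIMITS` thresholds are hypothetical, and your organization would set its own based on its SLAs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    rto_minutes: int  # recovery time objective: maximum tolerable downtime
    rpo_minutes: int  # recovery point objective: maximum tolerable data-loss window


# Hypothetical thresholds as (tier, maximum RTO in minutes).
# Anything slower than the last threshold falls into Tier 3.
TIER_LIMITS = [(1, 60), (2, 240)]


def assign_tier(system: System) -> int:
    """Return the tier for a system based on how quickly it must be running."""
    for tier, max_rto in TIER_LIMITS:
        if system.rto_minutes <= max_rto:
            return tier
    return 3


def recovery_order(systems):
    """Sort systems by tier, then by tightest RTO within a tier."""
    return sorted(systems, key=lambda s: (assign_tier(s), s.rto_minutes, s.name))


# Illustrative inventory only; not taken from any real environment.
systems = [
    System("intranet-wiki", rto_minutes=1440, rpo_minutes=1440),
    System("payroll", rto_minutes=240, rpo_minutes=60),
    System("public-website", rto_minutes=15, rpo_minutes=5),
]

for s in recovery_order(systems):
    print(f"Tier {assign_tier(s)}: {s.name}")
```

Keeping the thresholds in one small table makes it easy to revisit them whenever the business changes its mind about what counts as critical.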
Application and process mapping
This is a tough part: now that we know what needs to be moved and in what priority, we need to understand how it ties together with everything else. There are tools/teams that do this for a living because nobody (and I mean nobody) remembers all of the integrations of an application two years from its inception. 😊
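Once the integrations are written down, even a simple dependency map can tell you the order in which things have to come up. The sketch below uses Python's standard-library `graphlib` (Python 3.9+). The system names and their dependencies are hypothetical; a real map would come out of your discovery tooling or interviews with the application owners.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on the systems in its set,
# so those systems must be running first.
dependencies = {
    "public-website": {"app-server", "dns"},
    "app-server": {"database", "identity-provider"},
    "database": set(),
    "identity-provider": {"dns"},
    "dns": set(),
}


def startup_order(deps):
    """Return one valid startup order in which every dependency precedes its dependents."""
    return list(TopologicalSorter(deps).static_order())


order = startup_order(dependencies)
print(order)
```

If two systems depend on each other, `static_order()` raises `graphlib.CycleError`, which is a useful signal that the mapping (or the architecture) needs another look before a disaster forces the issue.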
The runbook is the step-by-step, by-the-numbers process for turning on the disaster recovery site. It must be written in such a way that anyone can follow it, because the people who wrote it might not be available during the disaster to help.
TEST, TEST, TEST! (and then TEST again!)
This part gets a LOT of my customers. You not only have to test, but you must test the entire environment at once. Doing a random file restore occasionally to prove reliability is important, but you also have to spin up the WHOLE THING to make sure that performance and stability are consistent in the new cloud location.
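One way to make a full-environment test actionable is to record how long each system actually took to come up and compare that against its RTO. This is a minimal sketch; the `DrillResult` structure, the system names, and the timings are all hypothetical and stand in for whatever your test produces.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    system: str
    rto_minutes: int         # agreed recovery time objective
    measured_minutes: float  # how long recovery actually took in the test


def failed_objectives(results):
    """Return the drill results where recovery took longer than the RTO."""
    return [r for r in results if r.measured_minutes > r.rto_minutes]


# Illustrative numbers only.
results = [
    DrillResult("public-website", rto_minutes=15, measured_minutes=12.5),
    DrillResult("payroll", rto_minutes=240, measured_minutes=310.0),
]

for r in failed_objectives(results):
    print(f"{r.system}: took {r.measured_minutes} min, objective was {r.rto_minutes} min")
```

Anything that shows up in this list is either a runbook problem, a sizing problem, or an SLA that needs to be renegotiated with the business.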
This is a (very) quick take on some of the considerations and steps one takes when thinking about a cloud disaster recovery strategy. Let me know if you have any questions or want to chat!
Chris Williams is a multi-cloud consultant and AWS hero who helps customers design and deploy the next generation of public, private and multi-cloud solutions.