One of my challenges on nearly every project is the almost inevitable conversation about resilience. You see, the duet between high availability (HA) and disaster recovery (DR) is like modern jazz: complex, elusive and hard to define.
All customers understand the importance of their platform: It must be available very close to 100% of the time, and if a disaster occurs, it needs to be revived very quickly. What organisations may not have considered in any real detail is precisely how critical the platform is, what kinds of disasters are likely to strike, and how to recover from disaster.
Let’s get something straight from the start: The logistics of HA and DR have little to do with technology. Rather, they are driven more by business needs and operational requirements. If organisations start with this in mind and leave technology until later, everything will work out better in the end.
Let me lead you through the process.
Your first task when aiming for resilience is to examine your business needs and understand the impact of platform unavailability for various lengths of time: ten minutes, an hour, a day, a week or more. What will you lose operationally, contractually and monetarily in each scenario? What are other potential business impacts? Only when you understand these issues can you formulate a plan to ensure that your platform meets your needs.
What should emerge from this analysis are definitions of tolerable outage, data loss and recovery time. For instance, HA is typically expressed as a percentage, often 99% or higher, usually computed with a downtime calculator, which you can find online. (By the way, the data used to compute this percentage is probably more useful to the architect than the percentage—which is really just a convenient number to plug into presentations.)
Why is the data more useful than the percentage? Well, outages rear their ugly heads in many forms, and downtime duration varies. So if you have a 99.9% availability target, that translates to a total downtime allowance of 8 hours, 45 minutes and 36 seconds over a 365-day year, or about 1 minute, 26 seconds per day. But those figures are quite useless on their own, unless you can predict (and you cannot!) that you will have only one outage and that it won’t exceed 8:45:36.
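To make the arithmetic concrete, here is a minimal Python sketch of what a downtime calculator does. The function name `downtime_budget` is my own invention for illustration, and the sketch assumes a 365-day year and rounds to the nearest second.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, period: timedelta) -> timedelta:
    """Return the total downtime allowed in `period` for an availability target.

    `availability_pct` is a percentage such as 99.9. The result is rounded
    to the nearest whole second.
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1 - availability_pct / 100
    return timedelta(seconds=round(period.total_seconds() * unavailable_fraction))


# Assumption: a non-leap year of 365 days.
per_year = downtime_budget(99.9, timedelta(days=365))
per_day = downtime_budget(99.9, timedelta(days=1))
print(per_year)  # 8:45:36
print(per_day)   # 0:01:26
```

Note that the calculator says nothing about how that allowance will be spent: one long outage and fifty short ones produce the same percentage.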
That’s where scenario-building comes in very handy. Perhaps you anticipate a couple of complete outages and a few partial outages. For a partial outage, can you scale the outage duration by the share of users affected? In other words, does a one-hour outage affecting 10% of users count as 6 minutes (10% of an hour)? Whatever data is available, allow for several scenarios. Previous experience of the solution or platform should help you do that; even for new systems, vendors like CA have data about reliability and potential threats to availability. What you should aim to end up with is a matrix of possible failures, with downtime attributed to each, and a recovery plan for each failure.
At that point, you have something you can apply to the technology and begin to build resilience.
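As a rough illustration, the failure matrix can be sketched as a small Python structure. Everything here is hypothetical: the scenario names, frequencies, durations and recovery plans are invented for the example, and weighting downtime by the fraction of users affected is only one possible convention, which your business may or may not accept.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class FailureScenario:
    name: str
    expected_per_year: float   # how often you expect this failure (assumed)
    duration: timedelta        # typical outage length per occurrence (assumed)
    users_affected: float      # fraction of users hit, from 0.0 to 1.0
    recovery_plan: str


def weighted_downtime(s: FailureScenario) -> timedelta:
    """Yearly downtime for a scenario, scaled by the share of users affected."""
    return s.duration * (s.users_affected * s.expected_per_year)


# Hypothetical matrix: every figure below is an illustrative assumption.
scenarios = [
    FailureScenario("database failover", 2, timedelta(minutes=15), 1.0,
                    "promote the replica"),
    FailureScenario("regional network outage", 1, timedelta(hours=1), 0.1,
                    "reroute traffic"),
    FailureScenario("failed deployment", 4, timedelta(minutes=30), 0.5,
                    "roll back the release"),
]

# Allowance for a 99.9% target over a 365-day year.
budget = timedelta(hours=8, minutes=45, seconds=36)
total = sum((weighted_downtime(s) for s in scenarios), timedelta())

for s in scenarios:
    print(f"{s.name}: {weighted_downtime(s)} -> {s.recovery_plan}")
print(f"total {total}, within budget: {total <= budget}")
```

With these made-up numbers, the one-hour outage affecting 10% of users contributes 6 minutes, and the three scenarios together stay well inside the 99.9% allowance.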
So is that it? Actually, no. You may find that the technology doesn’t support what you need to achieve or that the cost is way beyond the benefit. Now it’s time to rework the scenarios until you arrive at an availability percentage that is technically and financially viable but still meets business needs.
DR is a different beast altogether: The possibilities are endless and the cost/benefit curve can get very steep very quickly. My first question to customers is, “What disasters do you anticipate?” Predictably, a typical answer is “A plane hitting the building”—an unfortunately understandable, but not very useful, response. Nevertheless, the question provokes meaningful discussion.
DR is so much more than replicating data in a second data centre—a frequently stated goal that is often not the best or most effective answer. As with HA, look at the business case and identify the recovery time objective (RTO) and recovery point objective (RPO), two essential parameters of a disaster recovery plan (DRP).
Don’t immediately accept the first RTO and RPO handed to you: Make stakeholders justify them. Only then can you start to create a DRP. You may find that a second data centre is not necessary; it may be more effective to rebuild from backups in a recovered data centre or a cloud environment. Having infrastructure sitting around doing nothing while waiting for a disaster can be expensive and is often unnecessary.
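A simple way to see why the justified RTO and RPO matter is to test candidate recovery options against them. The Python sketch below is purely illustrative: the option names, recovery times, data-loss windows and costs are all assumptions I have made up, and a real comparison would weigh far more than annual cost.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class RecoveryOption:
    name: str
    recovery_time: timedelta   # how long until service is restored (assumed)
    max_data_loss: timedelta   # worst-case window of lost data (assumed)
    annual_cost: float         # illustrative figure, in any currency


def cheapest_viable(options: list[RecoveryOption],
                    rto: timedelta,
                    rpo: timedelta) -> Optional[RecoveryOption]:
    """Return the cheapest option that meets both RTO and RPO, or None."""
    viable = [o for o in options
              if o.recovery_time <= rto and o.max_data_loss <= rpo]
    return min(viable, key=lambda o: o.annual_cost, default=None)


# Hypothetical options: all numbers are invented for the example.
options = [
    RecoveryOption("second data centre (hot standby)",
                   timedelta(minutes=5), timedelta(minutes=1), 500_000),
    RecoveryOption("rebuild from backups in the cloud",
                   timedelta(hours=8), timedelta(hours=24), 60_000),
]

relaxed = cheapest_viable(options, rto=timedelta(hours=12), rpo=timedelta(hours=24))
strict = cheapest_viable(options, rto=timedelta(hours=1), rpo=timedelta(minutes=15))
print(relaxed.name)
print(strict.name)
```

With the relaxed objectives, rebuilding from backups is enough; only the strict ones force the standby data centre. That difference is the reason to challenge the first RTO and RPO you are given.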
So, I hope the message is clear. Put away thoughts of technology and think about your business objectives first. That will likely give you an easier and cheaper route to resilience—and that would be music to your ears.
Inquiring minds want to know: What business objectives are presenting challenges to you now? Let us know!