Disaster Recovery strategies in the cloud (and in general)

Prasanth Kanikicherla
Jun 17, 2020

These can probably be used for on-premises solutions as well, but I am focusing more on cloud examples in this blog.

Introduction

As part of my preparation for certain certifications, I stumbled across a few strategies specified for disaster recovery. These can be tweaked to apply to any application, or any public cloud for that matter. Thinking about it, they even apply to any Kubernetes environment; technically, after all, it's just another PaaS.

I especially liked going through the AWS DR strategy whitepaper, located here → https://d1.awsstatic.com/whitepapers/aws-disaster-recovery.pdf

(Figure: DR strategies mentioned in the AWS whitepaper)

The main deciding factors when strategizing a DR plan are RPO and RTO. Now, what are these? Well, they stand for Recovery Point Objective and Recovery Time Objective. In simple terms, RPO suggests how much data loss the business teams can accept (measured in time), and RTO suggests how fast the business wants the application to be "up and running" after a disaster occurs.

Both RTO & RPO can be in mins or hours, depending on the criticality of the application to the business. Based on these numbers we can decide for a strategy; among suggested approaches below.

Strategies

Backup and Restore:

Simplest of all, and probably the most cost-effective approach among the options. We take backups of the system regularly: a copy of the latest application code, DB backups, data backups, backed-up input files, etc., basically a copy of everything that is needed for restoring.

For this approach to work, we need both RPO and RTO to be in hours, because we will be restoring a new replica of the application from the backups that we take. Also, sometimes it can take hours just to retrieve the backups, if they are stored on tape or moved to Glacier (AWS), etc.

It's the cheapest option, but at the same time the slowest too.
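As a minimal sketch of automating the backup half, assuming a hypothetical RDS instance named orders-db in us-east-1 with us-west-2 as the DR region (all names, regions, and the account ID below are placeholders), a scheduled job could snapshot the database and copy the snapshot cross-region:

```python
import boto3
from datetime import datetime, timezone

SOURCE_REGION = "us-east-1"          # assumed primary region
DR_REGION = "us-west-2"              # assumed DR region
DB_INSTANCE_ID = "orders-db"         # hypothetical RDS instance name

stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
snapshot_id = f"{DB_INSTANCE_ID}-dr-{stamp}"

rds = boto3.client("rds", region_name=SOURCE_REGION)
# Take a snapshot of the running database.
rds.create_db_snapshot(
    DBInstanceIdentifier=DB_INSTANCE_ID,
    DBSnapshotIdentifier=snapshot_id,
)
# Wait until the snapshot is available before copying it cross-region.
rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier=snapshot_id)

# Copy the snapshot into the DR region so a restore is possible even if
# the primary region is completely lost.
dr_rds = boto3.client("rds", region_name=DR_REGION)
dr_rds.copy_db_snapshot(
    SourceDBSnapshotIdentifier=(
        f"arn:aws:rds:{SOURCE_REGION}:123456789012:snapshot:{snapshot_id}"  # placeholder account ID
    ),
    TargetDBSnapshotIdentifier=snapshot_id,
    SourceRegion=SOURCE_REGION,
)
```

Restoring is the mirror image: spin up a new instance from the latest copied snapshot in the DR region, which is exactly the part that eats into your RTO.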

Pilot light:

In this approach, we keep a minimal set of resources running in another environment or data center, ready to go live. We spin up the rest when it's needed.

Whatever takes longest to spin up is kept in standby mode, and the rest of the infrastructure we spin up when needed. For example, if your database takes the longest to spin up in another environment or to restore from backup, then it needs to be kept up on a small machine, but in sync with the prod data; so, when needed, we scale the system up and it's ready to go live. The rest of the infrastructure can be spun up using IaC templates like Terraform or CloudFormation (see the sketch below).

It's cheap to run this minimal infrastructure (plus operational costs), and we are not paying for the rest of the infrastructure until we need it.

(Figure: Pilot light DR approach)
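As a minimal sketch of a pilot light failover, assuming the always-on piece is a hypothetical RDS instance named orders-db kept on a small instance class in the DR region, and the rest of the stack lives in a CloudFormation template at an assumed S3 location, the cutover could look like:

```python
import boto3

DR_REGION = "us-west-2"              # assumed DR region
DB_INSTANCE_ID = "orders-db"         # hypothetical pilot-light database

rds = boto3.client("rds", region_name=DR_REGION)
# Scale the small, always-on database up to a production-sized class.
rds.modify_db_instance(
    DBInstanceIdentifier=DB_INSTANCE_ID,
    DBInstanceClass="db.r5.2xlarge",
    ApplyImmediately=True,
)

cfn = boto3.client("cloudformation", region_name=DR_REGION)
# Spin up the rest of the infrastructure (app servers, load balancer)
# from an IaC template, as described above.
cfn.create_stack(
    StackName="app-dr",
    TemplateURL="https://s3.amazonaws.com/my-dr-bucket/app-dr.yaml",  # assumed template location
    Capabilities=["CAPABILITY_IAM"],
)
```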

Warm stand-by:

This is just a further extension of the pilot light approach, wherein we keep more infrastructure in standby mode: lighter versions of the actual infrastructure, which are ready to become fully functional when we need them.

(Figure: Warm stand-by)

This adds additional cost, because the infrastructure running the light versions is always on, but at the same time it reduces the time required to spin everything up from scratch. So: more money, but a lower RTO compared to pilot light.
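As a minimal sketch, assuming the warm standby runs behind a hypothetical Auto Scaling group named app-asg-dr kept at one small instance, failover is mostly a matter of raising the capacity to production levels:

```python
import boto3

asg = boto3.client("autoscaling", region_name="us-west-2")  # assumed DR region

# Grow the scaled-down standby fleet to full production size; the light
# version is already running, so this is much faster than a cold start.
asg.update_auto_scaling_group(
    AutoScalingGroupName="app-asg-dr",  # hypothetical ASG name
    MinSize=4,
    MaxSize=12,
    DesiredCapacity=4,
)
```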

Multi-site active-active:

This is the costliest approach of all the solutions, but it has the shortest RTO/RPO. We run identical copies of the production infrastructure side by side, 24×7, both in active status, fronted by a load balancer. It is best suited for applications that are critical to the business.

(Figure: Multi-site active-active)
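As a minimal sketch of the traffic side, assuming hypothetical regional load balancer hostnames lb-east.example.com and lb-west.example.com and an assumed hosted zone ID, Route 53 weighted records can keep both sites active, and let us drain a failed region by dropping its weight to zero:

```python
import boto3

route53 = boto3.client("route53")

# Two weighted records for the same name send traffic to both active sites;
# during a disaster, setting one weight to 0 drains the failed region.
for identifier, target, weight in [
    ("east", "lb-east.example.com", 50),   # hypothetical regional LB hostnames
    ("west", "lb-west.example.com", 50),
]:
    route53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",        # assumed hosted zone ID
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": identifier,
                    "Weight": weight,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": target}],
                },
            }]
        },
    )
```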

Automate every operation possible: backups, replication, launching of applications, etc. Use tools like Terraform or Ansible, which help reach the "desired state" we need as soon as the infrastructure deviates from it. Use monitoring tools that can alert, notify, or remediate the DR situation auto-magically, as in the sketch below.
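As a minimal sketch of the detection piece, assuming a hypothetical SNS topic dr-alerts wired to on-call paging or to the failover automation itself (the ALB, target group, and topic ARN below are all placeholders), a CloudWatch alarm on load balancer health could trigger it:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # assumed primary region

# Fire when the load balancer has no healthy targets for three minutes;
# the SNS action can page on-call or invoke the failover automation.
cloudwatch.put_metric_alarm(
    AlarmName="app-no-healthy-hosts",
    Namespace="AWS/ApplicationELB",
    MetricName="HealthyHostCount",
    Dimensions=[
        {"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"},        # assumed ALB
        {"Name": "TargetGroup", "Value": "targetgroup/my-tg/0123456789abcdef"},  # assumed target group
    ],
    Statistic="Minimum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:dr-alerts"],  # hypothetical topic ARN
)
```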

Conclusion:

Whatever solution we choose, it needs to be tested thoroughly to ensure that it works during an actual DR situation. Monitoring tools tied to alerts/events, used to remediate the application back to its desired state automatically, make for a very significant and elegant process, and one that requires a lot of testing. Automating these strategies saves a lot of time, which directly translates into meeting RTO/RPO SLAs.

In most companies, the slowest factor is the process: getting an ECR or RFC approved to implement this or that in prod. Please include these processes when strategizing a DR solution, and test them too. A solution that depends on waking up the only person who has access is very bad design; it's a big no-no for DR.

Hope this helps you in understanding or coming up with a DR solution. Let me know if you have any comments or suggestions; I am happy to hear them.

References:

https://d1.awsstatic.com/whitepapers/aws-disaster-recovery.pdf

https://www.networkcomputing.com/data-centers/disaster-recovery-public-cloud
