How to prepare for and navigate a technology disaster

Technology emergencies can be the most stressful moments of an IT professional's career. But they don't have to be if you plan ahead.

When it comes to a technology disaster, you can never be too prepared.

Several years back I provided some tips on how to survive a critical system outage, and they remain relevant. They include staying calm, notifying users, handling the politics involved, proceeding in a methodical fashion, documenting the resolution steps, getting support, and staying confident.

Deep dive into disaster recovery

I revisited the topic, taking a deeper dive with Eric Dynowski, CTO at Server Central Turing Group, a cloud and colocation service. Dynowski has written about the vital importance of having a functional disaster recovery plan.

Scott Matteson: What are the common pain points with disasters?

Eric Dynowski: Unknown recovery steps are a major area of difficulty. It is common for an organization to not know what is required to return applications, data and/or connectivity to service during an outage.

Unclear lines of responsibility are also a negative factor in these scenarios. Often there isn't a common point of ownership [leadership] where all communications begin and end. This leads to multiple people taking multiple actions, often simultaneously, which compounds the problem and delays the resolution of the issue.

Finally, underestimating the amount of time service restoration takes is a big pitfall. When you have an outage, it will generally take (at least) twice as long as you expect to restore applications, data and/or connectivity. This results in increased costs, lost revenue, and a significant decrease in customer (and end-user) satisfaction as they wait for service to be restored 'soon.' The IT department's reputation for reliability is also at stake if staff are perceived as over-promising and under-delivering.

Scott Matteson: What are the most prevalent risks during an outage?

Eric Dynowski: Financial and reputation risks are the most prevalent. Any time you have system outages it's going to cost money, and it's going to negatively impact your reputation. 

Calculating the financial risk is relatively easy--as is understanding how much you can (or should) invest to minimize this risk. Calculating the reputation risk, however, is much more difficult. Many times organizations will focus on the impact on external customer satisfaction associated with an outage or disaster event. 

Customer satisfaction is a worthy risk to plan for and mitigate, but what is almost always overlooked is the impact on internal employee and end-user satisfaction. It is fairly common for employees who are negatively impacted by poor system performance or outages to "suddenly" leave for no apparent reason.

Scott Matteson: How should (or how will) disaster recovery tactics evolve over time?

Eric Dynowski: Disaster recovery will become less about having a plan and more about application and enterprise architecture. Instead of planning for what to do (should an event occur), planning will be done in advance to automatically mitigate outage situations. The speed with which these events are mitigated is (and will remain) solely based upon the level of investment made to address them.

Scott Matteson: Where is the technology headed in this space?

Eric Dynowski: Two key developments will have the largest impact on business continuity and disaster recovery planning. The first is serverless architecture. Using this term very loosely, the adoption of these capabilities will dramatically increase application and data portability and enable workloads to be executed virtually anywhere. We're still quite a way from this being the default way you build applications, but it's coming, and it's coming fast.

The second is edge computing. As modern applications and business intelligence are moved to the edge, the ability to 'fail over' to additional resources will increase, minimizing (if not eliminating) real and perceived downtime. The more identical places you can run your application, the better the level of availability and performance is going to be. This definitely isn't simple, but we're seeing (and developing) applications each and every day that are built with this architecture in mind, and it's game changing for enterprise and application architecture and planning.
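To make the point about running in more identical places concrete, here is a rough back-of-the-envelope sketch. It is not Dynowski's own calculation. It assumes that each location fails independently of the others and that failover is instant and perfect, which real deployments rarely achieve, so the numbers are an optimistic upper bound. The function names and the 99% figure are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(single: float, locations: int) -> float:
    """Availability of a service that is up if ANY one location is up.

    ASSUMPTION: locations fail independently and failover is instant.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if locations < 1:
        raise ValueError("locations must be at least 1")
    # The service is down only when every location is down at once.
    return 1.0 - (1.0 - single) ** locations


def expected_downtime_minutes(availability: float) -> float:
    """Expected minutes of downtime per (non-leap) year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical example: each location is available 99% of the time.
    for n in (1, 2, 3):
        a = combined_availability(0.99, n)
        print(f"{n} location(s): {a:.6f} available, "
              f"~{expected_downtime_minutes(a):.1f} min/year down")
```

Under these assumptions each extra location cuts expected downtime sharply, but correlated failures (a shared provider, a bad deployment pushed everywhere) would erode that gain.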

Scott Matteson: Do you have any other tips besides these?

Eric Dynowski: Understand and quantify the financial risk, down to the minute, of downtime for each application or business process. This isn't trivial, but it is relatively easily accomplished. Once you know the financial risk, you can easily determine the investment strategy necessary to mitigate it entirely or to narrow it to more acceptable levels.
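As one way to picture quantifying risk "down to the minute," the following is a minimal sketch rather than a method Dynowski describes. It assumes downtime cost is simply lost revenue plus the cost of idle staff. The `Application` fields, the example figures, and the choice to double the restoration estimate (following his earlier remark that recovery generally takes at least twice as long as expected) are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Application:
    """Hypothetical cost inputs for one application or business process."""
    name: str
    revenue_per_hour: float     # revenue lost while the app is down
    staff_affected: int         # employees unable to work
    staff_cost_per_hour: float  # loaded hourly cost per affected employee

    def cost_per_minute(self) -> float:
        # ASSUMPTION: cost = lost revenue + idle staff, spread evenly in time.
        hourly = self.revenue_per_hour + self.staff_affected * self.staff_cost_per_hour
        return hourly / 60.0


def outage_cost(app: Application, minutes: float) -> float:
    """Estimated cost of an outage lasting the given number of minutes."""
    return app.cost_per_minute() * minutes


if __name__ == "__main__":
    # Made-up figures for illustration only.
    apps = [
        Application("order-entry", 6000.0, 10, 60.0),
        Application("intranet-wiki", 0.0, 40, 45.0),
    ]
    estimated_minutes = 30
    for app in apps:
        planned = outage_cost(app, estimated_minutes)
        # Pad the estimate: restoration tends to take at least twice as long.
        padded = outage_cost(app, 2 * estimated_minutes)
        print(f"{app.name}: ${app.cost_per_minute():.2f}/min, "
              f"${planned:.2f} planned, ${padded:.2f} padded")
```

Comparing the padded figure with the cost of a mitigation gives a first sense of how much investment is justified for each application.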

Understand the internal risk. How does system downtime impact employees? Are they losing the trust of their management for factors beyond their control? Are they losing the trust of their customers because of their inability to serve them? This is significantly more difficult than quantifying financial risk as employees will need to be extremely honest in their evaluation of the impact of service disruptions. However, without this knowledge, you are dramatically increasing the potential costs associated with outages and disaster events.
