As a system administrator who lives near the office, I’m the go-to guy when something breaks. I’m highly focused on disaster recovery strategies that can help me navigate the unique complications of the pandemic, which has introduced new requirements and restrictions.
I discussed the concept with Jennifer Curry, SVP of Product and Technology at cloud and colocation provider INAP; Nicholas Merizzi, principal, Deloitte Consulting LLP; and Andrew Morrison, a principal specializing in cybersecurity, Deloitte & Touche LLP, to get some tips on how to successfully navigate these unfamiliar waters.
Scott Matteson: What are some of the specific concerns regarding disaster recovery during a pandemic?
Jennifer Curry: The risk of enacting disaster recovery (DR) during a pandemic is more about people than about solutions and services. Unlike a natural disaster, where you have to be concerned about where your primary and DR environments are and what type of restore you will perform, the pandemic impacts “people resources.”
Will you have the right people available to you to restore your environment or enact your DR plan? Do they have the right access already (since they are mostly remote now)? This is where we can ensure your Managed Service Provider (MSP) is able to assist with controlling your runbook, pressing the “easy button” to bring up the DR site, etc.
Nicholas Merizzi: During this pandemic many organizations across several industries have experienced unprecedented disruption ranging from supply chain challenges to employee productivity. Technology leaders should ensure their business continuity procedures can function in an all-virtual world. This means reviewing existing crisis management and communication platforms to account for working remotely.
For a successful DR, you may need to physically move, install, configure, and activate IT infrastructure. So, first, are the right people with the right skills actually available, healthy, and able to get to a technology facility? Second, are they able to access and enter an office or data center, and is it possible to work safely and maintain the correct COVID-19 protocols within that space? Establishing alternate contacts in the event of health issues is also critical during a pandemic. One assumption that is core to recovery is that the right people will be available. In addition, ensuring a robust suite of scalable productivity software to enable your virtual workforce in the event of a DR will be key.
Andrew Morrison: From a cyber perspective, disaster recovery during a pandemic raises new challenges as well. The rapid expansion of remote work introduces new vulnerabilities. Many organizations have relaxed perimeter security controls to allow remote connectivity, introducing new threat vectors that threat actors can exploit to gain access to networks.
Lately, many of these attacks have focused on ransomware and data destruction, which encrypt data and often corrupt critical backup systems, rendering existing disaster recovery plans unusable. An “all hands on deck” approach to manual recovery is often the only response to these conditions. Unfortunately, social distancing protocols and remote work arrangements can make those manual recovery efforts an impossibility.
Scott Matteson: What are some examples of real-life disasters which have occurred? What was the impact?
Jennifer Curry: Years ago, the New Orleans Civil District Court’s system crashed and wiped out more than 150,000 digital records, some dating back to the 1980s. The court had a cloud-based backup system in place, but unbeknownst to them, the installation had failed during an upgrade years prior. The result: New Orleans Civil District Court lost not only its data and records but also the ability to search for books, and it incurred more than $300,000 in costs to repair the damage.
As for natural disasters, the recent California fires highlight that typical or seasonal natural disasters aren’t the only threats. As we watch these unprecedented fires, businesses in the state should understand how quickly they can fail over and at what point they should proactively bring up their DR site. Obviously, we should always stress regular testing of your DR plan, but knowing the point at which you are comfortable making the call is equally important.
Nicholas Merizzi: IT disaster recovery generally falls into one of two categories: a natural disaster event (earthquake, flood, etc.) or a system failure (such as failures in hardware, software, or electrical systems). This year, the actual DR responses we have witnessed have involved local or regional power outages or power infrastructure issues. We have seen this across multiple industries, including financial services, with outages during peak customer windows and prolonged recovery times.
Andrew Morrison: Recently, the size and frequency of destructive data cyberattacks have increased substantially. These attacks differ from natural disasters in how they occur, but the result is very similar in that whole data centers and entire IT operations can be crippled.
Very public attacks such as the NotPetya attacks, which crippled major shipping and logistics companies, left IT systems almost completely destroyed. While most disaster recovery and business continuity plans contemplate the loss of systems, applications, or even whole data centers, they only rarely account for a scenario where all data centers across the globe and all systems are rendered useless. We’ve seen industry reports that the operational costs that NotPetya drove exceeded $300 million per affected organization.
Scott Matteson: What are the special challenges involving data centers?
Jennifer Curry: Data centers are not immune from damage resulting from natural disasters. There’s no way to completely predict or protect from threats like fires, earthquakes or hurricanes without some kind of disruption. That’s why it’s important to make sure your data center has multiple levels of redundancy for all critical systems. But even with proper redundancies and risk management in place, there’s always some risk of downtime.
Cloud backups are still a valid option, but we strongly recommend the multi-layer approach to DR (backups, standby site, hot site, etc.) as DR isn’t one-size-fits-all, even within a single enterprise. Business systems have varying levels of significance to the continuing operations of an enterprise, and the DR plan should account for that. Doing so not only creates the best financial model for DR but also ensures that you aren’t wasting precious time bringing up applications or processes that aren’t truly critical when you must run in a failover environment for hours (or days).
Nicholas Merizzi: We would point to three challenges that continue to strain data center disaster recovery capabilities. First is the nature of the applications themselves. Data centers become a disaster recovery issue when applications are dependent on a given set of hardware or location and are unable to seamlessly process elsewhere. As we shift to a more hybrid cloud and microservices architecture, applications are intrinsically more distributed in nature. Components of an application might reside on one cloud provider while other functionality is delivered by third-party services. Ensuring these applications can function in a secondary site has added complexity for IT leaders.
The second challenge involved in DR is the lack of muscle memory. We see organizations spend significant budgets on IT, yet they do not spend enough time building the organizational muscle memory to ensure they can fail over. Annual and semi-annual testing is required to ensure that applications can effectively be brought back online to support critical business functions.
Lastly, we are seeing clients increasingly trying to protect against cyber threats. One of the challenges with traditional DR is that data is continuously replicated to ensure nothing is lost, which means corrupted or encrypted data can be replicated as well. So how can organizations protect themselves from the threat of destructive cyberattacks? We have seen clients shift gears to augmenting DR by building out isolated “cyber recovery vaults” to protect against cyberattacks focused on destroying critical data and the associated backups.
Andrew Morrison: Another major challenge with cyberattacks is the lack of clarity around when recovery can begin. With a natural disaster or outage, it is often clear that recovery can begin almost immediately after the event has passed or the outage is detected. A cyberattack often requires lengthy investigation and forensics to determine whether the threat persists as well as the scale and scope of the attack. These investigations can take days, weeks, or even months. Recovery of data center assets may not be possible until it is clear that the attack has been remediated and will not reinfect newly recovered systems or data centers.
Scott Matteson: What should companies be doing now?
Jennifer Curry: Testing! Most companies already have an IT business continuity plan in place. But how many have actually tested it to make sure it’s still viable? Don’t wait until a disaster strikes to discover gaps.
Nicholas Merizzi: One of the common pitfalls that companies fall into is spending too much time assessing technology and the associated vendors. Companies should spend time understanding what is most important during an extended period of downtime. Understanding the needs of the business will help establish the right priorities and guide your assessment of DR technologies.
Andrew Morrison: It’s critical for companies to develop and enhance scenario plans and actively test overall responses for unlikely but highly impactful scenarios. Testing how to recover IT systems as well as how to recover all business operations in the wake of an extended, existential-type disaster is key. For example, the anomalous COVID-19 pandemic was not well-envisioned or tested by most organizations, resulting in longer recovery times and greater efficiency losses than might have occurred with better planning.
Scott Matteson: What should IT departments be doing now?
Jennifer Curry: Run a business impact analysis to assess cost of key infrastructure downtime and prioritize Tier 1 applications. Impact analyses usually include the following:
- Potential threats (hurricanes, earthquakes, fire, server failures, etc.)
- Probability of the threat occurring
- Human impact
- Property impact
- Business impact
We actually provide a free Business Impact Analysis Template for companies to customize and use.
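To illustrate how the factors listed above could feed a prioritization, here is a minimal sketch in Python. It is not INAP’s template or any standard method: the 1-to-5 scales, the `Threat` class, the scoring formula (probability multiplied by the sum of the three impact ratings), and the sample numbers are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Threat:
    """One row of a hypothetical business impact analysis (1-5 scales assumed)."""
    name: str
    probability: int       # 1 = rare, 5 = frequent
    human_impact: int      # 1 = negligible, 5 = severe
    property_impact: int
    business_impact: int

    def risk_score(self) -> int:
        # Assumed formula: likelihood weighted by the combined impact ratings.
        return self.probability * (
            self.human_impact + self.property_impact + self.business_impact
        )


def rank_threats(threats: List[Threat]) -> List[Threat]:
    """Return threats ordered from highest to lowest risk score."""
    return sorted(threats, key=lambda t: t.risk_score(), reverse=True)


# Illustrative ratings only; real values come from your own analysis.
EXAMPLE_THREATS = [
    Threat("hurricane", probability=2, human_impact=4, property_impact=5, business_impact=4),
    Threat("server failure", probability=4, human_impact=1, property_impact=2, business_impact=4),
    Threat("fire", probability=1, human_impact=5, property_impact=5, business_impact=5),
]

if __name__ == "__main__":
    for threat in rank_threats(EXAMPLE_THREATS):
        print(f"{threat.name}: {threat.risk_score()}")
```

With these made-up ratings, a frequent but modest server failure outranks a rare but severe fire, which is the kind of trade-off an impact analysis is meant to surface.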
Nicholas Merizzi: CIOs should have resiliency as a core design principle that permeates all levels of the organization. In particular, IT departments today should ensure they have a strong understanding of their IT infrastructure and application landscape. Establishing a strong understanding of the linkages between business functions and underlying supporting applications will facilitate engaging with the business.
Strong IT asset management with automated discovery and a healthy configuration management database (CMDB) of underlying dependencies will greatly improve an organization’s ability to maintain a functional DR. In addition, IT departments should ensure that business continuity remains at the forefront by engaging business continuity (BC) and DR teams in major modernization efforts to certify that digital strategies embrace DR and do not put the organization at risk.
Andrew Morrison: Identify critical data and systems and create an offline storage strategy for them. Many disaster recovery systems today have intentionally been designed to be online or cloud-based so that they’re more immune to physical disaster. Unfortunately, online and cloud-based disaster recovery systems can leave organizations more vulnerable to cyberattacks that leverage the fast replication of data backups, allowing the corruption and encryption of a data destruction attack to spread very quickly and with widespread impact.
Creating an isolated recovery solution that preserves critical data and business processes in an offline, immutable storage area can protect against these types of devastating cyberattacks.
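One small piece of such a strategy can be sketched in code: checking that a backup copy still matches a checksum manifest recorded earlier and kept somewhere the attacker cannot reach. The following Python sketch is only an illustration under assumptions, not a description of any vendor’s recovery vault. The function names and the idea of keeping the manifest offline are hypothetical; only the standard-library `hashlib` and `pathlib` modules are used.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(backup_dir: Path) -> Dict[str, str]:
    """Record a checksum for every file; store the result offline (assumption)."""
    return {
        path.relative_to(backup_dir).as_posix(): sha256_of(path)
        for path in sorted(backup_dir.rglob("*"))
        if path.is_file()
    }


def verify_backup(backup_dir: Path, manifest: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means every file still matches."""
    problems = []
    for relative_name, expected in manifest.items():
        target = backup_dir / relative_name
        if not target.is_file():
            problems.append(f"missing: {relative_name}")
        elif sha256_of(target) != expected:
            problems.append(f"changed: {relative_name}")
    return problems
```

A check like this only detects tampering or silent failure. It does not make storage immutable, so it would complement rather than replace an isolated recovery copy.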
Scott Matteson: What should employees be doing now?
Jennifer Curry: Communicating with IT. Truly understand the plan and communicate your critical processes and systems. Don’t take for granted that something devised a few years ago still applies. And be diligent on your own to secure the data most critical to you. (Do you have all of your files saved per the IT policy to ensure they are backed up?)
Nicholas Merizzi: One of the biggest challenges during a real-life event is finding yourself in a situation where key personnel do not understand their roles in the overall process. Ensuring that all stakeholders are aware of their duties and have designated backups who understand their roles is key to overall success.
Andrew Morrison: Be aware of “out-of-band” disaster recovery communication options that exist to conduct business in an alternative way. Most disaster recovery plans rely on relaying information to employees via email, for example, but during an event, even corporate communications via email to employees can become challenging. We have seen many disasters during which the seemingly simple process of contacting employees or management is difficult, as access to all systems that contain contact information or allow data is made unavailable.
Scott Matteson: How should businesses that have been devastated by natural disasters get back on their feet?
Jennifer Curry: If you have a successful DR site, you don’t have to rush back to production. If your DR plan didn’t go well, now is the time to re-architect and reset. Don’t put too much distance between the disaster and updating your DR strategy.
Nicholas Merizzi: Businesses should continue to expect a wide range of unpredictable events to impact operations and should therefore always design with resiliency in mind. While one cannot prevent all possible failure scenarios, identifying weaknesses and hardening them can improve confidence in the system in the event of a future disaster. Technology teams should embrace new cloud-native software development principles. We continue to see an increase in the adoption of roles such as chaos engineer, in which faults are proactively injected into the ecosystem to understand its behavior.
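As a rough illustration of the fault-injection idea mentioned above, the Python sketch below wraps an operation so that it fails at a configurable rate, then exercises a simple retry loop against it. It is a toy under stated assumptions, not a chaos engineering tool: `with_faults`, `call_with_retries`, and `InjectedFault` are hypothetical names, and real practice targets running infrastructure rather than a single function.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of the real call to simulate a failure."""


def with_faults(func: Callable[..., T], failure_rate: float, rng: random.Random) -> Callable[..., T]:
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so 0.0 never fails and 1.0 always fails.
        if rng.random() < failure_rate:
            raise InjectedFault("simulated outage")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func: Callable[[], T], attempts: int) -> T:
    """Call func up to `attempts` times, re-raising the last injected fault."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as error:
            last_error = error
    raise last_error
```

Passing a seeded `random.Random` keeps an experiment repeatable, so the same sequence of injected failures can be replayed after a fix.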
Andrew Morrison: It is also important to understand the ecosystem of third-party business partners that may be able to assist in rebuilding your organization’s data and systems. Proactively identifying which of your organization’s partners could temporarily assume some operations and/or contractual obligations can accelerate how fast you can stabilize your organization and return to business as usual.
Given the data sharing that occurs between trusted third parties, large amounts of your organization’s information may be available from your third-party relationships that could be used to rebuild some lost data.