The elements of business continuity planning

Whether you are creating a business continuity plan from scratch or updating an existing one, know what these elements mean to your recovery objectives.

Recovering from a business continuity event (BCE) requires system-by-system planning with thorough understanding of how each computer and network component impacts critical business outcomes. Consequently, business continuity planning (BCP) must include several documented, tested, and practiced sets of tasks that support:

  • System dependency mapping
  • Maximum tolerable periods of disruption (MTPOD)
  • Mean time to repair (MTTR)
  • Recovery time objectives (RTO)

Incorrect analysis of one or more of these BCP objectives will likely result in irreparable harm to your organization if a critical business process fails.

System dependency mapping

A system rarely stands alone; most systems are part of a set of technical components supporting one or more business processes. They act like an internal IT supply chain. Figure A depicts a set of systems (Sn) that support order entry (S1), order processing (S2), invoicing (S3), and shipping (S4). Orders entered in S1 are finalized by S2. S2 then feeds S3, where invoicing occurs, and S4 generates warehouse pick tickets. Each system must perform as designed to achieve the outcome the customer expects: delivery of product as promised. Consequently, the first step in BCP for a critical process is understanding all supporting systems.

Figure A

Internal System Supply Chain

Because of the space constraints of a blog article, I haven't included network components in this graphic. However, cabling, switches, routers, IPS/IDS, and other network devices sit between the systems. In addition, cloud services providing a complete system or system component are also important elements of your internal IT supply chain. Include all network and cloud services in your supply chain graphic. Your graphic will likely be much larger than mine.
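
A dependency map can also be kept in machine-readable form so that the impact of a failure is easy to query. The following is a minimal Python sketch, not a definitive tool. It assumes the linear S1 to S4 chain described above, and it adds a hypothetical network node named "core-switch" purely for illustration; the names DEPENDS_ON and impacted_by are my own.

```python
from collections import deque

# Assumed dependency map: each component lists what it needs in order to work.
# S1..S4 follow the chain in Figure A; "core-switch" is a hypothetical
# network component added only to show that non-system nodes belong here too.
DEPENDS_ON = {
    "S1": ["core-switch"],
    "S2": ["S1", "core-switch"],
    "S3": ["S2"],
    "S4": ["S3"],
    "core-switch": [],
}


def impacted_by(failed, depends_on=DEPENDS_ON):
    """Return every component that directly or indirectly depends on `failed`."""
    # Invert the map: component -> components that consume it.
    consumers = {name: [] for name in depends_on}
    for name, needs in depends_on.items():
        for needed in needs:
            consumers.setdefault(needed, []).append(name)

    # Breadth-first walk downstream from the failed component.
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for consumer in consumers.get(current, []):
            if consumer not in impacted:
                impacted.add(consumer)
                queue.append(consumer)
    return sorted(impacted)


print(impacted_by("S3"))           # ['S4']
print(impacted_by("core-switch"))  # ['S1', 'S2', 'S3', 'S4']
```

A failure of S3 stalls only shipping, while the loss of the hypothetical switch stalls the whole chain, which is why network and cloud components belong in the map.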

Maximum tolerable period of disruption

MTPOD, sometimes called maximum tolerable downtime, is the total time a business process can be inoperative before a business suffers irreparable harm. MTPOD includes the aggregate MTTR and process cycle time. Cycle time represents the period necessary to complete a single iteration of the affected business process from the point of failure. For example, if S3 (Figure A) failed, cycle time would be the period required to process all input from S2 and ship product.
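
The relationship between these values is simple arithmetic. Here is a short Python sketch of it; the hour values are invented for illustration, and the function names disruption_time and within_mtpod are assumptions of mine rather than standard terms.

```python
from datetime import timedelta


def disruption_time(aggregate_mttr, cycle_time):
    """Total outage as the business sees it: repair time plus one process cycle."""
    return aggregate_mttr + cycle_time


def within_mtpod(aggregate_mttr, cycle_time, mtpod):
    """True if repair plus one cycle finishes before irreparable harm sets in."""
    return disruption_time(aggregate_mttr, cycle_time) <= mtpod


# Hypothetical figures for the order process in Figure A.
mtpod = timedelta(hours=48)
aggregate_mttr = timedelta(hours=30)
cycle_time = timedelta(hours=12)

print(disruption_time(aggregate_mttr, cycle_time))      # 1 day, 18:00:00
print(within_mtpod(aggregate_mttr, cycle_time, mtpod))  # True
```

With these assumed numbers, 42 hours of disruption fits inside a 48-hour MTPOD, leaving six hours of margin.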

Mean time to repair

While MTPOD refers to the affected process, MTTR applies to individual components (system and network devices). It is the average time required to return a failed step in a process to normal operation. Aggregate MTTR is the average time required to restore all failed systems or components during a widespread business continuity event. MTTR is affected by many variables, including:

  • The type of failure. A cable failure is much easier to repair than a power supply failure.
  • Availability of replacement parts. Many organizations keep spare parts on hand for critical components, including cabling. If parts have to be ordered or installed by vendors, lead times, travel times, etc. are a necessary part of MTTR.
  • Internal monitoring capabilities and skill sets. How long does it take an IT team to identify a failure and determine its root cause? Proper staff training and maintenance of up-to-date system and network documentation provide critical support to this effort.
  • Availability of key internal personnel. Time of day, notification processes, and proper time-off management affect how quickly the staff needed to manage a business continuity event can arrive.
  • Maintenance agreements in place. The time it takes for vendor response and delivery of replacement parts is directly affected by formal agreements and SLAs.
  • The effectiveness of BCP, including disaster recovery.

Each component has an MTTR unique to your organization. Adjusting SLAs, documenting and practicing incident response, and ensuring key personnel are on call are examples of measures that can shorten MTTR.
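
If you log repair durations, per-component MTTR is just their average. The Python sketch below shows one way to compute it and to roll it up into an aggregate figure. The repair log is invented, and how to aggregate is an assumption on my part: summing models repairs done one after another, while taking the maximum models teams working fully in parallel. Your own recovery sequence will fall somewhere between.

```python
from statistics import mean

# Hypothetical repair log: hours from failure to normal operation,
# one entry per past incident.
REPAIR_HOURS = {
    "S2": [4.0, 6.0, 5.0],
    "S3": [2.0, 3.0],
    "core-switch": [1.0, 1.5, 0.5],
}


def mttr(durations):
    """Mean time to repair for one component, in hours."""
    return mean(durations)


def aggregate_mttr(failed, repair_hours=REPAIR_HOURS, parallel=False):
    """Estimated time to restore every component in `failed`.

    Assumption: sequential repairs add up; parallel repairs take as long
    as the slowest component.
    """
    per_component = [mttr(repair_hours[name]) for name in failed]
    if not per_component:
        return 0.0
    return max(per_component) if parallel else sum(per_component)


failed = ["S2", "S3", "core-switch"]
print(aggregate_mttr(failed))                 # 8.5
print(aggregate_mttr(failed, parallel=True))  # 5.0
```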

Recovery time objective

The RTO is the point at which failed devices must be operational, given process cycle time (see Figure B). The aggregate MTTR cannot exceed the RTO. If it does, the time to produce the required output will extend beyond the MTPOD. Disaster recovery exercises are a good way to test the RTO. If the process recovery period extends beyond the RTO, MTTR adjustments are necessary for one or more recovered process components.

Figure B
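
To make the exercise review concrete, here is a minimal Python sketch. It assumes the RTO is set at the latest possible point, MTPOD minus cycle time, and that the exercise recovered components one after another. The function names and the measured hours are hypothetical.

```python
def latest_rto(mtpod_hours, cycle_time_hours):
    """Latest point, in hours after failure, at which devices must be back."""
    rto = mtpod_hours - cycle_time_hours
    if rto < 0:
        raise ValueError("cycle time alone exceeds the MTPOD")
    return rto


def review_exercise(measured_hours, rto_hours):
    """Compare a disaster recovery exercise against the RTO.

    Returns (overrun_hours, candidates). When the RTO is missed, candidates
    lists components from slowest to fastest as places to look for MTTR
    adjustments. Assumes recoveries ran sequentially.
    """
    total = sum(measured_hours.values())
    overrun = max(0.0, total - rto_hours)
    candidates = []
    if overrun > 0:
        candidates = sorted(measured_hours, key=measured_hours.get, reverse=True)
    return overrun, candidates


rto = latest_rto(mtpod_hours=48.0, cycle_time_hours=12.0)
exercise = {"S2": 20.0, "S3": 12.0, "S4": 8.0}  # hypothetical measurements
print(rto)                             # 36.0
print(review_exercise(exercise, rto))  # (4.0, ['S2', 'S3', 'S4'])
```

In this invented exercise, recovery took 40 hours against a 36-hour RTO, so four hours must come out of one or more component MTTRs, starting with the slowest.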

Cloud services affect MTPOD

Controlling business continuity event planning is relatively easy when all components are in an organization's own data center. However, difficulties can arise if due diligence is not practiced during cloud service provider selection and contract negotiations. Figure C depicts what our example process might look like if order entry is moved to a provider. The organization no longer has direct control of the infrastructure, platforms, or software required to maintain process continuity.

Figure C

Moving one or more systems to the cloud provides one significant advantage to the organization: containment of catastrophic event effects. For example, the loss of this organization's data center requires recovery of S2, S3, and S4. However, the only recovery activity for S1 is restoration of connectivity to S2. This makes it much easier to reach the defined RTO.

Problems can arise when the failure is at the provider site. SLAs, sanctions, customer audits, and contractual obligations control and monitor the MTTR for S1 in our example. The reputation of the provider, supported by discussions with existing customers, is a good measure of the provider's willingness and ability to recover within the expected MTTR. In any case, a provider that cannot recover within RTOs for affected business processes is likely not the right solution for your business.

The final word

Recovery from business continuity events, those situations in which business processes fail, requires close attention to MTPOD. Adjusting the MTTR for all system and network components helps achieve the RTO element of MTPOD. MTTR depends on several factors, including quick detection of root causes as well as the availability and capability of recovery personnel.

Cycle time is another crucial factor when calculating MTPOD. The period necessary to produce the first set of outcomes, once teams restore a process, cannot exceed the remaining period between the RTO and the MTPOD endpoint. If it does, the likely solution is to reduce aggregate MTTR.

Moving one or more process components to the cloud can help reduce aggregate MTTR when a catastrophic event occurs. The right provider, governed by fair but aggressive customer oversight of SLAs and contractual requirements, can make recovery easier. The wrong provider can drive cloud-hosted component MTTR beyond RTOs. Choose wisely.

About

Tom is a security researcher for the InfoSec Institute and an IT professional with over 30 years of experience. He has written three books, Just Enough Security, Microsoft Virtualization, and Enterprise Security: A Practitioner's Guide (to be publish...
