The two main types of disaster recovery systems

Differences between synchronous and asynchronous disaster recovery systems

For the past several years, demand has been growing for data protection that transcends the tape backup systems of days gone by. Such systems do exist, and they're both robust and mature enough to handle enterprise-class data protection needs. These solutions are also scalable enough that no matter how little or how much data you need to protect, you can find a product to suit your needs. This article will delve into the types of disaster recovery (DR) tools available today and see how they differ.

Defining DR systems
First, let’s nail down some definitions. DR is defined primarily as the protection of data in a secure facility (generally off-site from production machines) with the intent of saving the data in case of the loss of a data center or major data systems. DR by itself does not include failover capability, which is the domain of high availability (HA) systems. We'll discuss HA systems in detail in an upcoming column. Many DR products do bundle HA functionality, however, so if you are considering using both types of systems, keep that fact in mind.

Both HA and DR are part of the overall science of business continuity planning (BCP), which is the implementation of HA and DR for data systems, along with human resources and facilities management policies, to ensure that both your data and your employees are safe.

Many DR products are on the market today, so I won't look at specific packages. Instead, I'll go over the characteristics that most available products share. Generally speaking, DR systems are split into two main types, defined by the methodology used to replicate data from one location to another: synchronous and asynchronous data transfer systems.

Both types of DR system let you create up-to-the-second backup copies of your valuable production data in another physical location. This allows the data to survive intact if the data center is lost for some reason, such as a flood or fire. Unlike data in tape backup systems, the replicated data is current and in a usable format, as it is already on a disk system and not stored on a tape that must first be restored to disk. A data center in Houston can be protected by a data center in Dallas, for example, allowing systems and people to be moved to another location and then resume operations with a minimum of recovery issues.

How synchronous systems work
Synchronous systems (also called “two-phase commit systems”) are designed to make sure that no I/O transaction can be committed to the disk of the primary system unless and until it has also been committed to the disk systems of the backup system. Most of these systems are hardware based and involve the use of attached storage, like NAS or SAN systems. However, software-based synchronous systems are also available today. Figure A shows a typical software-based synchronous DR system.

Figure A

When an I/O request is initiated by any application on the primary system, that request is sent to the backup disk systems first (red line) and committed there. The system then waits for the confirmation of that commit to return from the backup disk systems (green line). Only then is the I/O committed to the primary disk systems (blue line). This ensures that nothing can be committed to the primary system unless it already exists on the backup.
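The ordering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product is built: the Disk class and the synchronous_write function are hypothetical names, and plain in-memory lists stand in for the real disk systems.

```python
class Disk:
    """In-memory stand-in for a disk system (hypothetical, for illustration only)."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.blocks = []

    def commit(self, data):
        # An unreachable disk system cannot confirm the commit.
        if not self.available:
            raise ConnectionError(f"{self.name} did not confirm the commit")
        self.blocks.append(data)
        return True  # confirmation of the commit


def synchronous_write(primary, backup, data):
    """Commit on the backup first, wait for confirmation, then commit locally."""
    confirmed = backup.commit(data)   # red line: send to the backup and commit there
    if not confirmed:                 # green line: the confirmation must come back
        raise ConnectionError("no confirmation from backup")
    primary.commit(data)              # blue line: only now commit on the primary


primary, backup = Disk("primary"), Disk("backup")
synchronous_write(primary, backup, "txn-1")

# Simulate losing the link to the backup site.
backup.available = False
try:
    synchronous_write(primary, backup, "txn-2")
except ConnectionError:
    pass  # the write is refused, so the primary never gets ahead of the backup
```

After the failed second write, both lists still hold only "txn-1", which is the guarantee a synchronous design is paying for: the application's write fails rather than landing on the primary alone.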

While synchronous methodologies provide exceptional data protection and ensure that both disk systems are identical at all times, they have several drawbacks. These systems are generally much more expensive than even the best asynchronous systems, often costing millions of dollars to implement properly. In addition, because of the nature of the two-phase commit technology, I/O response time is much slower than normal for any given application, and severe distance limitations are put into play.

Generally, the systems must be connected via SCSI or fiber connectivity, meaning a maximum of about 10 kilometers can exist between the primary and backup disk arrays. This also means that (with a few exceptions) both primary and secondary systems must be on the same logical and physical segments. Some software-based synchronous systems do allow for DR across WAN segments, but this slows I/O response even more, as the commit signals must come back across the WAN.
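To put rough numbers on the distance penalty, here is a back-of-the-envelope calculation. It rests on assumptions that are mine, not figures from any vendor: light travels through fiber at roughly 200,000 kilometers per second, the application issues one commit at a time, and switching, protocol, and disk time are ignored. The 400-kilometer link is a hypothetical WAN distance chosen for comparison.

```python
# Assumption: signal speed in optical fiber, roughly two-thirds the speed of light.
FIBER_KM_PER_SECOND = 200_000


def round_trip_ms(distance_km):
    """Time for a commit to reach the backup and the confirmation to return."""
    return 2 * distance_km / FIBER_KM_PER_SECOND * 1000


def max_serial_commits_per_second(distance_km):
    """Upper bound when each commit must wait for the previous confirmation."""
    return 1000 / round_trip_ms(distance_km)


for km in (10, 400):
    print(f"{km} km: {round_trip_ms(km):.1f} ms round trip, "
          f"at most {max_serial_commits_per_second(km):.0f} serial commits per second")
```

Under these assumptions the propagation delay alone is about 0.1 ms at 10 kilometers and about 4 ms at 400 kilometers. Real links add equipment and protocol overhead on top, so treat these as best-case floors rather than predictions.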

However, if your application has a Recovery Point Objective (the amount of data that can be lost during a failure) of zero bytes, synchronous systems are the only choice available, and they do a good job in this space. Financial trading data is a good example of a situation where such systems may be required, depending on the amount of money that could be lost with the loss of a single kilobyte of data.

How asynchronous systems work
For most applications and businesses, asynchronous DR technologies offer a much more cost-effective—and still quite sufficient—solution. Figure B offers a view of the typical asynchronous system.

Figure B

These systems are generally software-based and reside on the host server rather than on the attached storage array. They can protect both local and attached disk systems. In an asynchronous system, I/O requests are committed to the primary disk systems immediately (blue line) while a copy of that I/O is sent via some medium (usually TCP/IP) to the backup disk systems (red line). Since there is no waiting for the commit signal from the remote systems, these systems can send a continuous stream of I/O data to the backup systems without slowing down I/O response time on the primary system.
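As a contrast with the synchronous flow, the sketch below shows the asynchronous write path. It is a simplified model under my own assumptions: AsyncReplicator is a hypothetical class, in-memory lists stand in for the disk systems, and the ship method is called by hand where a real product would stream over TCP/IP continuously in the background.

```python
from collections import deque


class AsyncReplicator:
    """Toy model of host-based asynchronous replication (hypothetical)."""

    def __init__(self):
        self.primary = []        # primary disk systems
        self.backup = []         # backup disk systems at the DR site
        self.pending = deque()   # copies of I/O waiting to be sent

    def write(self, data):
        self.primary.append(data)   # blue line: commit locally right away
        self.pending.append(data)   # queue a copy for the backup (red line)
        # No wait for the remote side, so the caller returns immediately.

    def ship(self, count=None):
        """Send queued I/O to the backup, oldest first."""
        n = len(self.pending) if count is None else min(count, len(self.pending))
        for _ in range(n):
            self.backup.append(self.pending.popleft())


replicator = AsyncReplicator()
for txn in ("txn-1", "txn-2", "txn-3"):
    replicator.write(txn)    # all three commit locally without any remote wait

replicator.ship()            # the backup catches up afterward
```

The key design difference is that write never touches the network. The price is the pending queue: whatever is still in it when the primary fails has not reached the backup.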

Most asynchronous systems have some methodology to make sure that if something is lost in transmission, it can be resent. Some can also make sure that transactions are written to both disks in the same order, which is vital for database-driven applications. In addition, since the usual method of transmission is TCP/IP, these systems have no real distance limitations, and there's no limit to splitting the primary and backup systems across WAN segments or subnets.
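One common way to get both resend and write ordering is to number each I/O and have the backup apply them strictly in sequence. The sketch below illustrates that general idea only; BackupReceiver and its fields are hypothetical names, and I am not claiming that any specific product works this way.

```python
class BackupReceiver:
    """Applies numbered writes in order and reports what it expects next (hypothetical)."""

    def __init__(self):
        self.next_seq = 0    # sequence number of the next write to apply
        self.held = {}       # writes that arrived early, keyed by sequence number
        self.applied = []    # writes committed to the backup disk, in order

    def receive(self, seq, data):
        if seq >= self.next_seq:
            self.held[seq] = data    # keep it; duplicates of applied writes are ignored
        # Apply every write that is now contiguous with what is already on disk.
        while self.next_seq in self.held:
            self.applied.append(self.held.pop(self.next_seq))
            self.next_seq += 1
        return self.next_seq         # acknowledgment: the sender resends from here


receiver = BackupReceiver()
receiver.receive(0, "insert row")
ack = receiver.receive(2, "commit")      # write 1 was lost in transit
# ack is still 1, so the sender knows to resend write 1.
receiver.receive(1, "update index")      # the resend fills the gap; 1 and 2 are applied
```

Because write 2 is held back until write 1 arrives, the backup never commits a later database write ahead of an earlier one, which is the ordering property that matters for database-driven applications.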

The main drawback is the potential for a few transactions to be lost during a failover event. If the primary server suddenly goes offline, anything waiting to be transmitted to the backup system will be lost. However, since this usually involves only a few transactions (and a few bytes of data), the data protection that asynchronous systems provide is well within the required parameters of almost all business applications. For example, Exchange, SQL Server, and Oracle systems can easily recover during a failover event using these systems without the need for advanced recovery operations.
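The exposure at the moment of failure is simply the difference between what the primary has committed and what the backup has received. A minimal sketch, assuming writes are replicated in the order they were committed (lost_on_failure is a hypothetical helper, and the transaction names are made up):

```python
def lost_on_failure(written, replicated):
    """Return transactions committed on the primary but not yet on the backup.

    Assumes in-order replication, so the backup holds a prefix of the primary's writes.
    """
    return written[len(replicated):]


written = ["txn-1", "txn-2", "txn-3", "txn-4", "txn-5"]   # committed on the primary
replicated = ["txn-1", "txn-2", "txn-3"]                  # received by the backup
lost = lost_on_failure(written, replicated)               # still in flight at failure
```

Here the two in-flight transactions are the ones that would need to be re-entered or recovered by other means after failover.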

In addition to server-to-server replication, asynchronous solutions can allow you to send multiple replication streams from multiple primary servers to a single DR server, known as a many-to-one configuration. This methodology allows for protection of the data without the expense of obtaining duplicate hardware for each primary server in the DR location. Adding a SAN or NAS system into the mix at the DR site can further reduce overall TCO. Many of the new NAS systems are being shipped with replication software already in place or ready to be installed directly onto the storage device itself, allowing it to act as the host to receive replication data at the DR site.
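A many-to-one configuration can be pictured as a single receiver that keeps each primary server's stream separate. The sketch below is illustrative only; DRServer and the server names are hypothetical, and a dictionary of lists stands in for per-source storage at the DR site.

```python
from collections import defaultdict


class DRServer:
    """Single DR host receiving replication streams from many primaries (hypothetical)."""

    def __init__(self):
        # One replica area per source server, so streams never mix.
        self.replicas = defaultdict(list)

    def receive(self, source, data):
        self.replicas[source].append(data)


dr = DRServer()
dr.receive("mail-01", "mailbox write")
dr.receive("db-01", "table update")
dr.receive("mail-01", "log write")
```

Keeping the streams keyed by source is what lets one machine at the DR site stand in for several primaries without duplicating hardware for each of them.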

Next step
While these systems take care of protecting your data well beyond the tape backup standard, we have not yet discussed how to keep the servers themselves up and running in the event of failure or emergency. The next article of this series will discuss such high availability systems in depth to help you to create a complete solution for your business continuity needs.
