Tech Tip: Determine an acceptable recovery time objective

Here's why you should determine an acceptable recovery time objective.

By Mike Talon

When crafting your organization's disaster recovery plan, one of the first things you must decide is the recovery time objective (RTO): the maximum amount of time that can pass between the failure of an application and the moment it becomes available to end users again after recovery.

Keep two things in mind. First, set these time measurements application by application, not server by server. Second, the clock starts when the application goes offline, and it doesn't stop until end users can effectively use the system again.
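To make the second point concrete, here is a minimal Python sketch of how the outage clock could be measured. The function names and the timestamps are hypothetical and chosen only for illustration; the point is that the time the server came back up plays no part in the calculation.

```python
from datetime import datetime, timedelta


def downtime(went_offline: datetime, usable_again: datetime) -> timedelta:
    """Elapsed outage time as the RTO clock measures it.

    The clock starts when the application goes offline and stops only
    when end users can effectively use it again.
    """
    if usable_again < went_offline:
        raise ValueError("usable_again must not be earlier than went_offline")
    return usable_again - went_offline


def met_rto(went_offline: datetime, usable_again: datetime, rto: timedelta) -> bool:
    """True if the outage stayed within the recovery time objective."""
    return downtime(went_offline, usable_again) <= rto


# Hypothetical outage: the server is back at 10:30, but users cannot
# work in the application again until 12:15.
offline = datetime(2024, 1, 1, 9, 0)
server_up = datetime(2024, 1, 1, 10, 30)  # deliberately unused by the RTO clock
usable = datetime(2024, 1, 1, 12, 15)
rto = timedelta(hours=3)

outage = downtime(offline, usable)
within_objective = met_rto(offline, usable, rto)
```

In this made-up case the server itself was down for only 90 minutes, yet the outage counted against the RTO is three hours and 15 minutes, so a three-hour objective would have been missed.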

Properly determining the RTO for any individual application might seem like a simple task--just ask the appropriate users of the data system how long they can conceivably live without the application. But in reality, this can be more of an art than a science, especially since end users of even the most mundane system will typically declare they can't go without it for more than a few seconds.

Often, getting users to agree to a larger margin of flexibility requires explaining that it would cost roughly the debt of a small nation to ensure such availability. Once you manage to get end users to decide on a reasonable estimate of how long they can go without their applications, you must combine the individual time estimates for each set of applications on a single server.

In the Windows world, where a server often hosts only one large application, this may be easy. However, in the Linux and UNIX worlds (including Solaris), there could be dozens of applications on each server.

You must first determine if any application takes precedence over the others on the same machine. If so, use that application's RTO, since it's the most important and will usually have the shortest RTO as well.

If that isn't the case, you must find the best balance between importance and recovery time, and select an RTO that reflects that analysis. If your budget allows, I recommend going with the shortest RTO on the machine. But if your budget is tight, consider using the mean of all the estimates.
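The selection rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the parameters, and the application names and hours are all hypothetical, and the judgment call about "importance" is reduced here to an optional priority application.

```python
from statistics import mean
from typing import Mapping, Optional


def server_rto(
    app_rtos: Mapping[str, float],
    priority_app: Optional[str] = None,
    budget_tight: bool = False,
) -> float:
    """Pick one RTO (in hours) for a server that hosts several applications.

    - If one application takes precedence, use its RTO.
    - Otherwise use the shortest RTO when the budget allows,
      or the mean of all estimates when the budget is tight.
    """
    if not app_rtos:
        raise ValueError("at least one application RTO is required")
    if priority_app is not None:
        return app_rtos[priority_app]
    if budget_tight:
        return mean(list(app_rtos.values()))
    return min(app_rtos.values())


# Hypothetical per-application estimates, in hours, for one server.
estimates = {"mail": 4.0, "payroll": 24.0, "intranet": 8.0}

shortest = server_rto(estimates)                            # budget allows
average = server_rto(estimates, budget_tight=True)          # tight budget
override = server_rto(estimates, priority_app="payroll")    # one app takes precedence
```

With these sample numbers, the shortest estimate is 4 hours and the mean is 12 hours. The gap between the two shows how much protection the tighter budget gives up for the most demanding application.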

Once you've determined the RTO, you can combine it with other factors such as recovery point objectives (RPOs) to determine which hardware and software solutions offer the closest match for your company's needs within its budget. This should give you adequate protection, budget permitting.
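One possible way to frame that comparison is sketched below. Everything in it is an assumption made for illustration: the `Solution` fields, the candidate names, the costs, and the scoring rule, which simply adds up the hours by which a candidate misses the RTO and RPO targets. A real evaluation would weigh many more factors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Solution:
    name: str
    rto_hours: float  # recovery time the solution is expected to deliver
    rpo_hours: float  # recovery point the solution is expected to deliver
    cost: float


def best_match(
    solutions: List[Solution],
    target_rto: float,
    target_rpo: float,
    budget: float,
) -> Optional[Solution]:
    """Return the affordable solution that comes closest to the targets.

    "Closest" is an assumed rule: the total hours by which a solution
    misses the RTO and RPO targets, with cost as the tie-breaker.
    """
    affordable = [s for s in solutions if s.cost <= budget]
    if not affordable:
        return None

    def shortfall(s: Solution) -> float:
        return max(0.0, s.rto_hours - target_rto) + max(0.0, s.rpo_hours - target_rpo)

    return min(affordable, key=lambda s: (shortfall(s), s.cost))


# Hypothetical candidates with made-up figures.
candidates = [
    Solution("tape backup", rto_hours=48, rpo_hours=24, cost=5_000),
    Solution("disk snapshots", rto_hours=6, rpo_hours=4, cost=15_000),
    Solution("real-time replication", rto_hours=1, rpo_hours=0.25, cost=40_000),
]

choice = best_match(candidates, target_rto=4, target_rpo=4, budget=20_000)
```

Here the only option that fully meets a four-hour RTO is out of budget, so the sketch falls back to the affordable candidate that misses the targets by the least.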

If not, you'll at least have a paper trail that clearly shows what you needed to protect the organization and that the lack of budget is why the protection fell short. Collecting and analyzing your RTO numbers is a great first step in establishing a disaster recovery plan--and it's one that's absolutely necessary.

Mike Talon is an IT consultant and freelance journalist who has worked for both traditional businesses and dot-com startups.