## Probability and justifying system redundancy expenses (math problem)

I've been tasked with a system recovery project at my company. The system in question is our document management system. Without going into the details, I've determined a way to tell what the business impact would be (in financial as well as intangible terms) if any one of the components (e.g. servers) of the system should crash and burn. I've also been able to determine how that financial impact grows as the outage time increases. Lastly, I've determined recovery solutions for each of these components which, if implemented, would take the $ impact effectively down to $0. Obviously, the recovery solutions involve data replication, dual systems, and so forth.

The problem that I've run into is in justifying the expense of the recovery solution. I've come up with a way to look at it that I would like some feedback on from this forum.

Let's say that the outage cost is X, and the recovery solution cost is Y. The question is, is it worth spending Y to eliminate X? It all depends upon the probability (Z) that an event will take place causing X to occur, doesn't it? The way I've looked at the justification is just like what you would do to see if a certain bet makes sense.

Example: if I were told that I could win $6 if I roll a three on one roll of one die, how much would I be willing to bet? Well, the answer is obviously "no more than $1". In fact, $1 is my break-even bet. Over time, making that bet over and over, I'd neither win nor lose anything. Betting more than $1, I'd lose over time; betting less than $1, I'd win. So the formula is:

Possible Winnings * probability % = break-even bet

$6 * 1/6 = $1
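The bet math above is just an expected-value calculation. A quick Python sketch of it (the function name is my own):

```python
# Break-even stake for a bet: expected winnings = payout * probability.
def break_even_bet(winnings, probability):
    return winnings * probability

# $6 payout on rolling a three with a single die (a 1-in-6 chance):
print(break_even_bet(6.0, 1 / 6))  # ~1.0, the break-even bet
```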

Translating that same idea into my outage concern, the formula would be:

Outage event cost (X) * probability of event (Z) = justified breakeven expenditure to eliminate the event (Y)

I have the values for X. I can't find a reliable way to get values for Z anywhere. Mean Time Between Failures does not directly translate to outage probability. But since I do have the cost of what I'd need to spend on a recovery solution, what I've decided to do is make that Y, solve for Z, and call it the "breakeven probability". So:

Recovery solution cost / Outage event cost = breakeven probability %

What I'll then pose to the IT management staff is whether they believe that the REAL event probability % is more or less than my calculated breakeven probability %. If the REAL probability % is thought to be less than the calculated breakeven probability %, the recovery solution's cost is not justified. If the REAL probability is thought to be equal to or greater than the breakeven probability, you go ahead with the expense.
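To make the decision rule concrete, here's a small Python sketch of the breakeven-probability check (the function names are my own, and the example figures are made up for illustration):

```python
# Breakeven probability: the event probability at which the recovery
# spend exactly pays for itself (Y / X, i.e. solving the formula for Z).
def breakeven_probability(recovery_cost, outage_cost):
    return recovery_cost / outage_cost

# The expense is justified only if the REAL (estimated) probability
# meets or exceeds the breakeven probability.
def recovery_justified(real_probability, recovery_cost, outage_cost):
    return real_probability >= breakeven_probability(recovery_cost, outage_cost)

# Illustrative figures: a $50K solution against a $250K outage.
print(breakeven_probability(50_000, 250_000))     # 0.2, i.e. 20%
print(recovery_justified(0.05, 50_000, 250_000))  # False: 5% < 20%
```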

So, TechRepublic community, does this make sense to do, or is there another way that is as objectively justified? I've seen many other approaches to justifying recovery solutions, but most of them are based on subjective analysis and fear ("Oh my God, we can't have that system down or we're going to lose customers"). Almost everyone can figure out the impact, but how do you factor in the probability?

One more thing. Having gone through the analysis, it appears to me that there are very few justifiable system recovery solutions. For example, let's say that I know that if SQLSRVR-01 burned out, it would cost the business $500,000 while the system is being rebuilt and recovered from tape. I also know that I could spend $90K to build a cluster and prevent the outage from ever occurring. Using my math above, the breakeven probability is 18%. Well, I am damn certain that the chances of SQLSRVR-01 burning out in the next 3 years (the lifespan of the document management system) are far, far less than 18%. Dell would go out of business. It's more like .001%. Let's assume that I'm correct about the REAL probability being .001%. Guess what the projected loss would have to be before I'd build that cluster? $9,000,000,000. There aren't many businesses in which that expense is going to be justified.
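For what it's worth, the SQLSRVR-01 arithmetic checks out. A quick Python verification of the figures in the post:

```python
outage_cost = 500_000    # business cost while rebuilding/restoring from tape
recovery_cost = 90_000   # cost of building the cluster

# Breakeven probability = Y / X
breakeven_p = recovery_cost / outage_cost
print(breakeven_p)  # 0.18, i.e. 18%

# Turned around: at a REAL probability of .001% (0.00001), the outage
# loss needed to justify the $90K spend is Y / Z.
real_p = 0.00001
required_loss = recovery_cost / real_p
print(required_loss)  # ~9e9, i.e. about $9 billion
```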


In project management, risk analysis is very important, and the formulas look the same as what I've seen in recent classes.

If you're reasonably convinced the risks are low enough not to justify certain redundancies, take a broader view: are there any cheap, simple things you could do to mitigate the risks without duplicating systems?

E.g. buying spares of certain parts that are known to burn out, such as power supplies, disk drives, maybe even a cable or two if they could get cut accidentally, all with the idea of minimizing the time to get back into service.

And can you sign a service contract that specifies a quick time to fix the system, since it is critical, or do you fix all the hardware in-house? (Even things like WORM drives, if you use them, etc.)

Then document the recovery procedures, service contracts, etc. and familiarize multiple people with them, in case the person who knows the most is on vacation, quits, or is sick.

There are probably other things too, like beefing up physical security to reduce "people risk" to the system.

As a matter of practice, we do all the basic things you mention, such as backups, service contracts, and spare parts. Moreover, the backup is sent off-site. All of our hardware has dual power supplies, more than one drive, and usually more than one NIC. From the outset, I approached the analysis from a worst-case standpoint. Even though burn-outs of a server are rare, I never wanted to have someone come back to me and say, "OK, you've covered the basics, but what if the worst case happens?"

I suppose what could come out of this project is that we don't do anything new, but practice what we already do. That is, a recovery exercise.

