Probability and justifying system redundancy expenses (math problem) - TechRepublic
General discussion
February 16, 2006 at 04:13 PM
erikehlert

by erikehlert . Updated 20 years, 4 months ago

I’ve been tasked with a system recovery project at my company. The system in question is our document management system. Without going into the details, I’ve determined a way to tell what the business impact will be (in financial as well as intangible aspects), if any one of the components (e.g. servers) of the system should crash and burn. I’ve also been able to determine how that financial impact increases as the outage time increases. Lastly, I’ve determined recovery solutions for each of these components which, if implemented, would take the $ impact effectively down to $0. Obviously, the recovery solutions involve data replication, dual systems, and so forth.

The problem that I’ve run into is in justifying the expense of the recovery solution. I’ve come up with a way to look at it that I would like some feedback on from this forum.

Let’s say that the outage cost is X, and the recovery solution cost is Y. The question is, is it worth spending Y to eliminate X? It all depends upon the probability (Z) that an event will take place causing X to occur, doesn’t it? The way I’ve looked at the justification is just like what you would do to see if a certain bet makes sense.

Example: if I were told that I could win $6 by rolling a three on one roll of one die, how much would I be willing to bet? Well, the answer is obviously “no more than $1”. In fact, $1 is my break-even bet. Making that bet over and over, on average I’ll neither win nor lose anything. Betting more than $1 I’d lose over time, and betting less than $1 I’d win. So the formula is:

Possible Winnings * probability % = break even bet
$6 * 1/6 = $1
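
For anyone who wants to check the arithmetic, here is a minimal sketch of the dice bet in Python. The function name `break_even_bet` is made up for illustration, and it assumes the stake is simply lost on every roll, so the break-even stake equals the expected payout.

```python
from fractions import Fraction


def break_even_bet(winnings, probability):
    """Expected payout of one bet: the largest stake that does not lose money on average."""
    return winnings * probability


# One roll of a fair die, $6 paid out for rolling a three (probability 1/6).
stake = break_even_bet(6, Fraction(1, 6))
print(stake)  # 1
```

Using `Fraction` keeps 1/6 exact, so the result is exactly 1 rather than a rounded float.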

Translating that same idea into my outage concern, the formula would be:

Outage event cost (X) * probability of event (Z) = justified breakeven expenditure to eliminate the event (Y)

I have the values for X. I can’t find a way anywhere to get reliable values for Z. Mean Time Between Failures (MTBF) does not directly translate to outage probability. But since I do have the cost of what I’d need to spend for a recovery solution, what I’ve decided to do is make that Y, solve for Z, and call the result the “breakeven probability”. So,

Recovery solution cost (Y) / Outage event cost (X) = breakeven probability % (Z)

What I will then pose to the IT management staff is whether they believe that the REAL event probability % is more or less than my calculated breakeven probability %. If the REAL probability % is thought to be less than my calculated breakeven probability %, the recovery solution’s cost is not justified. If the REAL probability % is thought to be equal to or more than the calculated breakeven probability %, you do go ahead with that expense.
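
The comparison above can be sketched in a few lines of Python. This is only an illustration of the rule as I’ve described it: the function names are made up, and the dollar figures and the 25% estimate in the usage lines are placeholder values, not numbers from my analysis.

```python
def breakeven_probability(recovery_cost, outage_cost):
    """Z = Y / X: the event probability at which the recovery spend exactly pays for itself."""
    if outage_cost <= 0:
        raise ValueError("outage_cost must be positive")
    return recovery_cost / outage_cost


def recovery_justified(recovery_cost, outage_cost, estimated_probability):
    """True when the estimated REAL probability is at or above the breakeven probability."""
    return estimated_probability >= breakeven_probability(recovery_cost, outage_cost)


# Placeholder figures: a $20,000 recovery solution against a $100,000 outage.
print(breakeven_probability(20_000, 100_000))     # 0.2
print(recovery_justified(20_000, 100_000, 0.25))  # True: 25% is above the 20% breakeven
print(recovery_justified(20_000, 100_000, 0.05))  # False: 5% is below the 20% breakeven
```

The only input that needs judgment is `estimated_probability`, which is exactly the number I’d be asking management to weigh in on.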

So, TechRepublic community, does this make sense to do, or is there another way that is as objectively justified? I’ve seen many other approaches to justifying recovery solutions, but most of them are based on subjective analysis and fear (“Oh my God, we can’t have that system down or we’re going to lose customers”). Most everyone can figure out the impact, but how do you factor in the probability?

One more thing. Having gone through the analysis, it appears to me that there are very few justifiable system recovery solutions. For example, let’s say that I know that if SQLSRVR-01 burned out, it would cost the business $500,000 while the system is being rebuilt and recovered from tape. I also know that I could spend $90K to build a cluster and prevent the outage from ever occurring. Using my math above, the breakeven probability is $90,000 / $500,000 = 18%. Well, I am damn certain that the chance of SQLSRVR-01 burning out in the next 3 years (the lifespan of the document management system) is far, far less than 18%. Dell would go out of business. It’s more like 0.001%. Let’s assume that I’m correct on the REAL probability being 0.001%. Guess what the projected loss would have to be before I’d build that cluster? $90,000 / 0.00001 = $9,000,000,000. Not too many businesses in which that expense is going to be justified.
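
Here is the SQLSRVR-01 example worked through in Python, as a rough check of the two numbers above. The variable names are made up for illustration, and the .001% figure is my guess at the REAL probability, not a measured value.

```python
from fractions import Fraction

recovery_cost = Fraction(90_000)         # Y: cost of building the cluster
outage_cost = Fraction(500_000)          # X: cost of the outage while rebuilding from tape
real_probability = Fraction(1, 100_000)  # Z: assumed REAL probability, .001% = 0.00001

# Breakeven probability: Z = Y / X
breakeven = recovery_cost / outage_cost

# Outage cost needed to justify the spend at the assumed REAL probability: X = Y / Z
required_outage_cost = recovery_cost / real_probability

print(f"{float(breakeven):.0%}")          # 18%
print(f"${int(required_outage_cost):,}")  # $9,000,000,000
```

Exact fractions are used so that dividing by 0.00001 does not pick up floating-point rounding.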
