As part of my operational readiness preparation, I want to make sure my new cloud application is reliable. Will clients think my service is reliable? What if it fails? Is there an acceptable level of failure?

What is reliability, anyway?

Reliability is one of those abstract nouns with a meaning that is hard to pin down. It is important that all stakeholders agree on what reliability means. Otherwise, at the first sign of problems, they will start using the word “reliability” in sentences containing rude words.

Before I start testing, I have to define what reliability is. Reliability means different things to different people. For instance, the people who work on different parts of a system perceive its reliability in different ways.

  • A database administrator sees reliability as accurate data. He makes a store more reliable by normalizing its data, to remove redundant copies.
  • A network engineer sees reliability as guaranteed message delivery. She works with reliable protocols (TCP) and unreliable protocols (UDP).
  • A reseller of disk drives sees reliability as insurance for customers. He wants to convince customers of a disk’s reliability by advertising an MTBF (Mean Time Between Failures) of a million hours, or a long warranty period, or even a mysterious technology like SMART (Self-Monitoring, Analysis and Reporting Technology).
  • A researcher defines reliability as accurate web site content. The more up-to-date and impartial the information, the more reliable it is.
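A million-hour MTBF sounds like a disk that lasts over a century, which is not what the figure means. As a rough sketch, assuming a constant failure rate (the exponential model, which is my simplifying assumption and not a vendor’s claim), I can turn an advertised MTBF into the chance that one disk fails within a year of continuous use. The function name and the 8,760-hour year are my own choices for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def annual_failure_probability(mtbf_hours: float) -> float:
    """Chance that one unit fails within a year of continuous use.

    Assumes a constant failure rate (exponential model), so the
    probability of surviving t hours is exp(-t / MTBF).
    """
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


# The advertised figure from the disk reseller example.
p = annual_failure_probability(1_000_000)
print(f"{p:.2%}")  # 0.87%
```

Under that assumption, a million-hour MTBF works out to a little under one failure per hundred disks per year. That is a statement about a large population of disks, not a promise about the one in my server.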

I also have to be clear on what reliability is not.

  • If my service always works as intended but fails to deliver what customers want, reliability is perfect. Validity is the problem.
  • If Microsoft Azure falls over because of a leap year bug and my service disappears, that does not mean my service is unreliable. Availability is the problem.

I have to define the kind of reliability that matters most. The most important perception is usually the customers’, and it is tied to the service provider’s bottom line: how much does it cost when the service is not reliable?

Testing reliability

Figuring out the reliability of a system is a tough call. Getting to grips with the world of reliability is the job of the reliability engineer. Reliability engineering is a discipline for complex systems, like the ones found in the car industry, the military, and telecoms suppliers. Reliability engineering can predict the success of a system, measure its performance under test conditions, and count the cost of system failure.

If I create a B2C service where the clients are people, the key to success is gaining their trust. A customer’s point of view is subjective. If they trust my Internet service to deliver, that’s more important than empirical data. Reliability is an important factor in their trust.

If I create a B2B service where the clients are machines, it’s all about the data. The service must return a valid response in a few seconds to meet the SLA, and it must continue to do so, reliably, for its lifetime.
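For the B2B case, “valid and on time” is something I can count. The sketch below is one possible way to score a batch of recorded responses against an SLA. The Response record, the sla_compliance function, and the three-second limit are all hypothetical names and numbers I made up for illustration; a real SLA sets its own threshold.

```python
from dataclasses import dataclass


@dataclass
class Response:
    valid: bool      # did the reply pass validation?
    seconds: float   # how long the reply took to arrive


def sla_compliance(responses: list[Response], max_seconds: float) -> float:
    """Fraction of responses that were both valid and fast enough."""
    if not responses:
        raise ValueError("no responses to evaluate")
    ok = sum(1 for r in responses if r.valid and r.seconds <= max_seconds)
    return ok / len(responses)


# Four made-up observations: one invalid reply and one slow reply.
sample = [
    Response(valid=True, seconds=0.8),
    Response(valid=True, seconds=2.9),
    Response(valid=False, seconds=0.5),
    Response(valid=True, seconds=4.2),
]
rate = sla_compliance(sample, max_seconds=3.0)
print(rate)  # 0.5
```

A fast but invalid reply counts as a failure here, and so does a valid but slow one, because a client machine cannot use either.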

For this new service, I can keep my testing lightweight. My new Drupal customer service transports information. If my service fails, I know the only things I have to deal with are some corrupt data and damaged customer relationships. I am not dealing with compliance failure in a regulated industry, broken possessions from a failed goods transport, or injuries sustained from failed public transport.

The testing compromise

I don’t have the time to thoroughly test reliability before launching my service, so I am not using tools from reliability engineering. I am not building probability models, creating extreme environments, or even creating a test strategy. Reliability engineering is too heavy for my new service because there is a limit to the amount of time I want to spend on testing. The longer I take to test the reliability of my service, the more time and money it costs me.

In the Internet service world, the traditional development lifecycle for a new product has given way to iterative development. There used to be a large test phase for each product, followed by many deployments of a single version to customer sites. The new agile way includes continuous testing, a single deployment of the product as an Internet service, and a flood of version updates. This leaves enterprises in the difficult position of trading cost and delay against testing.

Reliability tests for my new Drupal service

Since my service is not yet operational I don’t have any real-world data to examine. I don’t know how reliable it will be: I can only predict reliability.

My infrastructure tests are a matter of observation. From my infrastructure perspective I want to see a reliable system, with stable processes, plenty of capacity and plenty of normal activity, and I want to see it stay that way for a while.
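To make “plenty of capacity, and staying that way” concrete, here is a minimal sketch of what one observation could look like, using only the Python standard library on a Unix-like host. The Snapshot record, both function names, and the thresholds (load per CPU below 1.0, at least 20% of disk free) are assumptions of mine for illustration, not recommended limits.

```python
import os
import shutil
from dataclasses import dataclass


@dataclass
class Snapshot:
    load_per_cpu: float   # 1-minute load average divided by CPU count
    disk_free: float      # fraction of the filesystem that is free


def take_snapshot(path: str = "/") -> Snapshot:
    """Record one observation of load and disk capacity (Unix only)."""
    usage = shutil.disk_usage(path)
    load_1min, _, _ = os.getloadavg()  # not available on Windows
    cpus = os.cpu_count() or 1
    return Snapshot(load_1min / cpus, usage.free / usage.total)


def stayed_healthy(snapshots, max_load_per_cpu=1.0, min_disk_free=0.2):
    """True only if every snapshot in the series is within both limits."""
    if not snapshots:
        return False  # no observations means nothing has been shown
    return all(
        s.load_per_cpu <= max_load_per_cpu and s.disk_free >= min_disk_free
        for s in snapshots
    )


print(take_snapshot("."))
```

Collecting a snapshot every few minutes and passing the whole series to stayed_healthy covers the “for a while” part: a single good reading proves little, but a long unbroken run of them is the kind of evidence I am looking for.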

In my next post I take a few simple measurements of my new service, from the inside and the outside.