I scraped together a few simple functional requirements to test against.
- Display the home page within 5 seconds.
- Keep this onerous task up for 1 week.
- Deal with one user at a time.
Obviously this is a little ridiculous. A one-page website that deals with one customer at a time can be powered with a sheet of paper and a pack of crayons. A real description of reliability requirements for an enterprise will stretch to many pages.
I have to fake an operational environment and see what happens. I run my production service for a week, gather some numbers on performance and failure, and compare these measurements with the requirements.
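The comparison at the end of that week can be sketched in a few lines of Python. This is only an illustration, not part of my actual setup: the Check record, its field names and both function names are invented for the example, and it assumes each measurement is logged with the day of the trial and a response time in seconds (or nothing at all if the page never loaded).

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

MAX_RESPONSE_SECONDS = 5.0  # requirement: home page within 5 seconds
REQUIRED_DAYS = 7           # requirement: keep it up for 1 week


@dataclass
class Check:
    """One measurement (hypothetical record layout, assumed for this sketch)."""
    day: int                   # day of the trial, 1 to 7
    seconds: Optional[float]   # response time, or None if the page did not load


def failed_checks(checks: Sequence[Check]) -> List[Check]:
    """Return the checks that were too slow or did not load at all."""
    return [
        c for c in checks
        if c.seconds is None or c.seconds > MAX_RESPONSE_SECONDS
    ]


def meets_requirements(checks: Sequence[Check]) -> bool:
    """True if every day of the week was measured and no check failed."""
    days_covered = {c.day for c in checks}
    full_week = all(d in days_covered for d in range(1, REQUIRED_DAYS + 1))
    return full_week and not failed_checks(checks)
```

A single slow or missing page load anywhere in the week makes meets_requirements return False, which is deliberately strict. A real set of requirements would more likely state a tolerated percentage of failures.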
Predicting reliability before operation
I need a monitoring and alerting system that watches my service:
- from the inside, watching the components, and
- from the outside, checking what a client would see.
To watch my system from the inside, I use the open source application Cacti because it is free and my project budget is zero. I could instead use the basic CloudWatch metrics that AWS bundles with my EC2 machine at no charge.
Those are fine, but I am not entirely happy with the level of detail. I could enable detailed monitoring for a small fee, but with Cacti I don’t need to.
To watch my system from the outside, I use the cloud-based monitoring service Monitor.Us.
Watching the inside with Cacti
Cacti is an open source application that shows me the history of how much of my system’s resources were used. It produces graphs of system activity: CPU, network usage, number of users logged in and so on. These graphs show me what has happened in the last five minutes, the last few hours, the last week, and even the last year. A simple install of Cacti keeps an eye on just the EC2 machine where it is installed, but it can also watch hundreds of other machines.
I follow this procedure to start watching my system from the inside.
- Install Cacti to produce performance graphs.
- Extend Cacti’s monitoring to cover all my EC2 machines.
- Make the new service do something, using human testers or a synthetic load generator.
- Collect a week of graphs.
I now have my first view on whether any component is likely to fail. If I have problems already, I am probably going to have an unacceptable level of failure.
Cacti is annoying to install, in a way that only open source products can be. Surely it would not survive as a closed source product: no paying customer would spend good money to fiddle with config for hours. It all starts off so easily with sudo yum install cacti, then rapidly descends into SNMP configuration and confusion over missing graphs. I admire the idealistic lawyer Professor Eben Moglen, who said proprietary software is as ridiculous as proprietary math (although I did read that on Wikipedia, so he may have actually said “math is properly ridiculous”), yet even I don’t look forward to installing Cacti. Still, once past the pain barrier, it is a superb product that maintains plenty of easy-to-read summary graphs covering periods from 5 minutes up to 1 year.
(If you want the Cacti install cheat sheet, please say.)
Watching the outside with Monitor.Us
I need to check the response time over the Internet and make sure the system is satisfying my requirements. I can look for a pattern in my results to help me figure out the consistency of my service.
Monitor.Us follows the freemium marketing model. Like AWS CloudWatch, Monitor.Us provides the basics for free, which gets the attention of cheapskates like me, and charges for the clever stuff. For free, I can get a regular HTTP check of www.internetmachines.co.uk and a response time graph for the current day. (I actually want a week of graphs, which means I either have to pay close attention for a week or pay a little money.)
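To make the idea of an outside HTTP check concrete, here is a minimal Python sketch of a single timed request. It is a stand-in for illustration, not how Monitor.Us itself works. The names timed_check and default_fetch are invented for the example, the 5 second limit comes from my requirements, and the http:// scheme in the usage note is an assumption.

```python
import time
import urllib.request
from typing import Callable, Optional


def default_fetch(url: str, timeout: float) -> None:
    """Fetch the page once and read the whole body, then discard it."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def timed_check(
    url: str,
    limit_seconds: float = 5.0,
    fetch: Callable[[str, float], None] = default_fetch,
) -> Optional[float]:
    """Return the response time in seconds, or None if the request failed.

    The limit is passed to the fetch as its timeout. Network failures,
    HTTP errors and socket timeouts are all subclasses of OSError,
    so one except clause covers them.
    """
    start = time.monotonic()
    try:
        fetch(url, limit_seconds)
    except OSError:
        return None
    return time.monotonic() - start
```

Calling timed_check("http://www.internetmachines.co.uk") every few minutes from a machine outside AWS and logging the result would give a crude version of the response time graph. The fetch parameter exists only so the function can be exercised without touching the network.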
The Monitor.Us service can double up as an operational monitor, which satisfies another of my operational readiness requirements.
Measuring reliability during operation
A permanent reliability monitor puts numbers to the pain of failure. Over the lifetime of my system, I can record its performance and evaluate my data. Cacti will eventually show me a graph of the whole year’s performance.
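As a rough illustration of putting numbers to failure, the sketch below turns a run of pass/fail checks into an availability figure and a downtime estimate. The function names are invented for the example, and it assumes checks happen at a fixed interval, so each failed check counts as one whole interval of downtime.

```python
from typing import Sequence


def availability(results: Sequence[bool]) -> float:
    """Fraction of checks that succeeded, between 0.0 and 1.0."""
    if not results:
        raise ValueError("no checks recorded")
    return sum(1 for ok in results if ok) / len(results)


def downtime_minutes(results: Sequence[bool], interval_minutes: float) -> float:
    """Estimated downtime: each failed check counts as one full interval."""
    failures = sum(1 for ok in results if not ok)
    return failures * interval_minutes
```

With checks every 5 minutes, one failed check out of four gives an availability of 0.75 and an estimated 5 minutes of downtime. The estimate is only as fine-grained as the check interval: an outage that starts and ends between two checks is never seen.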
In theory, the more data I have, the better my picture of the system’s reliability, so I can improve my predictions with historical data. In practice, I have to be cautious. It only takes one small infrastructure change to remove the value of my measurements. Just because my service worked fine for years on IBM blades does not mean it will work fine on EC2 VMs.