My enterprise web service must be monitored all day, every day, for its whole life. Monitoring is one of my 12 principles of operational readiness.
If you get monitoring right, you will be able to prove how wonderful your service level is, track down errors before customers notice, fix incidents faster, and have a full set of measurements for system performance.
There are usually clear expectations of an enterprise service shared by everyone involved. During the design and build phases of a project, expectations come from an initial list of business requirements, from a growing familiarity gained during the project, and from nailing down a formal SLA (Service Level Agreement). During the operational lifetime, expectations are adjusted as applications are updated and adapted to provide a better customer service. Monitoring checks whether those expectations are being met, from two points of view.
- End-user experience. What is the user seeing? Are a client's transactions keeping customers happy?
- Behind the scenes. What is happening behind the scenes? How are the components of the application doing? Do I have enough resources to keep the application running?
What's happening client-side?
A monitoring application uses synthetic users to check the front end of a service. Synthetic users continually run transactions and collect data. The monitoring application records the time taken for these users to send a request to a web service and receive a response. It can check that these response times are within the SLA and graph them for management reports.
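To make the synthetic user idea concrete, here is a minimal sketch in Python of one timed transaction checked against an SLA figure. The two-second target, the `timed_check` name and the swappable `fetcher` argument are my own assumptions for illustration, not part of any monitoring product.

```python
import time
import urllib.request
from typing import Callable

SLA_SECONDS = 2.0  # hypothetical SLA target, chosen only for this sketch


def fetch(url: str) -> bytes:
    """Default fetcher: a plain HTTP GET with a timeout."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def timed_check(url: str,
                fetcher: Callable[[str], bytes] = fetch,
                sla_seconds: float = SLA_SECONDS) -> dict:
    """Run one synthetic transaction and report how long it took."""
    start = time.perf_counter()
    try:
        fetcher(url)
        ok = True
    except OSError:
        # urllib's URLError and socket timeouts are subclasses of OSError.
        ok = False
    elapsed = time.perf_counter() - start
    return {
        "url": url,
        "ok": ok,
        "seconds": elapsed,
        "within_sla": ok and elapsed <= sla_seconds,
    }
```

A scheduler would call `timed_check` every few minutes and store the results, which gives both the pass or fail check and the raw numbers for graphs. Passing in the fetcher makes it easy to swap a single GET for a longer scripted transaction.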
The monitoring application can run a simple check on the home page to make sure no-one has defaced it. The app can run a complex transaction that works all the components of an application.
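The simple home page check could look something like the sketch below. It assumes two hypothetical approaches: comparing a SHA-256 fingerprint against a known-good copy (which only suits a static page) and looking for text that should always be present (which suits a dynamic one). The function names are mine.

```python
import hashlib
from typing import Iterable


def fingerprint(body: bytes) -> str:
    """Return a SHA-256 hex digest of the page body."""
    return hashlib.sha256(body).hexdigest()


def matches_baseline(body: bytes, baseline: str) -> bool:
    """True if the page is byte-for-byte identical to the known-good copy."""
    return fingerprint(body) == baseline


def contains_markers(body: bytes, markers: Iterable[str]) -> bool:
    """True if every expected piece of text still appears on the page."""
    text = body.decode("utf-8", errors="replace")
    return all(marker in text for marker in markers)
```

The baseline fingerprint has to be refreshed on every legitimate release, otherwise the check raises a false alarm each time the page is updated.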
Monitoring an application from a remote client site is great for figuring out what kind of experience the users at that site are having. If a customer wants to argue that the performance is garbage, the account manager can use measurements that come out of this kind of monitoring to head off arguments. Remote monitoring can also expose service weaknesses. Do I need a CDN to speed up the service for faraway users? Is a problem reported by one user caused by their ISP, or my cloud service?
Unfortunately, suppliers can't install monitoring applications at many customer offices, so they use the services of a company with a widespread network presence. At the bargain end of the monitoring services market, I used the free tier of Monitor.us in an earlier post. At the premium end, Compuware run services all over the world.
What's happening server-side?
A monitoring application uses all sorts of OS metrics, such as CPU load, memory use, disk space and network traffic, to check the back end of a service. Every cloud-based application is built from thousands of virtually moving parts.
I want monitoring to check the layers of hardware, networking, OS and applications. These must all be monitored. Cloud services are multi-tenant - my application's performance is affected by how busy the other tenants are.
I want to check all the application components. The application itself is probably distributed across a few tiers - perhaps a front tier that the client sees, most of the business logic in the middle, and data sources at the back. These must all be monitored. I want to know if any component starts to suffer, such as the database creaking under the strain of a busier site, or a program update introducing inefficient code. I installed Cacti to check the OS when I was working on service reliability.
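A tool like Cacti collects and graphs far more than this, but the basic idea of sampling OS metrics and spotting a component that is starting to suffer can be sketched in a few lines. The metric names and threshold values below are assumptions for illustration, and `os.getloadavg` is only available on Unix-like systems.

```python
import os
import shutil

# Hypothetical alert limits, chosen only for this sketch.
THRESHOLDS = {"load_per_cpu": 1.5, "disk_used_fraction": 0.9}


def sample_os_metrics(path: str = "/") -> dict:
    """Take one sample of CPU load and disk use (Unix-like systems only)."""
    load_1min, _, _ = os.getloadavg()
    usage = shutil.disk_usage(path)
    return {
        "load_per_cpu": load_1min / (os.cpu_count() or 1),
        "disk_used_fraction": usage.used / usage.total,
    }


def breaches(metrics: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return the names of metrics that are over their limit."""
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0.0) > limit
    )
```

Running `breaches(sample_os_metrics())` on each tier gives a crude early warning. Storing every sample as well is what makes the trend graphs possible.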
All the remote services that an application connects to must also be monitored. No enterprise service is an island. Monitoring is required for all the back-end integration. I can't stop someone else's service occasionally going slow, but I can collect measurements I can throw at the owners of those services.
What's happening over time?
The measurements collected form trends that can expose some problems and head off others.
- The usage profile will change over time. If the service is attracting more customers over time, owners will be pleased. If it is slowing down over time, owners will not be pleased.
- Trends can be used to predict what will happen. If increasing amounts of system resources are being used over time, someone needs to know that the bill is going to get bigger.
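As a rough illustration of predicting from a trend, the sketch below fits a straight line to past measurements with ordinary least squares and estimates how long until a resource hits its limit. A straight line is my simplifying assumption, since real usage is often seasonal or bursty, and the function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept.

    Needs at least two samples taken at different times.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_limit(days: Sequence[float], usage: Sequence[float],
                     limit: float) -> Optional[float]:
    """Estimate days from the last sample until usage reaches the limit.

    Returns None when usage is flat or falling.
    """
    slope, intercept = fit_line(days, usage)
    if slope <= 0:
        return None
    return (limit - intercept) / slope - days[-1]
```

For example, if disk use grows by 10 GB a day and stands at 40 GB on day 3, `days_until_limit([0, 1, 2, 3], [10, 20, 30, 40], 100)` estimates six more days before a 100 GB volume is full.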
The performance of an application needs to be measured, and so does everything that can affect its performance, including the platform it is running on, the network connecting the application to the customer, and any other systems it relies on. Monitoring a cloud-based web service is not just a matter of regularly pinging it from your in-house monitoring application.
Creating comprehensive client-side monitoring is tricky. It's easy to install an open source application like Nagios on your in-house system yourself, but to get a decent spread of remote monitoring locations you have to use a geographically distributed service like Monitis or Gomez.
Creating comprehensive server-side monitoring is tricky, especially for enterprises with components spread across many geographical locations. You can add server-side monitoring yourself by stringing together some excellent and free open source monitoring applications, or you can rent the instant services of a company like LogicMonitor, New Relic or BMC.
If anyone expects anything of a service, then those expectations must be described in a way that lets success or failure be measured. A service must be monitored from the outside (where the clients are) and from the inside (all the server nuts and bolts). Monitoring continues for the life of the service.
Nick Hardiman builds and maintains the infrastructure required to run Internet services. Nick deals with the lower layers of the Internet - the machines, networks, operating systems, and applications. Nick's job stops there, and he hands over to the designers and developers who build the top layer that customers use.