Nick Hardiman lists his 12 principles of operational readiness for an enterprise application built on the public cloud.
I installed an enterprise application on my public cloud-based virtual machine. Can I hand the application over to my enterprise colleagues for operational use? If so, I can tick this job off my list, grab a coffee and move on. But how do I know if it is ready for enterprise operation? How do I measure operational readiness?
Installing an enterprise application is not like installing a desktop application. Both are handy packages of shrink-wrapped knowledge, providing a popular set of functions to help people work. The difference is that an enterprise application requires a lot of non-functional work: I need to make the application work in an enterprise environment.
Here is the list of enterprise service operational principles that works for me. I measure my new enterprise application against my list, work with colleagues to fix the failures, and test again. When my application follows all the principles, it is operationally ready.
Enterprise service operational principles
These dozen statements describe how an enterprise service should be. If all these statements apply to my new enterprise service, I can happily stamp my operational approval on it.
I have provided a few examples to make these statements a little clearer, but I have not described the actions required to get there. As you can imagine, putting these principles into practice for enterprise services is complicated. It's almost impossible to get everything right.
Many cloud innovators provide services in one or two of these areas, to ease an organization's workload. You can pay Green Hat to provide cloud-based performance testing tools, Core Cloud Inspect to check security, and Cloudkick to monitor infrastructure. A few big players like EMC and Novell have enough tools to take all the responsibility. The bigger your wallet, the more responsibility you can avoid.
My enterprise service:
- has been functionally tested. If a new business application has not yet been signed off by the guy paying the bills, carrying out operational tests is a waste of my time.
- has capacity. Sysadmins may want to scale up the disk space for a storage service and the bandwidth for a video chat service. They may scale down to a pocket calculator for a monitoring service.
- is resilient. This is the world of High Availability: double up on single points of failure, improve code quality, and even if something does fail, make sure the service handles it gracefully.
- is recoverable. If a student deletes half the files or the computer room catches fire, the service can be restored.
- is reliable. Customers use Internet services 24 hours a day, but an intranet may only be needed during office hours. An intranet that is down every night may still be perfectly reliable.
- is scalable. What if the new service has traffic spikes or gets really popular? I may need to scale out by adding more servers. Wading through treacle is not attractive.
- is monitored. The operational support people must be alerted immediately if someone breaks into the computer room, if upstream services disappear, and if a process goes berserk.
- is supportable. If an architect designs an Internet bank that only runs on one server, how pleased will customers be when an operator turns off the bank to upgrade the memory?
- is secure. Vulnerabilities get patched, an IDS (Intrusion Detection System) watches the network, and the security team have signed on the dotted line.
- has been pushed to the limit. The whole system has been thrashed, bottlenecks fixed and the system thrashed again and again. The service owner then knows how much performance can be squeezed out of her service.
- has integrity. The customer support people won't be plagued by calls from customers whose data is inconsistent, whose files have disappeared, or whose transactions were duplicated.
- will operate within the SLA. The people sponsoring this service deserve to know how their investment is doing. The service builders automate the measurement and reporting of the service level. Stakeholders can then help a failing service to succeed.
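
To make the "is reliable" and "will operate within the SLA" statements a little more concrete, here is a minimal Python sketch of how service-level measurement might be automated. It is only an illustration under stated assumptions: the function names, the office-hours windows, the outage records, and the 99.5% target are all hypothetical, and a real service would pull outage data from its monitoring system. The point it shows is that downtime only counts against the service when it falls inside the agreed service hours, so an intranet that is down every night can still meet its target.

```python
from datetime import datetime, timedelta


def overlap_minutes(start_a, end_a, start_b, end_b):
    """Return how many minutes two time ranges overlap (0.0 if they do not)."""
    latest_start = max(start_a, start_b)
    earliest_end = min(end_a, end_b)
    return max(0.0, (earliest_end - latest_start).total_seconds() / 60)


def availability(service_windows, outages):
    """Percentage of agreed service time during which the service was up.

    Assumes the outage records do not overlap each other; overlapping
    records would be counted twice by this simple sum.
    """
    agreed = sum((end - start).total_seconds() / 60
                 for start, end in service_windows)
    down = sum(overlap_minutes(w_start, w_end, o_start, o_end)
               for w_start, w_end in service_windows
               for o_start, o_end in outages)
    return 100.0 * (agreed - down) / agreed


def meets_sla(measured_percent, target_percent):
    """True when the measured availability reaches the agreed target."""
    return measured_percent >= target_percent


# Hypothetical data: an intranet needed 09:00-17:00 on five consecutive days.
first_day = datetime(2024, 1, 1)
windows = [(first_day + timedelta(days=d, hours=9),
            first_day + timedelta(days=d, hours=17)) for d in range(5)]

# Down every night from 01:00 to 03:00 (outside service hours) ...
outages = [(first_day + timedelta(days=d, hours=1),
            first_day + timedelta(days=d, hours=3)) for d in range(5)]
# ... plus one 12-minute daytime outage on the third day.
outages.append((first_day + timedelta(days=2, hours=10),
                first_day + timedelta(days=2, hours=10, minutes=12)))

measured = availability(windows, outages)
print(f"Availability in service hours: {measured:.2f}%")
print("Within SLA" if meets_sla(measured, 99.5) else "SLA breached")
```

With these made-up numbers, the ten hours of nightly downtime cost nothing, and only the 12 daytime minutes out of 2,400 agreed minutes count against the service. A report like this, run on a schedule, is the kind of automated measurement that lets stakeholders see how the service is doing.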