Mistakes happen. Anyone who claims they can produce a bug-free product is lying either to you or to themselves, and it is debatable which is worse. Anyone who has worked in the tech industry for a few years has a couple of horror stories up their sleeve. Some of those stories are amusing (in retrospect, of course), some are pretty disturbing, but all of them clearly demonstrate one point - there is no perfection.
As systems become more and more complex, it is virtually impossible to avoid every mistake and implement a bug-free solution. Once you accept that as an axiom, the emphasis shifts from "How do I avoid all mistakes?" to "How do I minimize the impact of a mistake?"
There are a couple of common misconceptions in the development world that can be costly to companies. One of them is the developer's belief that the work is "done" when it is deployed to production.
While that is true from a sign-off and acceptance perspective, it is of very little comfort to a company when it realizes that the brand new holiday campaign launched last week (tested and approved) is not actually collecting order information on Black Friday. Or, to take a well-known example, when the Y2K bug broke applications developed years before the year 2000.
The bottom line is that the solution needs to work not only today or tomorrow, but months and years from now. Marry this goal with the axiom that no solution is 100% bug-free, and we have a pretty interesting challenge on our hands. Which brings us back to the question of how to identify and mitigate the impact of a mistake in a timely fashion.
If you expected revelations of all the world's secrets and conspiracies after this long preamble - you're in the wrong place. The answer is simple: monitoring. Anyone running a website in this day and age employs some monitoring strategy to make sure the site is up and running. Most of these people firmly believe that the monitors in place are sufficient to run their business successfully. And most of them are wrong.
Flaws in "traditional" monitoring
The complexity of today's web applications has gone far beyond the capabilities of "traditional" monitoring. Keeping tabs on the uptime and responses of your site will not paint a full picture of application performance and, consequently, will let problems slip through the cracks.
Twitter serves over 20 million unique visitors a day ... and is legendary for its downtime. Traditional HTTP checks will notify operations teams immediately if the site becomes unavailable. But what happens if the site is, seemingly, up and running? For example, the HTTP checks return a 200 code for the target pages and browsing trends are above threshold.
Everything is up and running, users are happy; the operations team can go out for a few beers. Right? Wrong! One of the more recent problems with Twitter was lost tweet data. The site was up and available for browsing, the API accepted post requests and returned success codes, but the posts never registered with Twitter. From the standpoint of basic checks, the site was operational; from the standpoint of frustrated users, it was not.
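The gap between the two perspectives can be sketched in a few lines. Everything below is hypothetical and stubbed in memory - `fetch`, `post_item`, and `get_item` stand in for a real service client - simulating exactly the failure mode described above: a site that serves pages fine but silently drops writes.

```python
# A "traditional" uptime check vs. an end-to-end functional check.
# The service here is a stub, so the example is self-contained.

def uptime_check(fetch):
    """Traditional check: the page returns HTTP 200, so all is 'well'."""
    status, _body = fetch("/")
    return status == 200

def write_read_check(post_item, get_item):
    """Functional check: write a record, then confirm it can be read back."""
    token = "monitor-probe-123"   # unique marker for this probe
    post_item(token)
    return get_item(token) is not None

# --- Stubbed service that is "up" but silently losing writes ---
_store = {}

def fetch(path):
    return 200, "<html>ok</html>"  # site serves pages fine

def post_item(token):
    return 200                     # accepts the post... but never stores it

def get_item(token):
    return _store.get(token)       # the write never landed

print(uptime_check(fetch))                     # True  -> pager stays quiet
print(write_read_check(post_item, get_item))   # False -> data is being lost
```

The traditional check happily reports success while the functional check catches the lost data; only the second one reflects what the user actually experiences.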
Now, in all fairness, because of Twitter's viral popularity and non-sensitive data, its constant issues do not significantly affect the company's bottom line. Most companies are not so lucky: they pay for business downtime in hard dollars, opportunity cost, or both. To minimize these costs, companies need to identify and implement specific business rules that provide a sound basis for measuring the availability and success of the service offered.
Business rules for developers
This brings us back to the developer's perspective. By establishing business rules, you not only lay the foundation for business success, you also establish the success criteria for the job being "done" after launch. If the project involves site registration (beta sign-up pages, membership sites, etc.), a viable business check would be to make sure the hourly number of registrations does not drop below a set threshold.
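As a sketch, such a registration check might look like the following. The threshold, the function names, and the in-memory list of sign-up timestamps are all assumptions for illustration; in a real deployment the count would come from the application's database.

```python
# Hypothetical business check: alert if hourly registrations fall below
# a threshold agreed on with the business owner.
from datetime import datetime, timedelta

REGISTRATIONS_PER_HOUR_MIN = 50   # assumed threshold, set by the business

def check_registration_rate(registrations, now):
    """registrations: list of sign-up timestamps (stand-in for a DB query)."""
    cutoff = now - timedelta(hours=1)
    count = sum(1 for ts in registrations if ts >= cutoff)
    if count < REGISTRATIONS_PER_HOUR_MIN:
        return f"CRITICAL: only {count} registrations in the last hour"
    return f"OK: {count} registrations in the last hour"

# Simulated data: 3 sign-ups in the last hour, well below threshold
now = datetime(2013, 1, 1, 12, 0)
recent = [now - timedelta(minutes=m) for m in (5, 20, 45)]
print(check_registration_rate(recent, now))
# -> CRITICAL: only 3 registrations in the last hour
```

The check fires even though every server involved would pass a standard uptime probe.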
If you're working on an e-commerce solution, the ratio of credit card transaction successes to failures would be a good measurement to be certain that the process works as expected. Note that business checks do not conflict with system checks. They can (and should) be used in conjunction with each other.
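A sketch of such a ratio check, again with a hypothetical threshold, and with the transaction counts passed in directly rather than pulled from payment gateway logs as they would be in production:

```python
# Hypothetical business check: watch the failure ratio of credit card
# transactions. The 5% limit is an assumption for illustration.

FAILURE_RATIO_MAX = 0.05   # assumed acceptable failure rate

def check_transaction_ratio(successes, failures):
    total = successes + failures
    if total == 0:
        return "WARNING: no transactions observed"  # silence is suspicious too
    ratio = failures / total
    if ratio > FAILURE_RATIO_MAX:
        return f"CRITICAL: {ratio:.1%} of transactions failing"
    return f"OK: {ratio:.1%} failure rate"

print(check_transaction_ratio(950, 50))    # 5.0% -> OK (at the limit)
print(check_transaction_ratio(800, 200))   # 20.0% -> CRITICAL
```

Note the "no transactions" branch: on a site that normally takes orders around the clock, zero activity is itself a signal worth paging on, even though nothing has technically "failed".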
Who to blame?
There is a line of responsibility between system administrators and developers: system administrators are responsible for system health, and developers are responsible for application health. That line is often blurred, leading either to effective teamwork or to cross-department finger pointing. Often, operations has no knowledge of application-specific functionality or of how system changes would affect the application.
Following the registration example above, if your web server is offline, the application will not be accepting any registrations. But bringing the server back online does not guarantee that the application resumes successfully. If there are no monitors validating application behavior, there are no problems - at least none that anyone can see. From the standpoint of the operations team, everything is up and running. From the standpoint of the business owner - not so much.
To wrap up...
Development of business and functionality monitors should be a part of any project scope. Period. The application may be elegant, extensible, even near-perfect - but if that "near" rears its ugly head without anyone noticing and acting in a timely fashion, no redeeming quality of the application can negate it. There is a variety of tools to help developers get the job done and comply with the company's monitoring guidelines. Get them, learn them, and use them.
Leon Fayer is Vice President, Business Development for OmniTI, a provider of web infrastructures and applications for companies that require scalable, high performance, mission critical solutions.