Application failure and service disruption can rapidly short-circuit a company’s online sales, customer service, or internal processes. In 2011, Japan’s Mizuho Bank suffered a system meltdown that shuttered its massive ATM and Internet banking services for several days and delayed processing of 1.16 million transactions, worth a total of about $10 billion. A software glitch at a New Zealand telecommunications company resulted in 47,000 customers being incorrectly charged for hitting their data limits early. The company reimbursed customers in a $2.7 million settlement. While these are examples of massive software failures, the types that get senior executives fired, even minor application disruptions can cost a bundle and create management discontent.

IT departments need to consider the ramifications of a service outage and learn to recognize when a meltdown is looming. This article discusses the warning signs of application failure and offers advice for spotting issues before they become fatal.

1. Define failure

Enterprise application failure takes many forms, but the most common are the day-to-day glitches that disrupt customer service or employee productivity. IT wastes precious staff time troubleshooting and fixing problems while the business loses money. Even small changes to software can create unforeseen problems that affect business results. Google, AOL, Bing, Shopzilla, and other Internet leaders recognize that even small increases in response time, on the order of four-tenths or half a second, which in theory should go unnoticed by users, can take a measurable toll on site usage and therefore revenue. A business should define up front what constitutes application failure in its core systems, in terms of unacceptable results for the business, users, or customers.

2. Watch end-user response times

Nothing prompts employees or customers to call the help desk faster than a site, application, or transaction that begins to run noticeably slower. First, define the threshold for acceptable response times, which will vary with the criticality of the system and its transactions. Then collect actual response-time data from a variety of sources, including server logs, website metrics tools, and whatever monitoring data is available from your individual applications. Ideally, a company will have a centralized application monitoring system that aggregates response times across systems and websites so that IT can see in one place where business-critical transactions are failing.
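As a rough illustration of the threshold idea, the Python sketch below pools response-time samples from any number of sources and flags transactions whose 95th-percentile response time exceeds a defined limit. The transaction names, the limits, and the choice of the 95th percentile are assumptions made for the example, not recommendations from any particular monitoring product.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical acceptable response times, in seconds, per business transaction.
THRESHOLDS = {"checkout": 2.0, "search": 1.0}


def p95(samples):
    """Return the 95th percentile of a list of response times."""
    if len(samples) < 2:
        return samples[0]
    # quantiles(..., n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(samples, n=20, method="inclusive")[-1]


def find_breaches(records, thresholds):
    """Group (transaction, seconds) records and report threshold breaches.

    `records` can be merged from server logs, web metrics tools, or
    application monitors, as long as each item is a (name, seconds) pair.
    Returns a dict mapping each breaching transaction to its observed p95.
    """
    by_transaction = defaultdict(list)
    for name, seconds in records:
        by_transaction[name].append(seconds)

    breaches = {}
    for name, samples in by_transaction.items():
        limit = thresholds.get(name)
        if limit is None:
            continue  # no threshold defined for this transaction
        observed = p95(samples)
        if observed > limit:
            breaches[name] = observed
    return breaches


# Example: one slow checkout among ten pushes its p95 over the 2-second limit.
sample_records = (
    [("checkout", 1.0)] * 9 + [("checkout", 5.0)] + [("search", 0.2)] * 10
)
breaches = find_breaches(sample_records, THRESHOLDS)
```

A percentile is used here rather than an average because a handful of very slow transactions can hide behind a healthy-looking mean; a team could just as easily substitute whatever statistic its service levels are written against.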

3. Keep an eye on transaction volumes

These days, everyone’s talking about “Big Data,” and how companies can capitalize on the petabytes of data streaming from the Web and other places. The downside of the data explosion we are now experiencing is that transaction volumes can spike unexpectedly. Knowing when these spikes typically occur is one way to stay ahead of an application meltdown. Otherwise, if your storage and server systems are not provisioned correctly, the application or website may buckle under the pressure.

For instance, a company like Yahoo! needs to plan for high transaction volumes early in the morning in its top markets, when users are checking the news. Conversely, a gaming site might experience spikes in the late evening, when people get home from work and are ready to play. Planning for those expected capacity shifts is standard practice, but what about the unexpected spikes? Catching those requires monitoring solutions that keep tabs on the average response times for a particular application and alert IT when response times begin to inch upward.
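One simple way to detect response times inching upward is to compare a rolling average against a known baseline. The sketch below assumes a hypothetical baseline of one second, a three-sample window, and a 25 percent tolerance; real values would come from the application's own history and service levels.

```python
from collections import deque


class ResponseTimeDriftMonitor:
    """Flag when the rolling average response time drifts above a baseline."""

    def __init__(self, baseline_seconds, window=5, tolerance=1.25):
        self.baseline = baseline_seconds
        self.tolerance = tolerance          # 1.25 means "25% above baseline"
        self.recent = deque(maxlen=window)  # keeps only the newest samples

    def observe(self, seconds):
        """Record one sample; return True if IT should be alerted."""
        self.recent.append(seconds)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data yet for a meaningful average
        average = sum(self.recent) / len(self.recent)
        return average > self.baseline * self.tolerance


# Example: response times creep from 1.0s toward 1.6s.
monitor = ResponseTimeDriftMonitor(baseline_seconds=1.0, window=3)
alerts = [monitor.observe(s) for s in [1.0, 1.1, 1.0, 1.3, 1.5, 1.6]]
```

In this example the monitor stays quiet through the first four samples and raises an alert on the fifth, once the three-sample average passes 1.25 seconds. A rolling window smooths out single slow requests so that only a sustained rise triggers the alert.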

It’s also critical to develop an in-depth understanding of the performance characteristics of the application. A stock-trading application deals with small amounts of data per transaction but potentially large transaction volumes, depending upon the time of day or related events. A video production system, by contrast, manages far fewer transactions but enormous files. The app manager needs to weigh the timing of those large file transfers against other demands upon the infrastructure, in order to make just-in-time changes that maintain required service levels.

4. Manage shared infrastructure

With the advent of cloud computing and virtualization, many organizations are running IT on a shared, dynamic infrastructure. Virtual machines run on pools of servers and storage, and application loads shift around according to business rules. This sometimes creates contention, when two applications compete for the same storage drive at the same time. Network optimization can help, but IT needs to pay particular attention to capacity planning in these new environments. In our experience, application issues originate more often in the storage and database layers than in the application itself. The ability to see through all of these layers quickly can help an IT organization stay on top of shifting supply and demand in the private cloud.

5. Develop an application performance management plan

Firefighting and cleaning up messes can be minimized if a company takes the time to develop a plan and process for meeting application performance needs and business SLA requirements. This entails mapping business requirements to application requirements, setting thresholds and metrics for intervention, and continuously monitoring the entire infrastructure from user/transaction to disk. The resulting tools and processes can tell IT managers how long it will be until storage is at risk, and what impact infrastructure adjustments will have on capacity.
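To make the "how long until storage is at risk" question concrete, here is a minimal Python sketch that fits a straight-line trend to daily usage figures and projects when usage will cross a risk level. The 90 percent risk level, the sample figures, and the assumption of linear growth are all illustrative; commercial tools may use more sophisticated forecasting.

```python
def days_until_at_risk(daily_used_gb, capacity_gb, risk_fraction=0.9):
    """Estimate days until usage reaches risk_fraction of capacity.

    daily_used_gb: one usage reading per day, oldest first.
    Returns None if usage is flat or shrinking, and 0.0 if the
    risk level has already been reached.
    """
    n = len(daily_used_gb)
    if n < 2:
        raise ValueError("need at least two daily samples")

    # Least-squares slope of usage versus day index (GB per day).
    mean_x = (n - 1) / 2
    mean_y = sum(daily_used_gb) / n
    numerator = sum(
        (x - mean_x) * (y - mean_y) for x, y in enumerate(daily_used_gb)
    )
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator

    if slope <= 0:
        return None  # no growth trend, so no projected risk date

    remaining_gb = capacity_gb * risk_fraction - daily_used_gb[-1]
    return max(remaining_gb / slope, 0.0)


# Example: a 1,000 GB volume growing by 10 GB a day from 500 GB.
usage = [500, 510, 520, 530, 540]
days_left = days_until_at_risk(usage, capacity_gb=1000)  # 36.0 days
```

The same function can be rerun with a larger `capacity_gb` to see how much runway a proposed infrastructure adjustment would buy.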

Planning entails determining what can be archived and what truly needs to stay “live.” Which applications require the fastest storage and servers, and which can make do with lower-cost devices? Balancing performance with cost and business needs is crucial; there really can’t be enough refinement of this exercise in the modern enterprise.

Sherman Wood is VP of Products at Precise.