Plugging in a new system requires more than just money and making sure all the interrelated parts play well together. Two key issues that CIOs may overlook in the rush to build a more stable environment are planning for variations in the use of the system—daily and seasonal changes in user and transaction levels—and ensuring that the system stays up and running at all times.
Both issues tie directly to the business value of the system, which is the first thing a CIO needs to determine when a system is selected for implementation. Without an understanding of the business value, it’s entirely possible that a development team will build a system that costs more to operate than the value it brings to the company.
A good approach to connecting the dots between the business value and the capacity and availability expectations is an internal service level agreement (SLA) between the users of the system and the operations group. The SLA should be clear about the expectations for growth and the expected availability, including acceptable scheduled and unscheduled downtime.
CIOs need to consider several issues in creating an SLA focused on provisions for effective capacity and availability management. Capacity management involves planning, sizing, and controlling the new system so that it always meets the minimum performance expectations in the SLA. But a capacity management strategy can’t be designed to meet these SLA levels at any cost. The cost associated with meeting these performance levels also has to meet the business’s cost expectations.
For example, although you can meet performance expectations by installing your new system with clustered, hot standby servers for each major function, the cost associated with buying the additional hardware and software licenses to support this level of redundancy may be considerably more than the expected business benefit of the system.
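The trade-off can be made concrete with a rough comparison of annual cost against expected benefit. The following is only a minimal sketch: the function name and every dollar figure are hypothetical, chosen for illustration, and a real analysis would also weigh the losses that redundancy prevents.

```python
def net_annual_value(expected_benefit, base_cost, redundancy_cost=0):
    """Return the expected annual benefit minus the total annual cost."""
    return expected_benefit - (base_cost + redundancy_cost)


# Hypothetical annual figures, in dollars.
benefit = 500_000   # expected business benefit of the system
base = 300_000      # cost to run the system without hot standby servers
standby = 250_000   # extra hardware and licenses for hot standby servers

print(net_annual_value(benefit, base))           # 200000
print(net_annual_value(benefit, base, standby))  # -50000
```

With these assumed numbers, the fully redundant design costs more than the system is expected to return, which is exactly the situation the SLA discussion should bring to light.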
Managing capacity effectively means that as the system grows, you’re able to add users or transactions without adversely affecting the existing users. For example, if the new system is a messaging system, you’ll need to understand in advance:
- How many users can run effectively on each server housing user mailboxes.
- How heavily they’ll use the system.
- How much space you’ll allow or they’ll require for message storage.
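Once you have estimates for those three questions, a first-pass sizing is simple arithmetic. The sketch below assumes a fixed per-server user limit and a fixed mailbox quota; the function names and the figures are hypothetical.

```python
import math


def mailbox_servers_needed(total_users, users_per_server):
    """Number of mailbox servers, rounding up so no server is over its limit."""
    return math.ceil(total_users / users_per_server)


def storage_needed_gb(total_users, quota_mb_per_user):
    """Total message storage in GB if every user fills the allowed quota."""
    return total_users * quota_mb_per_user / 1024


# Hypothetical estimates for a messaging system.
users = 4_500
servers = mailbox_servers_needed(users, users_per_server=1_000)
storage = storage_needed_gb(users, quota_mb_per_user=200)

print(servers)            # 5
print(round(storage, 1))  # 878.9
```

Sizing against the full quota is a deliberately conservative assumption; actual usage data from monitoring can later replace it.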
In a transaction-processing system, you’ll need to understand, among other things:
- How many transactions users will generate per day.
- How much load each of your application servers handles.
- How many transactions the database can process per second.
- How long a completed transaction can acceptably take.
You’ll also need to take into account that individual transaction times are very important for online or help center systems, but less important for batch entry systems with large numbers of operators.
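For a transaction-processing system, the answers to these questions can be combined into a rough peak-load check. This is a sketch under stated assumptions: the share of daily volume that falls in the busiest hour, the per-server throughput, and the database limit are all hypothetical inputs you would replace with your own measurements.

```python
import math


def peak_tps(transactions_per_day, peak_hour_share):
    """Average transactions per second during the busiest hour of the day."""
    return transactions_per_day * peak_hour_share / 3600


def app_servers_needed(tps, tps_per_server):
    """Application servers required to carry the given load, rounded up."""
    return math.ceil(tps / tps_per_server)


# Hypothetical inputs.
daily_transactions = 1_000_000
busiest_hour_share = 0.18   # 18% of the day's volume arrives in one hour
server_tps = 20             # load one application server can handle
database_tps_limit = 120    # transactions the database can process per second

tps = peak_tps(daily_transactions, busiest_hour_share)   # roughly 50 per second
servers = app_servers_needed(tps, server_tps)
database_has_headroom = tps <= database_tps_limit

print(servers, database_has_headroom)
```

Sizing against the busiest hour rather than the daily average is the design choice that matters here, because the SLA’s performance expectations have to hold at peak.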
Once the system is installed and operating, you must make sure that you have effective system monitoring. Monitoring allows you to collect current usage data, project future usage, and identify usage patterns. The monitoring system should also characterize peak load times, which will allow you to determine which servers struggle to handle their load in times of stress. Monitoring information should be collected and managed at both the server level and the network level. Often the bottleneck isn’t any of the servers but the network backbone over which they communicate.
Effective monitoring will help smooth transaction or user loads between servers, ensuring that you can handle the peaks within agreed-on SLA levels.
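As an illustration of what peak-load analysis might look like, the sketch below finds each component’s busiest sample and flags those at or above a utilization threshold. The sample data, the component names, and the 85 percent threshold are all hypothetical; note that the network backbone is tracked alongside the servers.

```python
def peak_by_component(samples):
    """Map each component to its (hour, utilization) sample with the highest utilization."""
    peaks = {}
    for component, hour, utilization in samples:
        if component not in peaks or utilization > peaks[component][1]:
            peaks[component] = (hour, utilization)
    return peaks


def overloaded(peaks, threshold):
    """Sorted names of components whose peak utilization is at or above the threshold."""
    return sorted(name for name, (_, util) in peaks.items() if util >= threshold)


# Hypothetical monitoring samples: (component, hour of day, utilization from 0 to 1).
samples = [
    ("app1", 9, 0.55), ("app1", 14, 0.92),
    ("app2", 9, 0.40), ("app2", 14, 0.61),
    ("backbone", 9, 0.70), ("backbone", 14, 0.97),
]

peaks = peak_by_component(samples)
print(overloaded(peaks, threshold=0.85))  # ['app1', 'backbone']
```

In this made-up data, app2 has spare capacity at the same hour that app1 is stressed, which is the kind of imbalance that load smoothing between servers is meant to correct.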
Availability management ensures that the resources are accessible to the users or transactions according to the level agreed on in the SLA. As with capacity management, you should determine the level of system availability by balancing the benefits of availability with the associated costs.
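It helps to translate an availability percentage into hours of downtime before agreeing to it. The following sketch does the conversion in both directions over a 365-day year; the function names and the example targets are assumptions for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted in the period at the given availability level."""
    return period_hours * (1 - availability_pct / 100)


def availability_pct(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Availability percentage implied by the given hours of downtime."""
    return 100 * (1 - downtime_hours / period_hours)


# A 99.9 percent target leaves under nine hours a year for all downtime.
print(round(allowed_downtime_hours(99.9), 2))

# A hypothetical two-hour maintenance window every week uses 104 hours a year.
print(round(availability_pct(2 * 52), 2))
```

Whether scheduled maintenance counts against the availability figure is a choice the SLA has to state explicitly; the arithmetic above simply treats all downtime alike.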
For example, in a Web ordering system designed for 24/7 operations, the revenues generated in off hours should be sufficient to justify designing the support systems to guarantee high availability in those off hours. In “scale out” systems, in which you have multiple systems within each tier (Web farm, application servers, database servers), you should determine when it’s economically feasible to take some servers offline for maintenance while other servers continue operating. The remaining servers only need to perform at a level required to reach the expected economic benefit, not at the same level required during peak ordering times.
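For a scale-out tier, the question of how many servers can go offline reduces to comparing the expected load in a given period with the capacity of the servers that remain. This is a minimal sketch with hypothetical figures, and it deliberately leaves out any safety margin, which you would normally add.

```python
import math


def servers_free_for_maintenance(total_servers, capacity_per_server, expected_load):
    """Servers that can be taken offline while the rest still carry the expected load."""
    needed = math.ceil(expected_load / capacity_per_server)
    return max(total_servers - needed, 0)


# Hypothetical Web farm: 6 servers, each handling 100 requests per second.
print(servers_free_for_maintenance(6, 100, expected_load=180))  # off hours: 4
print(servers_free_for_maintenance(6, 100, expected_load=550))  # peak ordering: 0
```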
With many business transaction systems, it’s essential to define specific times when the systems will be unavailable to allow for system backups, hardware maintenance or upgrades, software changes, operating system upgrades, or configuration changes. Defining both the times and the operations performed during those windows in the SLA will help the system designers, users, and operations team build, use, and maintain the system while minimizing cost and maximizing economic benefit.
Having regularly scheduled maintenance windows allows the operations team to minimize the number and impact of system failures, increase the overall reliability of the system, and minimize recovery time in the case of a catastrophic failure.
By first determining the business value, and then mapping out capacity and availability expectations in the internal SLA, you can decrease the chances of building a system that is either underpowered or overbuilt (too costly and underutilized). As a result, you’ll have greater assurance that the system implementation and its associated costs will provide a strong return on investment.