When a production environment is in a known working state, the only thing that can alter that state for better or worse is a change, either planned or unplanned.
Unplanned changes are a familiar beast to IT professionals: crashed servers, failed hard drives, malware infestations and other emergencies. However, planned changes (upgrades, reboots, server refreshes, etc.) can be even more damaging if poorly executed, since the end result can be a production outage.
Production outages can be costly and disruptive, so change control is a hot topic. Change control means using a standard method to apply changes to critical environments, both to guard against risk and to ensure the necessary staff are aware of the ramifications. It's especially common in large financial environments, where downtime can result in lost business or irreversible damage to the company's reputation.
Depending on its complexity, change control can be cumbersome and time consuming, and resentful IT staff may regard it as red tape. However, properly executed change management protects IT professionals as well: all ramifications are more thoroughly vetted, the work is signed off in advance by multiple approvers, and if things go awry, the blame game can be avoided.
Whether your company is large or small, if you're considering implementing (or bolstering) a change control policy, these 10 elements will be essential.
1. Plan the change
All aspects of the change should be planned out, whether it's as simple as "reboot the server" or as complex as updating code on a production system. How will it be executed and by whom?
Consider the ancillary details of the change and add these to the plan as well, such as advance notification of the end users regarding the temporary unavailability of certain systems or services, or engaging vendor support if in doubt with a specific procedure.
2. Estimate risk, and which hosts or services will be affected
Ask yourself: "What could go wrong? What will the impact be on related systems?" If you're upgrading an Active Directory domain controller that other systems rely on for user authentication, will those systems be inaccessible? (Hint: redundant domain controllers are a good idea.) If you're upgrading a router and need to engage technical support, will your ability to do so be hindered if the router fails? If a series of patch installations fails, how long will it take to get the server up and running again?
Factor in even the slightest details. Is the power in the server room prone to going out, and if it does during the change, will that produce a catastrophe? Will hands-on access to the server be required if a reboot fails? Is vendor support available 24/7/365?
This step reinforces caution and may even cause you to rethink the change and attempt another, safer avenue if the risk level is too high.
3. Include verification of success
A successful change is more than just rebooting a server and pinging it to confirm it came back up. Determine what will constitute success, such as making sure the necessary services loaded, no errors were logged, and everything is otherwise working as expected.
Verification should cover both the administrator and the end-user perspective. Let's say you updated the email server. Plan for administrative verification, such as establishing that all features are available from the server side, but also include end-user verification: confirming that authentication works, email folders are present on the client side, contacts are accessible, etc.
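Verification like this can be partially scripted. Below is a minimal sketch that checks whether the services an email change touches are reachable again; the hostnames and ports are hypothetical placeholders, and real verification should go deeper than TCP reachability (logging in, opening folders, etc.).

```python
import socket

# Hypothetical host/port pairs to verify after the change;
# substitute the services your own change actually affects.
CHECKS = [
    ("mail.example.com", 25),   # SMTP: server-side mail flow
    ("mail.example.com", 143),  # IMAP: client folder access
    ("mail.example.com", 443),  # webmail: end-user authentication
]

def service_up(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def verify_change(checks):
    """Run every check, report failures, and return True only if all pass."""
    failures = [(h, p) for h, p in checks if not service_up(h, p)]
    for host, port in failures:
        print(f"FAILED: {host}:{port} is not reachable")
    return not failures
```

A script like this belongs in the change plan itself, so the person executing the change at 2 a.m. runs the same checks the planner intended.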
4. Formulate a backout plan
One of the most crucial pieces of advice on this list is to develop a plan to reverse the change(s) if something goes awry. This may be as simple as uninstalling a patch or reverting to the use of the prior SSL certificate.
Some changes are of the "fail forward" type and cannot be reversed, so the backout plan may be as complex as rebuilding the entire server after an irreversible code change. Consider worst-case scenarios and increase the odds in your favor: determine whether you need spare equipment or alternate systems on hand, or whether you can take a snapshot of the virtual machines involved in the change (if applicable) to quickly revert them to their prior state. Even copying critical files elsewhere for safekeeping or performing a full system backup can be a lifesaver.
5. Test the process
In structured environments changes are implemented on test systems first, then development or staging machines, then production systems last. This layered approach allows adverse effects of a change to be identified and resolved before it goes prime time.
Of course, this depends on having a set of systems, from low to high priority, which model one another. Testing a change on a development system which has few, if any, similarities to a production box will yield little value. And some changes cannot be tested at all, such as on expensive one-off equipment which has no counterpart (an F5 load balancer, for instance). That's why the next tip will come in handy in these scenarios.
6. Establish a dedicated change time window
There's never a good time for downtime, but there are times which are less impactful (and stressful) than others. When considering a change on a specific system or set of systems, determine the timeframe during which these are least used. It may be 10 p.m., 2 a.m., high noon, on a specific day of the week, etc. Plan the change window for this timeframe.
I realize this may not be a popular tip for IT professionals if it involves getting up in the dead of night. I've experienced this first-hand. But I would take a 2 a.m. outage, when I'm the only one using a certain system (or even aware that it's down), over a 2 p.m. outage with an entire company negatively impacted and demanding status updates or breaking out the pitchforks and torches.
7. Assign staff responsibilities
If the change may involve staff from multiple departments, determine who should be responsible for which tasks in advance and assign them to the change process. This may include testing the results of the change, verifying the implementation thereof, or assisting with troubleshooting in the event of a problem.
This step reduces the chaos that might erupt following a change by ensuring the appropriate personnel have been preselected for their respective duties. The ability to leverage others for input if something goes wrong will greatly reduce the negative impact or downtime resulting from a change.
8. Document the change process via a request
Write up the full details of the change, including the plan, verification steps, backout strategy, testing outcomes, time window and assigned staff (in short, the results of the prior seven steps). This fully documents the process and, best of all, ensures it can easily be repeated later rather than having to be planned from scratch again and again.
Utilize a standard electronic form which can be customized to include all aspects of the change. Tech Pro Research offers a Change Order Form, for instance, which can meet these needs.
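Whatever form you use, the record should capture the outputs of the earlier steps in one place. As a rough illustration, here is one way to model a change request in code; the field names are made up for this sketch and are not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class ChangeRequest:
    """A minimal change-request record mirroring steps 1-7.
    Illustrative only; field names are not from any standard."""
    title: str
    plan: str                 # step 1: what will be done, and by whom
    risks: list               # step 2: affected hosts/services, what could go wrong
    verification: list        # step 3: checks that define success
    backout: str              # step 4: how to reverse the change
    tested_on: str            # step 5: where the change was rehearsed
    window: str               # step 6: agreed change time window
    assignments: dict = field(default_factory=dict)   # step 7: person -> task
    approvals: list = field(default_factory=list)     # step 9: sign-offs

    def is_approved(self, required=2):
        """Require multiple sets of eyes (step 9) before execution."""
        return len(self.approvals) >= required
```

A structured record like this is what makes repeatability possible: the next time the same change is needed, the prior request is the template.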
9. Leverage multiple sets of eyes for review and approval
Have the change reviewed by peers and managers alike to analyze it, look for potential pitfalls, and approve if valid. Peers can spot any technical challenges or possible improvements to the plan, and managers can sign off on the change to ensure all affected departments will be aware of what is to happen.
This may be another area that produces grumbling among IT professionals, who may feel their own capabilities are being called into question, or that seeking out multiple parties to review and approve the change involves too much red tape.
However, if something goes wrong not only will everyone be rowing in the same direction when resolving issues, but (as previously stated) this offers job security. If multiple parties have agreed upon all aspects of a change that's what's known as "CYA protection."
10. Conduct a post-mortem if an outage occurs
Despite your best efforts, an outage may nevertheless occur. Once the situation is resolved, conduct a post-mortem to determine what happened. Was the plan faulty? Did an unexpected, unrelated failure elsewhere cause the problem? How was it resolved, and what can be done to prevent a similar occurrence next time?
One last word of advice, which comes without judgment: staff may go rogue and circumvent the change process, especially in the early stages of a new implementation. Be prepared to put controls in place to monitor for unexpected changes (centralized logging and alerts, reviewing event logs, examining monitoring systems for scheduled or unscheduled downtime, etc.). Plan for discipline or, better yet, incentives such as time off in exchange for late-night change operations.
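One simple control for catching changes made outside the process is a checksum baseline of critical files: record hashes when the system is in a known-good state, then periodically compare. A minimal sketch follows; dedicated file-integrity tools and centralized logging do this far more robustly, and the monitored paths here are up to you.

```python
import hashlib
from pathlib import Path

def checksum(path):
    """SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def baseline(paths):
    """Record current checksums of the files you want to monitor."""
    return {str(p): checksum(p) for p in paths}

def unexpected_changes(base):
    """Return the monitored files whose contents no longer match
    the recorded baseline, i.e., changes made outside the process."""
    return [p for p, digest in base.items() if checksum(p) != digest]
```

After every approved change, the baseline is re-recorded; anything that drifts between approved changes is, by definition, an unplanned change worth investigating.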
Scott Matteson is a senior systems administrator and freelance technical writer who also performs consulting work for small organizations. He resides in the Greater Boston area with his wife and three children.