This is the latest installment in my ongoing Ask the IT Strategy Guy column, where I answer questions about readers’ most difficult technology problems. If you would like to participate, please e-mail me at the address below.

Dear IT Strategy Guy,

Our company recently switched ERP systems. This broke just about everything it could. The problem lay mostly in the fact that our existing systems have been added to, patched, and spackled over through the years, and no one could really predict everything that would go wrong.

What do we do now?!?!

– Sleepless in Seattle

Dear Sleepless,

No person or system is perfect, and you should expect the occasional failure, at least to the point of having some semblance of a plan you can dust off when an implementation goes bad. Failing that, here’s my four-step program for fixing failure.

1) Plan your triage strategy

You are probably already getting hit from all sides with problem reports, angry users, and concerned customers. While it may be tempting to dive right in and start attempting to fix things, take a momentary pause and get some support structures in place. Leverage any existing help desk and defect tracking systems you have and assign a small team of trusted people to serve as “air traffic control,” processing incoming problem reports, assigning them a severity, and passing them to queues of technicians for a fix. This group should monitor the most severe issues and also delegate status updates to critical interested parties in management or among your key customers.

You need not spend weeks building the perfect systems and process, but an afternoon and evening spent identifying the players, setting up some infrastructure, and clarifying who has decision-making authority on which problems get the most rapid treatment will go a long way in the stressful days to come. If nothing else, get sympathetic ears on the phones who speak your local language competently. Their job is not to solve problems (except for the most basic); rather it is to provide some sympathy, log the problem, and move on to the next call.
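
If you have no defect tracker to lean on, the “air traffic control” desk described above can be approximated in a few lines. What follows is only a minimal sketch in Python, not a substitute for a real help desk tool: the severity scale, the TriageDesk class, and its method names are all hypothetical, invented here for illustration. It logs each report with a severity, hands technicians the most severe open problem first, and lets the desk list the critical issues it should be watching.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Hypothetical severity scale (an assumption): lower number = more urgent.
SEVERITY = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(order=True)
class Report:
    priority: int                        # numeric severity, compared first
    seq: int                             # arrival order, breaks ties
    summary: str = field(compare=False)
    reporter: str = field(compare=False)


class TriageDesk:
    """Toy stand-in for the team that logs, ranks, and routes reports."""

    def __init__(self):
        self._queue = []
        self._seq = count()

    def log(self, summary, reporter, severity):
        """Record a problem report and place it in the priority queue."""
        report = Report(SEVERITY[severity], next(self._seq), summary, reporter)
        heapq.heappush(self._queue, report)
        return report

    def next_for_technician(self):
        """Hand out the most severe open report, oldest first within a severity."""
        return heapq.heappop(self._queue) if self._queue else None

    def critical_open(self):
        """List the open critical reports the desk should keep tabs on."""
        return sorted(r for r in self._queue if r.priority == SEVERITY["critical"])
```

The point of the sketch is the ordering rule: the people on the phones only log and rank, and the queue decides what a technician sees next.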

2) Apologize and provide a contact method

This might seem like a wise first step, but without some forethought into how you will handle the onslaught of complaints, apologizing and then subjecting your constituents to busy phone lines and nonresponsive e-mail addresses will only enrage them further. Most people are willing to give you the benefit of the doubt after a brief but heartfelt apology, assuming you follow through on the promises you make in giving it. There’s no need for self-flagellation or public flogging here; explain that you screwed up, that you have your best people working on fixing everything, and how people can report their problems.

I highly recommend a dedicated support channel or at least a “hot button” on your usual contact channels. A great triage plan and thoughtful apology are quickly forgotten when your customer calls the number you provided and is subjected to a 37-option phone menu tree.

3) Fix stuff

This sounds simple, but getting through the first few days of a major failure will require diligent effort, particularly around managing the deluge of support requests that are headed your way. Closely monitor your team of “air traffic controllers” and ensure they are properly categorizing problems and keeping tabs on the most critical ones. If you are in high-level management, it wouldn’t hurt to call key constituents (executives, key customers, critical partners) and personally provide a status update. If a highly visible enterprise system is the object of the failure, consider scheduling regular (perhaps even twice daily) conference calls with a review of key problems and their current status. Again, this need not descend into a public flogging; rather, provide a summary of the biggest problems, their current status, and an ETA on a fix. Reiterate your contact channel and get off the phone.

If possible and without hampering the efforts of the team, track the time and resources spent on repairing the failure. You will use this information in the next step.
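
Tracking need not be elaborate. As a rough sketch, assuming each technician jots down hours against an issue and that you can estimate a loaded hourly rate for each person, the tally could look like the Python below. The WorkEntry fields and both helper functions are hypothetical names chosen for illustration, and any figures you feed in are yours to supply.

```python
from dataclasses import dataclass


@dataclass
class WorkEntry:
    person: str
    issue: str
    hours: float
    hourly_rate: float  # assumed loaded cost per hour for this person


def total_cost(entries, other_expenses=0.0):
    """Labor cost across all entries plus any other recovery expenses."""
    return sum(e.hours * e.hourly_rate for e in entries) + other_expenses


def hours_by_issue(entries):
    """Total hours spent per issue, to show where the effort went."""
    totals = {}
    for e in entries:
        totals[e.issue] = totals.get(e.issue, 0.0) + e.hours
    return totals
```

Even a crude total like this gives you a number to put in front of management once the dust settles.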

4) Analyze what went wrong

When life finally begins returning to normal, the tendency is to breathe a sigh of relief, then run for the next project in an effort to wash the bad taste of failure from your mouth. This is exactly the wrong approach, since you lose critical lessons from the failure.

In the case of Sleepless in Seattle, it seems an overly complex system was not adequately documented and not thoroughly tested. With technologies like virtualization being so cheap and ubiquitous, there is almost no excuse for not creating a virtualized copy of your environment where you can test developments, patches, and even wholesale migrations to a new system. After tallying up the cost for repairing your failed migration, there is likely a strong business case to be made for a more robust test environment that can be leveraged for everything from disaster recovery to training and “dry runs” of future platform migrations.

If nothing else, document what worked and what didn’t with your triage and communications plan, and work to create an “off-the-shelf” plan you can pull down the next time a critical implementation goes south. Perhaps the worst result of failure is failing to learn from your mistakes and to act on what your analysis reveals.