Monday afternoon began peacefully enough. Being “on-call” for my daytime IT job means occasionally carrying a pager and cell phone for a week of 24/7 paranoia. It also means I stay as late as needed to perform scheduled server reboots, which are typically preceded by applying the latest Microsoft Security Updates and Hotfixes.

Sounds simple. But on this day, I had a particular Windows 2000 Server that seemed inclined to remind me it was, in fact, a Windows box. And, oh yeah, it was, after all, a Monday.

Microsoft made it easy. A quick connection to the Windows/Microsoft Update site revealed that a whopping 24 updates were needed and recommended for immediate installation. To my credit, I was smart enough to first install the updates on the designated “test” server, which revealed no evident issues. Then, at approximately 6 PM, I casually clicked the Install Updates button and waited thirty minutes while the installation completed. One reboot later, I had a fully functioning system with no warning or error messages in the system event logs. I could go home and enjoy my evening…

…But, no. Let me explain. Arriving home to find your spouse speaking with your employer’s Help Desk is never a good sign. It seemed no one at any of our five hospitals could print their much-needed patient labels and medication barcodes following the reboot. This ONE Windows server functions as the sole print server for all 1,400 printers at our remote sites. (Note: I only rebooted it; I didn’t design it.)

I hadn’t panicked at this point. So as my family ate dinner, I connected to the server via pcAnywhere and thought a quick restart of the print spooler would coerce the printers to fulfill their life’s purpose. A subsequent look at the printer queues revealed no jobs were queued and nothing was being printed. The Windows System Log showed a series of Informational Events: Event ID 9, “Printer xxx was set”. In fact, there were only Event ID 9s in the System Log. That was because the maximum log size was set to 1 MB, so the flood of Event ID 9s had crowded out everything else and made the log appear useless. Further investigation revealed the system partition had less than 500 MB of available space, and the HP Insight Manager agents indicated two bad memory modules.

It should be mentioned that at this point in my Monday evening, I began receiving follow-up calls from the Help Desk requesting system status and relaying much end-user angst.

I went with the obvious. A call was placed to our hardware support group, and the bad memory was replaced by 12:30 AM. I drove to the data center for this as well, and one more reboot later, print jobs still were not spooling. Openly sobbing wasn’t appropriate, so I again perused the event logs. The System Log was still filling with Event ID 9s. Googling this event showed it was informational and could be ignored, although it did look suspicious.

Fast forward to 2:00 AM. The system had been down for over six hours. Users were very unhappy, to say the least, and panic was beginning to set in. It also came to light that a new hospital system was going into production in less than five hours and that it depended on, you guessed it, the down system. A roll-back of the Windows updates was unhelpful, and the functional support person for the application was beginning to mumble words like Disaster Recovery.

I ran back to my desk and frantically searched for more clues. It seems typing “Windows printer problem” into a search engine returns more than 45 million results.

The team lead I awoke at 4:00 AM actually used the phrase “resume generating event”. That’s never good to hear. But it was about that time I found Microsoft Knowledge Base Article 832219. This gem of an article describes a scenario much like my own. Putting the pieces together, I discovered that the “Update Rollup 1 for Windows 2000 SP4” updates the PCL Universal Print Driver DLL (unidrvui.dll). This, in turn, causes a rebuild of all the PCL-based printer description files (i.e., for nearly 1,400 printers in this instance). By studying the System Log events again, I realized that roughly every 9 minutes a printer driver completed updating (Event ID 20). Multiplying that by 25 PCL drivers gave about 225 minutes of rebuilding in total, which meant the system should be functioning again by 6:00 AM, just in time for the new system “go-live” one hour later.
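
The back-of-the-envelope estimate above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the 9-minute interval, the 25 PCL drivers, and the 6:00 AM finish come from the account above, while the function names and the placeholder calendar date are invented for the example. The story never says exactly when the rebuild began, so the sketch works backwards from the expected finish to the start time the numbers imply.

```python
from datetime import datetime, timedelta

MINUTES_PER_DRIVER = 9   # observed gap between Event ID 20 entries
PCL_DRIVER_COUNT = 25    # PCL drivers that had to be rebuilt


def total_rebuild_minutes(minutes_per_driver, driver_count):
    """Total rebuild time, assuming drivers are processed one after another."""
    return minutes_per_driver * driver_count


def implied_start(finish, minutes_per_driver, driver_count):
    """Work backwards from the expected finish time to the implied start."""
    total = total_rebuild_minutes(minutes_per_driver, driver_count)
    return finish - timedelta(minutes=total)


# The calendar date is an arbitrary placeholder; only the clock time matters.
expected_finish = datetime(2000, 1, 1, 6, 0)
total = total_rebuild_minutes(MINUTES_PER_DRIVER, PCL_DRIVER_COUNT)
start = implied_start(expected_finish, MINUTES_PER_DRIVER, PCL_DRIVER_COUNT)

print(f"Total rebuild time: {total} minutes ({total // 60} h {total % 60} min)")
print(f"Implied rebuild start: {start:%H:%M}")
```

The sketch assumes the drivers rebuild strictly one at a time at a steady pace, which is a simplification; a log with uneven gaps between Event ID 20 entries would call for averaging the observed intervals first.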

The system was spooling print jobs again at 6:05 AM. I notified the Help Desk and performed a celebration dance in my cubicle while no one was looking. I also emailed my fellow team members an explanation and warning describing my travails and what not to do in the future. Sleep was finally an option by 9:00 AM.

So what can be learned from all of this? Obviously, testing and researching an update should be required before it is applied to a production system. The reality is many organizations don’t have the IT staff, hardware, or time to thoroughly test every released update. Often there is a catch-22: not installing a patch poses a security risk, but the consequences of installing it without adequate testing are too great.

In the case I described above, updates were applied to the test system first. But because the test system was a scaled-down version of production and had 100 printers versus 1,400, the problems weren’t evident.

My guess is many of you reading this are in similar situations regarding patching production systems. Sound off and let me know! And while you’re at it, check out a different perspective on this topic from fellow TechRepublic blogger Shannon Kalvar.