Monday afternoon began peacefully enough. Being on call for my daytime IT job means occasionally
carrying a pager and cell phone for a week of 24/7 paranoia. It also means I stay as late as needed to
perform scheduled server reboots, which are typically preceded by applying the
latest Microsoft Security Updates and Hotfixes.
Sounds simple. But on this day, I
had a particular Windows 2000 Server that seemed inclined to remind me it was,
in fact, a Windows box. And, oh yeah, it
was, after all, a Monday.
Microsoft made it easy.
A quick connection to the Windows/Microsoft Update site revealed a
whopping 24 updates were needed and recommended for immediate
installation. To my credit, I was smart
enough to first install the updates on the designated test server, which
revealed no evident issues. And then, at
approximately 6 PM, I casually clicked the Install
Updates button, and waited for thirty minutes while the installation
completed. One reboot later revealed a
fully functioning system with no warning or error messages in the system event logs. I could go home and enjoy my evening.
But, no. Let me
explain. Arriving home to find your
spouse speaking with your employer's Help Desk is never a good sign. It seemed no one located at any of our five hospitals
could print their much-needed patient labels and medication barcodes following
the reboot. This ONE Windows server functions
as the sole print server for all 1,400 printers at our remote sites. (Note: I only rebooted it, not designed it.)
I hadn't panicked at this point. So as my family ate dinner, I connected to
the server via pcAnywhere and thought a quick restart of the print spooler
would coerce the printers to fulfill their life's purpose. A subsequent look at the printer queues
revealed no jobs were queued and nothing was being printed. The Windows System Log showed a series of
Informational Events: Event ID 9, "Printer xxx was set." In fact, there were only Event ID 9s in the System Log.
That was because the maximum size of the log was set to 1 MB, which left
it nearly useless. Further investigation revealed the system
partition had less than 500 MB of available space, and the HP Insight Manager
agents indicated two bad memory modules.
It should be mentioned that at this point in my Monday
evening, I began receiving follow-up calls from the Help Desk requesting system
status and relaying much end-user angst.
I went with the obvious.
A call was placed to our hardware support group and the bad memory was
replaced by 12:30 AM. I drove to the
data center for this as well, and one more reboot proved that print jobs still
were not spooling. Openly sobbing wasn't
appropriate, so I again perused the event logs. The System Log was still filling with Event ID 9s. Googling this event showed it was informational
and could be ignored, although it did look suspicious.
Fast forward to 2:00 AM.
The system had been down for over six hours. Users were very unhappy to say the least, and
panic was beginning to set in. It also
came to light that a new hospital system was beginning production in less than five
hours and it depended on, you guessed it, the down system. A roll-back of the Windows updates was
unhelpful, and the functional support person for the application was beginning
to mumble words like "Disaster Recovery."
I ran back to my desk and frantically searched for more clues.
It seems typing "Windows printer problem" into a search
engine returns more than 45 million results.
The team lead I awoke at 4:00 AM actually used the phrase "resume
generating event." That's never good to
hear. But it was about that time I found
Microsoft Knowledge Base Article 832219. This gem of an article describes a scenario
much like my own. Putting the pieces
together, I discovered that the Update Rollup 1 for Windows 2000
SP4 updates the PCL Universal Print Driver DLL (unidrvui.dll). This, in turn, causes a rebuild of all the
PCL-based printer description files (i.e., nearly 1,400 printers in this instance). I realized by studying the System Log events
again that roughly every 9 minutes a printer driver completed updating (Event
ID 20). Multiplying this by 25 PCL
drivers meant the system should be functioning again by 6:00 AM, just in time
for the new system go-live one hour later.
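
For anyone who wants to redo that back-of-the-envelope estimate, here is a minimal Python sketch. The 9-minute interval and the 25 drivers come from the story above; the 2:15 AM start time and the calendar date are assumptions, since the start time is simply backed out from the 6:00 AM estimate and was never read from a log. The function names are made up for illustration.

```python
from datetime import datetime, timedelta

MINUTES_PER_DRIVER = 9   # observed gap between Event ID 20 entries
PCL_DRIVER_COUNT = 25    # PCL drivers on the print server


def total_rebuild_minutes(drivers=PCL_DRIVER_COUNT,
                          minutes_each=MINUTES_PER_DRIVER):
    """Total rebuild time, assuming drivers update one after another."""
    return drivers * minutes_each


def estimate_completion(rebuild_start,
                        drivers=PCL_DRIVER_COUNT,
                        minutes_each=MINUTES_PER_DRIVER):
    """Projected time at which the last driver finishes updating."""
    minutes = total_rebuild_minutes(drivers, minutes_each)
    return rebuild_start + timedelta(minutes=minutes)


if __name__ == "__main__":
    # Assumed start time (hypothetical); the date itself is arbitrary.
    assumed_start = datetime(2000, 1, 1, 2, 15)
    print(total_rebuild_minutes(), "minutes of rebuilding")
    print("Estimated finish:", estimate_completion(assumed_start).strftime("%H:%M"))
```

The estimate only holds if the drivers rebuild serially at a steady pace, which is what the log entries suggested that night.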
The system was spooling print jobs again at 6:05 AM. I notified the Help Desk and performed a
celebration dance in my cubicle while no one was looking. I also emailed an explanation and warning
message to my fellow team members describing my travails and what not to do in
the future. Sleep was finally an option
by 9:00 AM.
So what can be learned from all of this? Obviously, testing and researching an update
should be required before it is applied to a production system. The reality is many organizations don't have
the IT staff, hardware, or time resources to thoroughly test every released update. Often there exists a catch-22 where it poses a
security risk to not install a patch, but the consequences of installing
without adequate testing are too great.
In the case I described above, updates were applied to the test system
first. But because the test system was a
scaled-down version of production and had 100 printers versus 1,400, the
problems weren't evident.
My guess is many of you reading this are in similar
situations regarding patching production systems. Sound off and let me know! And while you're at it, check out a different
perspective on this topic from fellow TechRepublic blogger, Shannon Kalvar.