
Support diary: Tony Geraghty (Thursday)

Read about overloaded CPUs and WAN problems in today's entry.

It’s one of “those days” for Tony.


8:05 A.M.
Everything checks out okay in the computer room, and it’s down to work for another day. I grab a cup of coffee on my way back to the office.

One of the users in the UK calls to inform me that he was doing some work at home and had inserted a disk containing a Word document that caused Dr Solomon to report another Word macro virus. Fortunately, he was experienced enough to run a complete system scan last night and called me this morning before plugging into the network. I talk to him about the dangers inherent in using floppy disks, but there's really not much that can be done about the problem. Once again, the importance of maintaining up-to-date virus signatures is emphasized by a real-world example.

8:45 A.M.
One of the remote sites calls to let me know they cannot log on to the HP 9000 here in head office. A little troubleshooting with a user there leads to a router with no power. As this particular site is about five miles away, and on the route I will be taking to visit another site today, I decide to pay a visit to see things for myself.

9:10 A.M.
My colleague in the UK is visiting the site with the e-mail problem from last night, so I talk to him about the mail server. As he's waiting for an equipment delivery that has been delayed by heavy fog, and is at something of a loose end, he takes control of the situation.

I'm running Glance on the HP 9000 to monitor CPU, disk, memory, and network access. Some weeks ago, I configured the X server on my Linux box to display Glance. I tend to leave the Linux box logged on as a low-level user I set up specifically to run Glance, with the graphs generated by Glance always visible.

I notice that CPU activity is unusually high, hovering at around 95 percent. Glance tells me the offending PID, and I log on to the HP 9000 to identify the process. It’s a process that started yesterday and appears to be a report run that didn't complete. I verify that the process owner isn't running anything related to it, and kill it. The CPU usage graph immediately drops back to a reasonable 15 percent.
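The same check can be sketched without Glance. The Python below is a minimal illustration, not what was run that morning: it parses the output of a POSIX-style `ps -e -o pid,pcpu,user,args` and lists any process at or above a CPU threshold. The function names, the 90 percent threshold, and the `ps` flags are all assumptions; some systems need a standards-compliance setting before `ps` accepts `-o`.

```python
import subprocess


def parse_ps(output):
    """Parse 'pid pcpu user args' lines into dicts, skipping the header line."""
    rows = []
    for line in output.splitlines()[1:]:
        parts = line.split(None, 3)  # args may contain spaces, so split at most 3 times
        if len(parts) < 4:
            continue
        pid, pcpu, user, args = parts
        try:
            rows.append({"pid": int(pid), "pcpu": float(pcpu),
                         "user": user, "args": args})
        except ValueError:
            continue  # ignore lines that do not look like process rows
    return rows


def busiest(rows, threshold=90.0):
    """Return processes at or above the threshold, highest CPU first."""
    hot = [r for r in rows if r["pcpu"] >= threshold]
    return sorted(hot, key=lambda r: r["pcpu"], reverse=True)


def snapshot():
    """Run ps once and return its text output (flags are an assumption)."""
    result = subprocess.run(["ps", "-e", "-o", "pid,pcpu,user,args"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `busiest(parse_ps(snapshot()))` gives the candidates to investigate. The sketch deliberately stops short of killing anything, since checking with the process owner first is the step that matters.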

10:30 A.M.
It’s going to be one of those days. The logon problem the NT server in the UK was having at the start of the week has recurred. It seems obvious now that this was not just an overdue reboot but possibly the onset of a more serious issue. I'll have to take a close look at the system and try to find the problem.

11:45 A.M.
I'm still no closer to solving the logon problem. The only clients affected are the few remaining Win 95 users. The only reference to a similar problem I can find on the MS KnowledgeBase CD suggests it may be caused by having two or more common network protocols installed on both the server and the client, with the server having difficulty responding to simultaneous logon requests over both protocols. This doesn't help much, as both server and clients are running TCP/IP exclusively. I've tried reinstalling the protocols on the client side, but without any change. I can't do anything to the server until lunchtime, as I don’t want to inconvenience any NT users who can log on.

1:45 P.M.
The problem seems to be solved. Reinstalling the TCP/IP suite and the service pack has resulted in everyone on the LAN being authenticated again. I've no idea what caused the problem, and it galls me that I could have solved it sooner had I just taken down the server in the first place. I guess the only consolation is that only the Win 95 users experienced a disruption in service; had I taken down the server any earlier, the entire LAN would have been affected. I shall have to monitor this machine and ensure the problem doesn't crop up again.

The obvious answer would have been to promote a BDC while I work on the PDC, but unfortunately there isn't one! The NT domain structure has only recently been put in place at this site, and a BDC isn't scheduled to be installed just yet. The entire WAN is undergoing large-scale restructuring, and it was just a matter of time before a problem of this nature reared its head.

When I commenced employment, the WAN consisted of a bridged peer-to-peer network spanning all nine permanent offices in Ireland and the UK. We've gradually begun to roll out a routed, TCP/IP-only NT network. The overall plan calls for an NT domain in each office. So far, there are five domains in place. Our next major installation is the replacement of the existing leased-line infrastructure between the sites with a managed frame relay network. I'm counting the days until this happens!

3:20 P.M.
One of the users rings to tell me he can't print using the application on the DRS6000. I clear the /var filesystem, and he's good to go again. I finally spend a few minutes writing the script to do this automatically.
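The diary doesn't say what the cleanup script contains, so the following is only a sketch of one plausible approach, written in Python rather than whatever shell the DRS6000 actually offers. It assumes the space in /var is being eaten by stale files in a single directory and deletes regular files older than a given number of days. The function name `purge_old_files`, the directory, and the age limit are all hypothetical.

```python
import os
import time


def purge_old_files(directory, max_age_days, now=None):
    """Delete regular files in `directory` older than `max_age_days`.

    Returns the sorted names of the files that were removed.
    Subdirectories and symbolic links are left alone.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    removed = []
    for entry in os.scandir(directory):
        if not entry.is_file(follow_symlinks=False):
            continue
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

Run from a scheduler such as cron, a call like `purge_old_files("/var/spool/example", 7)` (a made-up path) would keep a week of files and clear the rest. Restricting the sweep to one known directory, rather than all of /var, is the cautious choice when the script runs unattended.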

It looks like I'm not going to make that scheduled visit to the nearby sites, so I ring to let them know it'll be tomorrow morning. They aren't very happy, but the NT problems in the UK have put me behind schedule by more than three hours, and in the meantime, some other issues have been developing that need my attention.

One of the remote offices manufactures bitumen and other building materials and must comply with local air emission laws. The PC set up to monitor emission levels has developed some sporadic problems. Every once in a while, it crashes and stops monitoring the output of the manufacturing plant.

Since this could easily escalate into a pretty serious breach of Irish law, it becomes my number one priority for this evening. Fortunately, the remote site is quite close so I'm there within a half hour to take a look at the problem.

5:10 P.M.
It would appear that certain other software running on this PC had developed a fault and was stalling every few hours, causing the machine to finally crash. Reinstalling the offending software seems to solve the problem, but I will monitor the system closely for a few days.

I finish the day by beginning the install of a new system for another new hire. We seem to be hiring new staff at an incredible rate these past few months.

By 6:20, I'm finished with the new machine. The new user won't be starting until Tuesday next, so there is no hurry to physically install the machine. Time to go home and recover from a pretty intense day.
