Several years ago, I worked for a large insurance company as a network administrator. With the job came the usual headaches, but things really got bad when the company instituted a third shift. Someone who knew very little about computers was assigned to the third shift help desk. Every time she got a question that she couldn’t answer—which was several times a night—she would call me at home, and I’d then have to go through the entire troubleshooting procedure over the phone.
At the time, I wished for someone that I could put on third shift to take care of the troubleshooting basics. That way, I would only get a call when the problem was serious, and I wouldn’t have to wonder what had been tried already.
This is where OpalisRobot comes in. OpalisRobot is a scheduling tool on steroids. Unlike most schedulers, it can actually troubleshoot your system when things break down: it can handle basic troubleshooting tasks and alert you when you need to take a more active hand.
How OpalisRobot can help you sleep at night
The basic premise behind OpalisRobot is that it links events with tasks. What makes the utility so powerful, however, is the huge number of event types and tasks available to you. For example, an event could be a scheduled time, an event log entry, or the stopping of a service. A task could range from starting a service to sending an e-mail message to running a batch file. You can see a complete list of the available objects on the OpalisRobot Object Descriptions Web page.
What makes OpalisRobot even more powerful is that you can link multiple tasks together in an intelligent manner. The system includes the ability to make decisions based on the current system status. For example, suppose you need to launch an application that depends on some service. You can link multiple task objects together in a way that checks to see if the service is running, starts the service if necessary, confirms that the service started, and then launches the application.
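To make that decision chain concrete, here is a minimal sketch of the same logic in Python. It is not OpalisRobot's own notation (the product is configured by drawing objects, not by writing code), and the three callables are hypothetical stand-ins for whatever actually checks the service, starts it, and launches the application.

```python
from typing import Callable


def launch_with_dependency(
    is_running: Callable[[], bool],
    start_service: Callable[[], None],
    launch_app: Callable[[], None],
) -> bool:
    """Launch the application only once its service is confirmed running.

    All three callables are hypothetical placeholders for illustration.
    """
    if not is_running():
        # Service is down: try to start it, then confirm that it came up.
        start_service()
        if not is_running():
            return False  # Start failed, so the application is not launched.
    launch_app()
    return True
```

The function returns False without launching anything when the service cannot be started, which is the point at which a real workflow would branch to a notification.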
It’s easy to use
The first time I read about OpalisRobot, I expected it to be difficult to use. However, nothing could be further from the truth. Using OpalisRobot involves simply dragging event and task objects to the desktop and then linking those objects together in a meaningful way by drawing lines between them.
Each object has an associated properties sheet that you can use to fill in any necessary details. For example, if you were working with a schedule event, the properties sheet would include things like the date, time, and frequency.
While all of OpalisRobot’s capabilities sound interesting, you might be wondering how OpalisRobot can help you troubleshoot network problems in a real-world environment. Earlier I recalled my wish for a troubleshooter on the third shift, who would only wake me up for the most serious issues. OpalisRobot can do that.
In my previous job, the company entered data 24 hours a day into a proprietary database, which depended on several services. If a service stopped, the database would stop working. Restarting the service usually fixed the problem, but getting it to restart often meant rebooting the server.
A service would occasionally stop due to low disk space as well. Let’s take a look at how OpalisRobot can help in this kind of situation.
I’d begin by setting up an event to monitor hard disk space. For example, I could configure OpalisRobot to trigger an event if the available hard disk space dropped below 100 MB. I’d then link quite a few task objects to the event object. These task objects would be designed to do a disk cleanup.
Once I’d built a chain of tasks for disk cleanup, I’d set up an event to check the amount of remaining hard disk space. This is necessary because you don’t really know how much space your tasks will be able to free up. For example, if your disk space threshold is 100 MB, but after cleaning up the disk you have only 101 MB of free space, it’ll only be a few minutes before the event is triggered again. When that happens, the cleanup isn’t going to free up significantly more disk space because it just ran moments earlier.
Therefore, I’d build a process into my series of tasks to have OpalisRobot call me if the disk cleanup were ineffective. That way, I could take action before the database went down. OpalisRobot has a telephony module that can call a phone number and play a prerecorded message. Since I’m probably asleep at 3:00 A.M. when the low disk space event would inevitably occur, it would make much more sense to have OpalisRobot call my home rather than sending me an e-mail message, which I wouldn’t see until I checked my messages hours later.
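The check, clean up, recheck, and escalate sequence can be sketched in Python as follows. This only illustrates the logic and is not how OpalisRobot is configured. The `cleanup` and `call_admin` callables are hypothetical, and the 50 MB safety margin is my own assumption about what counts as an effective cleanup; the 100 MB threshold comes from the example above.

```python
import shutil
from typing import Callable

MB = 1024 * 1024


def free_mb(path: str) -> float:
    """Free space on the volume holding `path`, in megabytes."""
    return shutil.disk_usage(path).free / MB


def handle_low_disk(
    get_free_mb: Callable[[], float],
    cleanup: Callable[[], None],       # hypothetical cleanup task chain
    call_admin: Callable[[], None],    # hypothetical phone notification
    threshold_mb: float = 100,
    margin_mb: float = 50,             # assumed margin, not from the article
) -> str:
    """Clean up when space is low; escalate if the cleanup barely helped."""
    if get_free_mb() >= threshold_mb:
        return "ok"
    cleanup()
    # Recheck: ending up just over the threshold (say 101 MB) means the
    # event would simply fire again a few minutes later.
    if get_free_mb() >= threshold_mb + margin_mb:
        return "cleaned"
    call_admin()
    return "escalated"
```

In real use, `get_free_mb` could be something like `lambda: free_mb("C:\\")`. Passing it in as a callable keeps the decision logic testable without touching a real disk.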
Now that we’ve tried to prevent low disk space, let’s look at those services. When a service stops, it usually makes an event log entry. Therefore, you can use the event log entry associated with a service stopping to trigger the chain of events necessary to correct the problem.
In such a situation, the first thing I’d do is send a pop-up message to everyone who uses the application and to the help desk. The message would state, “The database is temporarily unavailable. We are aware of the problem, and service will be restored momentarily.” Hopefully that would stop anyone from waking me up with a troubleshooting call.
Next, I’d have OpalisRobot simply restart the service. If the service didn’t then restart correctly, I’d have OpalisRobot disconnect all of the users and reboot the server. Once the server had rebooted, OpalisRobot would check the service status. If the service had started, OpalisRobot would send another pop-up message telling everyone that the database was back up. If the service hadn’t started, I’d then use a module similar to the one that I described earlier to call me at home—probably at 3:00 A.M.!
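Written out as code, that recovery chain might look like the sketch below. Every callable is a hypothetical stand-in for an OpalisRobot task object, and the wording of the "back up" message is my own. I've also assumed that the all-clear message goes out after a successful plain restart, not only after a reboot.

```python
from typing import Callable

DOWN_MESSAGE = (
    "The database is temporarily unavailable. We are aware of the "
    "problem, and service will be restored momentarily."
)
UP_MESSAGE = "The database is back up."  # assumed wording


def recover_service(
    is_running: Callable[[], bool],
    restart_service: Callable[[], None],
    reboot_server: Callable[[], None],   # disconnects users, then reboots
    notify_users: Callable[[str], None],
    call_admin: Callable[[], None],
) -> str:
    """Try a restart, then a reboot, then fall back to phoning the admin."""
    notify_users(DOWN_MESSAGE)
    restart_service()
    if not is_running():
        reboot_server()
        if not is_running():
            call_admin()
            return "escalated"
    notify_users(UP_MESSAGE)
    return "recovered"
```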
Performance monitoring with OpalisRobot
There are countless articles on the Internet about performance monitoring. Many do a good job of explaining which counters you should watch and why. There’s just one problem, though: Instead of using the Performance Monitor to prevent catastrophes, many administrators (myself included) tend to use it to diagnose what’s already gone wrong. This is another place where OpalisRobot can make your life easier.
It’s easy to think of an event as a timestamp. For example, if you scheduled your backup to run at midnight, then the clock hitting midnight would be the event, and the running of the backup would be the task. However, OpalisRobot has other events besides time events. You can actually configure OpalisRobot in such a way that a Performance Monitor counter hitting a specific value qualifies as an event. You can then trigger an action based on that event.
One of the most common system monitoring practices is to monitor CPU usage. According to Microsoft, it’s normal for the CPU usage to spike to 100 percent, but the level of CPU activity shouldn’t stay above 80 percent for any length of time. Based on this guideline, it would be easy to assume that you could create an event that triggers a task or series of tasks when the CPU activity level is greater than 80 percent.
While this is certainly possible, there are some problems with doing it this way. The biggest issue is that CPU usage spikes above 80 percent off and on all day. Just because you have a high spike in processor activity doesn’t mean there’s a problem. However, if the level of processor activity stays above 80 percent continually for more than a couple of minutes with a normal workload, that might very well signal a problem.
So how would you go about solving this sort of problem with OpalisRobot? The first step is somewhat obvious. You must create an event that detects when the counter associated with the CPU’s workload is above 80 percent. However, when the CPU’s workload does exceed that threshold, you don’t necessarily want the system to act immediately. After all, the event could relate to one of those harmless spikes in activity that Windows is so well known for.
The easiest way of preventing the system from acting too quickly is to set the event object to check the CPU counter every five seconds or so. Checking every five seconds rather than every second also reduces the workload on the machine, which lowers the chance that OpalisRobot itself is causing the performance issue. Each time the event is triggered, you can treat it as five seconds of high CPU activity, even though the activity might actually have been high for only one second, three seconds, or the full five.
The next step is to establish a counting mechanism. You can do this with a counter object. Since the event represents a five-second time slice, we’ll configure OpalisRobot to add five seconds to the counter each time the event is triggered.
Next, you’ll need to set up a second event to check the value of the counter every five seconds. If the counter reads 60, then the CPU activity will have been high for 60 seconds. If the counter reads 120, then the CPU activity will have been high for two minutes. The idea is that when the counter reaches 60, 120, or whatever value you set, an event will be triggered. You should link the event to another task that resets the counter to zero. This allows the process to start all over again. That way, you’ll be notified every time the CPU activity has been too high for too long.
Setting tasks after a performance event is triggered
Once you’ve set OpalisRobot to trigger an event under the conditions you’ve set, your next step is to set up some sort of alert. This alert can be in the form of an event log entry, e-mail message, pop-up message, and so on.
Our counter is designed to increment by five seconds every time CPU activity is above 80 percent. The problem is that nothing yet checks whether these spikes in activity are consecutive. If you left the task as is, a spike would increase the counter to five. Another spike an hour later would increase the counter to 10. The counter would eventually reach your threshold value whether the system was having any real performance problems or not.
To get around this problem, you need to set up one more event. This time, you’ll monitor the CPU counter every five seconds, but if the CPU activity level is below 80 percent, the counter will be reset to zero. This way, only consecutive instances of high CPU activity will increase the counter. The instant that the CPU activity level drops below 80 percent, the counter will reset, and the entire process will start all over again. You’ll only receive a notification if the CPU activity level has been too high for too long. You can see a flowchart of the entire configuration I’ve just described in Figure A.
|You can send a notification if the CPU activity is too high for too long.|
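For readers who think better in code than in flowcharts, the counter logic can be sketched in a few lines of Python. This is an illustration rather than anything OpalisRobot runs: the function takes a list of CPU readings assumed to be five seconds apart and returns the positions at which a notification would fire. The 120-second limit is just one of the example values mentioned above.

```python
from typing import Iterable, List


def sustained_high_cpu(
    samples: Iterable[float],
    threshold: float = 80.0,   # percent CPU
    interval_s: int = 5,       # seconds between samples
    limit_s: int = 120,        # how long "too long" is
) -> List[int]:
    """Return the sample indexes at which an alert would be raised."""
    counter = 0
    alerts = []
    for index, cpu in enumerate(samples):
        if cpu > threshold:
            counter += interval_s   # one more slice of high activity
        else:
            counter = 0             # streak broken, start over
        if counter >= limit_s:
            alerts.append(index)
            counter = 0             # reset so the process can repeat
    return alerts
```

With the defaults, 24 consecutive high readings (24 × 5 = 120 seconds) raise one alert, while a single low reading anywhere in the run resets the count.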
Use OpalisRobot to keep your network running smoothly
I’d like to take the concept of system monitoring a step further by showing you a couple of techniques you can use to keep your network from running into trouble.
Responding to e-mail
Earlier I explained that OpalisRobot can detect when a service stops and use that event to send a notification and to launch a series of tasks designed to address the problem. However, there are variations of this technique. You can configure OpalisRobot to respond to an e-mail message. For example, suppose you configured the system so that if a particular service stopped, an e-mail message would be sent to a specific mailbox. You could then set OpalisRobot to check for new mail any time the specified service stopped. By doing so, OpalisRobot would essentially be reading the message that it just sent.
Once OpalisRobot checks the mail, you could create a filter that performs specific actions based on the message’s subject line. One way to expand on this idea is to have a single event spawn multiple e-mail messages. For example, one message might tell OpalisRobot to restart a service, while another might tell it to write an event to the log. A final e-mail message could send an alert. Sending multiple e-mail messages with different subjects makes it possible to spawn multiple tasks. You can see an example of this particular setup in Figure B.
|OpalisRobot can send and react to multiple e-mail messages.|
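The subject-line filter boils down to a lookup table that maps subjects to tasks. Here is a minimal Python sketch of that idea. The subject strings and handlers are hypothetical examples, and matching on the lowercased, trimmed subject is my own choice.

```python
from typing import Callable, Dict


def dispatch_by_subject(
    subject: str,
    handlers: Dict[str, Callable[[], None]],
) -> bool:
    """Run the task registered for this subject line, if there is one.

    Keys in `handlers` are expected to be lowercase subject lines.
    """
    handler = handlers.get(subject.strip().lower())
    if handler is None:
        return False   # Unknown subject: no task is triggered.
    handler()
    return True
```

A single stopped-service event could then send three messages with three different subjects, and each one would land on its own handler.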
Earlier I explained that you could use OpalisRobot to keep tabs on your hard disk space and to react to low disk space by cleaning up the disk. Deleting temporary files is a great way to reclaim some disk space, but what if there are no temporary files to delete?
One way to solve the problem is to compress seldom-used files. For example, you can have users place all old files that must be kept, but that are seldom used, into an archive directory. You can then have OpalisRobot monitor the number of files and the size of the files in this directory. Once the threshold values that you’ve specified are exceeded, you can have OpalisRobot compress the files into a ZIP file and then delete the original files.
Before you jump in with both feet and do this, though, you’ll want to take a few precautions. First, you probably don’t want every file compressed and archived. For example, you probably wouldn’t want OpalisRobot to compress ZIP files and roll them into another ZIP file. You’d end up with a bunch of nested ZIP files. But you can configure OpalisRobot to compress only certain types of files. For example, you might compress .doc or .xls files.
You can also configure OpalisRobot to test the ZIP files that it creates. If the ZIP file checks out fine, you can have OpalisRobot delete the original files. If the ZIP file has a problem, you can configure OpalisRobot not only to leave the original files alone but also to send a failure notification to you. You can see an example of this entire process in Figure C.
|OpalisRobot can automatically compress archived files.|
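If you wanted to prototype the same compress, test, and delete sequence outside OpalisRobot, Python's standard `zipfile` module covers all three steps. The sketch below is an illustration under my own assumptions: the function name and the extension list are made up for the example, and the file-count and size thresholds are left out for brevity.

```python
import zipfile
from pathlib import Path
from typing import Iterable


def archive_directory(
    archive_dir: Path,
    zip_path: Path,
    extensions: Iterable[str] = (".doc", ".xls"),
) -> bool:
    """Zip matching files, verify the ZIP, then delete the originals.

    Returns True only if the originals were safely archived and removed.
    """
    wanted = {ext.lower() for ext in extensions}
    # Only selected file types are archived, so existing ZIP files
    # never get nested inside a new ZIP.
    files = [
        p for p in sorted(archive_dir.iterdir())
        if p.is_file() and p.suffix.lower() in wanted
    ]
    if not files:
        return False

    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for p in files:
            zf.write(p, arcname=p.name)

    # Test the archive before touching the originals. testzip() returns
    # None when every member passes its CRC check.
    with zipfile.ZipFile(zip_path) as zf:
        intact = (
            zf.testzip() is None
            and set(zf.namelist()) == {p.name for p in files}
        )
    if not intact:
        return False   # Leave the originals alone; the caller can notify.

    for p in files:
        p.unlink()
    return True
```

A False return is where a failure notification would be sent, leaving the original files untouched.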
Careful with e-mail notifications
OpalisRobot looks at the inbound message’s subject line to launch an event. However, I recommend taking things a step further and having OpalisRobot look not only at the subject line, but also at the sender. For the sake of security, I also recommend configuring OpalisRobot to forward to you any messages that it receives from someone other than the designated administrators.
If your e-mail system uses Exchange, I strongly recommend taking the OpalisRobot e-mail account out of the Global Address List (GAL). Having this address in the GAL opens you up to all sorts of unwanted e-mail. For example, if someone were to send a message about the company picnic to all recipients, OpalisRobot would also get the message if it were listed in the GAL. OpalisRobot would then detect that the message came from an unauthorized user and would forward the message to you. Instead of getting a security breach report, you’d get a second copy of the message about the picnic.
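The sender check itself is simple to express. Here is a hedged Python sketch of the routing decision: forward anything from an unknown sender, act only on known subjects from designated administrators, and ignore the rest. The function name, the addresses, and the three outcome labels are hypothetical. Keep in mind that a From header is easy to forge, so a check like this reduces noise and accidents rather than stopping a determined attacker.

```python
from email.utils import parseaddr
from typing import Set


def route_message(
    from_header: str,
    subject: str,
    admins: Set[str],           # lowercase addresses of designated admins
    known_subjects: Set[str],   # lowercase subjects that trigger tasks
) -> str:
    """Decide what to do with an inbound message."""
    _display_name, address = parseaddr(from_header)
    if address.lower() not in admins:
        return "forward"   # Not from an administrator: pass it along.
    if subject.strip().lower() in known_subjects:
        return "run"       # Trusted sender and a recognized command.
    return "ignore"
```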
Sleep peacefully—OpalisRobot is on duty
With careful configuration, OpalisRobot’s powerful features can take care of most common troubleshooting tasks, leaving you free to get a good night’s sleep. And with OpalisRobot on duty, when you do get that 3:00 A.M. wake-up call, you’ll have a better idea of what’s involved and what’s already been tried.