
Providing DHCP fail-over in Windows NT, part 1

Richard Charrington explains how, with little or no intervention, you can provide continuous DHCP services on your network—even in the event of a DHCP failure. You can use this process to free up the DHCP server for support or maintenance tasks without interrupting the DHCP service.
Suppose that your live server crashes and you can’t recover it before leases start to expire and users can’t get an IP address. In this Daily Drill Down, we’ll describe a method of providing a continuation of the DHCP service in the event of such a failure. You also can use this method to move the DHCP service so that you can work on the live server (for example, to install a service pack or hot fix). This method has been tested on two sites and works correctly, but it has been tested only on Windows NT Server 4.0 SP4. In this Daily Drill Down, we’ll explain the process and highlight any “gotchas.”

Requirements
The initial requirements include one NT server (the live server) that runs a fully configured and working DHCP service and another server (the standby server) with the DHCP service installed but not running. Both servers should be running Windows NT 4 SP4 or later.
The live server must be configured to back up the DHCP database every 15 minutes (or however frequently you require). Set the backup interval by changing the BackupInterval Registry value. Both the backup location and the backup interval are held as values under the DHCPServer\Parameters Registry key, as shown in Table A.
Table A
Backup location: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DHCPServer\Parameters\BackupDatabasePath
Backup interval: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DHCPServer\Parameters\BackupInterval
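As an illustration only (the article's own implementation is a batch file), the two values in Table A could be read as in the Python sketch below. It assumes that it runs on the DHCP server itself under an account that may read the key. The function name is hypothetical, and treating BackupInterval as a number of minutes is an assumption.

```python
# Registry location from Table A, relative to HKEY_LOCAL_MACHINE.
DHCP_PARAMETERS_KEY = r"SYSTEM\CurrentControlSet\Services\DHCPServer\Parameters"


def read_backup_settings():
    """Return (backup_path, backup_interval) as stored in the Registry.

    Hypothetical helper. The interval is assumed to be in minutes.
    """
    import winreg  # Windows-only standard library module, imported lazily

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, DHCP_PARAMETERS_KEY) as key:
        # QueryValueEx returns (value, registry_type); only the value is needed.
        backup_path, _ = winreg.QueryValueEx(key, "BackupDatabasePath")
        backup_interval, _ = winreg.QueryValueEx(key, "BackupInterval")
    return backup_path, backup_interval
```

Calling read_backup_settings() on the live server gives the script the directory to copy and the interval to use when it schedules its next run.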

Methods
To allow for rapid recovery from an active server failure, it’s necessary to keep a recent copy of the DHCP database on the standby server. It’s not as straightforward as it sounds—the structure and contents of the DHCP directory have to be correct, or the DHCP service will fail to start.

Microsoft’s method
Microsoft documented the procedure for moving DHCP between machines in Microsoft Windows NT Server Networking Guide (in the chapter entitled “Managing Microsoft DHCP Servers”) and on TechNet. It’s important to note that the Microsoft documentation contains several significant factual omissions and errors. Microsoft’s documentation suggested that the DHCP service be configured to back up to a subdirectory in %winroot%\system32\repl\export. It also recommended that NT replication be used to move the DHCP database to the standby server.
The problem with this method is that it’s impossible to prevent the DHCP backup from occurring during the replication or to prevent the replication from taking place during the DHCP backup. This situation could result in a corruption of the backup DHCP Jet database. If a corrupted database is copied over to the standby server, the DHCP service won’t work properly.
Microsoft recently acknowledged the mistakes in its documentation. The company has suggested using Robocopy and a different backup directory to copy the files to the standby server.
Microsoft’s fail-over process entails several manual steps. You will need to refer to a documented procedure to ensure that you follow the steps in their correct order.
The working method
The method described here uses Robocopy, as suggested in the revision to Microsoft’s original documentation. However, it goes well beyond the method suggested by Microsoft and allows you to automate all the required actions.
With a little bit of preparatory work and a good script, it’s possible to carry out a fail-over by stopping the DHCP service on the live server and starting the DHCP service on the standby server. In case the live server crashes, it’s just a matter of removing the crashed server from the network to make sure that its IP address isn’t still “live” and then starting the DHCP service on the standby server. In either case, it’s possible for the fail-over to be handled automatically.
Step-by-step process
The following steps make sure that the standby server can take over the provision of DHCP services simply by starting the service:
  1. On the live server, create a copy of the DHCPServer Registry hive (using Regdmp).
  2. Copy this file to the standby server.
  3. Duplicate the DHCP backup directory from the live server onto the same point on the standby server (using Robocopy).
  4. On the standby server, copy the files in the {backup}\jet\new directory to the %winroot%\system32\dhcp directory.
  5. On the standby server, import the Registry hive into the Registry (using Regini).
It’s possible for a bad network connection, a busy server, or heavy network traffic to slow down the process of copying the DHCP backup directory between servers. As a result, one of the files could still be open when the DHCP service on the live server starts its next backup. To guard against this, you can copy the backup to a second directory on the live server first, which extends the process as follows:
  1. On the live server, create a copy of the DHCPServer Registry hive (using Regdmp).
  2. Copy this file to the standby server.
  3. Copy the DHCP backup directory from the live server into another directory on the live server (using Robocopy).
  4. Duplicate this backup directory from the live server into a directory on the standby server.
  5. On the standby server, copy the directory structure in this backup directory to the DHCP backup directory.
  6. On the standby server, copy the files in the {backup}\jet\new directory to the %winroot%\system32\dhcp directory.
  7. On the standby server, import the Registry hive into the Registry (using Regini).
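The extended sequence above can be written down as data before any commands are attached to it. The Python sketch below only builds the ordered plan and runs nothing. The function, its parameters, and every path in the usage note are hypothetical. Tool options are left out on purpose because the steps above name only the tools.

```python
from pathlib import PureWindowsPath
from typing import List, NamedTuple

# Registry key that the article dumps with Regdmp and re-imports with Regini.
DHCP_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DHCPServer"


class Step(NamedTuple):
    where: str        # "live", "live to standby", or "standby"
    tool: str         # tool named in the article, or "copy" for a plain copy
    source: str
    destination: str


def plan_sync(backup_dir: str, live_staging: str, standby_staging: str,
              dhcp_dir: str, live_dump: str, standby_dump: str) -> List[Step]:
    """Return the seven steps in the order in which they must run."""
    # {backup}\jet\new holds the files that go into the DHCP directory.
    jet_new = str(PureWindowsPath(backup_dir) / "jet" / "new")
    return [
        Step("live", "regdmp", DHCP_KEY, live_dump),
        Step("live to standby", "copy", live_dump, standby_dump),
        Step("live", "robocopy", backup_dir, live_staging),
        Step("live to standby", "robocopy", live_staging, standby_staging),
        Step("standby", "copy", standby_staging, backup_dir),
        Step("standby", "copy", jet_new, dhcp_dir),
        Step("standby", "regini", standby_dump, DHCP_KEY),
    ]
```

For example, plan_sync(r"D:\dhcpbak", r"D:\dhcpstage", r"\\STANDBY\dhcpstage", r"C:\WINNT\system32\dhcp", r"D:\dhcpserver.txt", r"\\STANDBY\d$\dhcpserver.txt") returns a list whose sixth step copies D:\dhcpbak\jet\new into the DHCP directory. Keeping the plan separate from execution makes it easy to print and review the order before anything touches either server.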
You can build a batch file to carry out these steps, and at the click of a button (metaphorically speaking), you will have a server that’s ready to take over the provision of the DHCP service if the live DHCP server fails. However, to ensure that your standby server is current, you’ll need to run your batch file every time the DHCP service backup finishes.
One other step needs to be considered. When the standby server takes over the provision of DHCP services, client PCs with DHCP leases that subsequently expire will attempt to renew automatically from the DHCP server that originally provided the lease. These attempts will fail because that server (identified by its IP address) will not respond. However, on the new live server, the DHCP service will release that IP address, and it will become available for use by any other client requesting an address. Should that IP address be reissued, an IP conflict will occur.
Therefore, you may decide that any fail-over should include this final step: swapping names and IP addresses so that the live DHCP server always has the same name and IP address. However, keep in mind that it’s only a potential problem and will occur only if the client PC is not rebooted and the user does not log off between the fail-over and the expiration of the lease. Any such issue can be resolved quickly by using Winipcfg (on Windows 9x) or Ipconfig (on Windows NT) to force a release and renewal of the client’s IP address.
Automation
Automating this process will ensure that the standby server is ready to take over at any time. Use the At command (or Winat, if you prefer the GUI version) to schedule your batch file to run as often as the DHCP service runs its backup process. You’ll need to make sure that your batch file runs just after the DHCP service backup has finished.
You can find out when the last DHCP backup occurred by looking at the time stamp on one of the files in the backup directory, and you can schedule the batch file to run a few minutes later. If the DHCP backup runs every 15 minutes, you would have to use the At command to schedule 96 instances in order to cover a 24-hour period, which is a mind-numbing prospect and not very elegant.
A better solution would be to use the Soon command (which is found in the NT Resource Kit) and have the batch file schedule itself to run again 15 minutes after it starts. This approach will work for a while, but such a method is not accurate. The schedule will drift by a few seconds on each run and will clash eventually with the DHCP backup. As with the replication method, it may result in a corrupted DHCP database on the standby server.
To avoid this problem, you can calculate when the next DHCP backup will occur (add the backup frequency to the time stamp on a file in the backup directory) and schedule the batch file to run halfway between that backup and the DHCP backup that follows. With NT command extensions, the Set command can carry out simple arithmetic, but it doesn’t extend to adding two times together; thus, getting this command to work will require some ingenuity.
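The calculation described above is short once real date arithmetic is available. The following Python sketch is an illustration of the arithmetic only, not the batch implementation the article has in mind. The function names are hypothetical, and truncating to the whole minute is an assumption based on the At command working in hours and minutes.

```python
import os
from datetime import datetime, timedelta


def last_backup_time(backup_file: str) -> datetime:
    """Use the time stamp of a file in the backup directory as the backup time."""
    return datetime.fromtimestamp(os.path.getmtime(backup_file))


def next_run(last_backup: datetime, interval: timedelta) -> datetime:
    """Halfway between the next backup and the one after it.

    next backup        = last_backup + interval
    halfway to the one = last_backup + 1.5 * interval
    The result is truncated to the whole minute for scheduling.
    """
    run_at = last_backup + interval * 1.5
    return run_at.replace(second=0, microsecond=0)
```

With a last backup at 12:30 and a 15-minute interval, next_run returns 12:52. Because each run is derived from the actual backup time stamp rather than from the time at which the script started, the schedule cannot drift toward the backup the way the Soon approach does.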
Testing
Once you’ve written and run your batch file a few times, you should test a fail-over. You’ll want to run a test by stopping the DHCP Server service on the live server and starting the DHCP Server service on the standby server.
The first time that you try this switch, you may find that the DHCP service won’t start on the standby server. This failure to start can occur if the servers have been running the DHCP service for some time prior to introducing this new process. If it fails to start, restart the service on the live server, reboot the standby server, and run the test again.
After testing the fail-over one way, make sure that it works the other way, too. I suggest waiting several days before the second test so that plenty of changes will have been made. Again, you may have to reboot the server if the service fails to start. In my experience, however, the server reboot is required only once.
Gotchas
On the standby server, you’ll need to set the startup options for the DHCP Server service to Disabled or Manual so that the service will not start automatically if the server is rebooted. The service startup properties are held in the Registry. Every time you import the DHCPServer hive into the standby server Registry, it sets the service startup option to Automatic—as it was on the live server. Therefore, you need to include a line in your batch file that will set the option back to Manual or Disabled after the hive has been applied.
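The startup option lives in the Start value of the service's Registry key, where 2 means Automatic, 3 means Manual, and 4 means Disabled. The Python sketch below shows the reset as an illustration only. The function name is hypothetical, and it assumes that it runs on the standby server with rights to write to the key. In a batch file, the same change would be one extra Registry update after the import.

```python
# Service key relative to HKEY_LOCAL_MACHINE.
SERVICE_KEY = r"SYSTEM\CurrentControlSet\Services\DHCPServer"

# Standard Windows service start types stored in the "Start" value.
START_TYPES = {"automatic": 2, "manual": 3, "disabled": 4}


def set_dhcp_start_type(mode: str = "manual") -> int:
    """Write the chosen start type and return the numeric value written.

    Hypothetical helper. Raises KeyError for an unknown mode before
    anything is written.
    """
    value = START_TYPES[mode]

    import winreg  # Windows-only standard library module, imported lazily

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SERVICE_KEY, 0,
                        winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "Start", 0, winreg.REG_DWORD, value)
    return value
```

Running set_dhcp_start_type("manual") immediately after every import keeps the standby server from starting DHCP on its own after a reboot.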
When you test the fail-over, remember to stop the batch file that’s running on the live server and to schedule it on the standby server once the fail-over has completed successfully. If, like me, you automate as much as you can, you’ll make the batch file able to “recognize” when it’s running on the live server and when it’s running on the standby server. If you do, make sure that, when the standby server becomes the live server, the time calculated for the next run of the batch file is not in the past.
How can this problem occur? Consider the following scenario: At 12:30, the DHCP backup occurs. Assuming a 15-minute backup frequency, your batch file will run at 12:37. At 12:40, you run a fail-over. Now, calculating the next run as
{time of last backup} + (1.5 * {backup frequency})
results in
12:30 + (1.5 * 00:15) = 12:52
That is 22.5 minutes after 12:30, truncated to the whole minute. So, the batch file runs at 12:52. However, it’s now on the standby server. The DHCP service started at 12:40, the time of the fail-over; therefore, it will not back up until 12:55. When the batch file calculates the next scheduled run, the newest backup time stamp is still 12:30, so it comes up with 12:52 again. Since it’s just past 12:52, the batch file will be scheduled for 12:52 tomorrow, and it won’t run for another 23 hours and 59 minutes.
One solution would be to change the time of the file you are checking at the time that you run the fail-over process. Another solution would be to check to see if the time for the next scheduled run is in the past and, if it is, to add another 15 minutes to it.
There is no need to worry too much about this situation. It will become a problem only if you need to fail back to the previous server before the same time on the following day.
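The second solution, checking whether the calculated time is in the past, might look like the Python sketch below. It is an illustration under assumptions: the function name is hypothetical, the interval is added repeatedly (rather than once) so that a long gap is also handled, and the result is truncated to the whole minute before the comparison so that the scheduled minute itself is never already gone.

```python
from datetime import datetime, timedelta


def next_run_not_in_past(last_backup: datetime, interval: timedelta,
                         now: datetime) -> datetime:
    """Return {last backup} + 1.5 * {interval}, pushed forward until it is after now."""
    run_at = (last_backup + interval * 1.5).replace(second=0, microsecond=0)
    # After a fail-over the newest time stamp can be stale, so the first
    # candidate may already have passed. Step forward one interval at a time.
    while run_at <= now:
        run_at += interval
    return run_at
```

In the scenario above, a last backup at 12:30, a 15-minute interval, and a current time just past 12:52 give 13:07 instead of 12:52 tomorrow. Before the fail-over, at 12:40, the same call still returns 12:52.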
Conclusion
Using a few logical steps, it’s possible to automate the process of keeping a server in standby readiness to take over the DHCP service at a moment’s notice and with little or no manual intervention. Once such a process is in place, it’s possible to use it to free up the DHCP server for support or maintenance tasks without interrupting the DHCP service. It even can be done during normal working hours without affecting the clients. In part 2, we’ll provide the code for implementing this process.
Richard Charrington’s computer career began when he started working with PCs, back when they were known as microcomputers. Starting as a programmer, he worked his way up to the lofty heights of a Windows NT Systems Administrator, and he has done just about everything in between. Richard has been working with Windows since before it had a proper GUI and with Windows NT since it was LANManager. Now a contractor, he has slipped into script writing for Windows NT and has built some very useful auto-admin utilities.
The authors and editors have taken care in preparation of the content contained herein, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for any damages. Always have a verified backup before making any changes.