In a previous post I detailed our company's transition away from tape to a system combining local disk-based backup with off-site storage. Within the first few weeks, I had to do a couple of file restores. A while later, however, I was forced to use the option of running a complete virtual machine (VM) on the backup server.
It was one of those days when a seemingly innocent change leads to a tale of woe. In my case, it was a SharePoint problem. The details are for another post, but the bottom line was that after I made a configuration change and then reversed it (or so I thought), we lost our intranet site.
Restarting Internet Information Services (IIS), restarting the server, and re-entering credentials all made no difference. Enabling Failed Request Tracing in IIS showed me I had an authentication problem, but I had no idea how to fix it. I didn’t know what file or files were responsible, so I couldn't easily restore something to make it right. I certainly didn't want to resort to a reinstall of SharePoint.
I spent more than two and a half hours trying to fix it before deciding enough was enough. Although it was "only" our intranet, I felt a half day's downtime was about the maximum I could live with. To begin the recovery process, I shut down the problem server.
Off the network
I first wanted to verify that a complete system restore would get me working again, so I logged on to the backup server interface and switched to the Local Virtualization tab. This tab lists the protected servers, and a drop-down lets you select either the latest VM or a VM based on an earlier backup (Figure A). It also shows the current resource allocation and confirms that the VM will not be connected to the external network on startup, which is always the safest way to start it.
Selecting system image from available backups
Having selected the most appropriate backup image, I clicked the Start VM As Of link. When the backup server has provisioned and started the VM, the available menu options change to those in Figure B. To log in to the system, I did the following:
- Clicked Connect To VM Via RDP (which writes a preconfigured Remote Desktop connection file to a location of your choice).
- Double-clicked the connection file. If all is well, this should bring up the Windows prompt for Ctrl-Alt-Del. Since you can't send that key combination directly, I went back to the backup server menu and clicked Send CTL-ALT-DEL.
- Logged in to the VM as normal.
Note: When a backup VM like this first starts, Windows typically installs drivers and asks for a restart. However, my tests have shown that it's not necessary to reboot straight away if you just want to check that applications will run.
I could now verify that the intranet site was running and accessible locally on the backup VM. Whatever I'd broken on the production server wasn’t broken on this image.
Here’s the position I was in:
- My production server (which also happened to be a VM, hosted on a dedicated virtual host) was broken and shut down. Normally this server has a fixed IP address, set by a DHCP reservation.
- The backup VM was running on the backup server. It was not broken, but it was disconnected from the network.
I’d proved that the backup VM worked OK in isolation. I then decided to connect it to the LAN in order to satisfy myself I could browse to the intranet. Browsing by URL failed, but I soon guessed this was a DNS problem because the backup VM didn’t have the normal, reserved, IP address of the production VM. I found out what IP address it had, browsed to that address, and was able to bring up the intranet.
All this time I assumed that colleagues would be unable to use the intranet because they didn’t know the current, temporary, IP address it was using. But then people began asking me if I’d fixed it because they could see it again. This was because DNS had automatically added a new entry for the new IP address. Although I hadn’t intended other staff to use the backup VM, since they could see it, I decided to leave it on the network but had to post a warning that any updates they made would be lost when the recovery process completed.
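A quick way to spot this kind of mismatch is to ask DNS what it currently returns for the intranet host name and compare the answer with the address the server is supposed to have. The Python sketch below illustrates the idea under stated assumptions; it is not part of the backup product. The function names are made up for this example, and the `resolver` parameter is a hypothetical hook that exists only so the logic can be exercised without a live DNS server.

```python
import socket


def resolve_ipv4(hostname, resolver=socket.gethostbyname_ex):
    """Return the sorted, de-duplicated IPv4 addresses reported for hostname."""
    _name, _aliases, addresses = resolver(hostname)
    return sorted(set(addresses))


def check_reserved_address(hostname, reserved_ip, resolver=socket.gethostbyname_ex):
    """Compare what name resolution returns with the address the server should have.

    Returns (ok, addresses). ok is True only when resolution returns exactly
    the reserved address and nothing else.
    """
    try:
        addresses = resolve_ipv4(hostname, resolver)
    except OSError:
        # socket.gaierror and socket.herror are both OSError subclasses,
        # so this covers "name did not resolve at all".
        return False, []
    return addresses == [reserved_ip], addresses
```

Calling `check_reserved_address("intranet.example.local", "192.0.2.10")` (both values are placeholders) returns `False` plus the addresses found when the name resolves to anything other than the reserved address, for example the temporary address handed to an emergency VM. The lookup goes through the operating system's resolver, so a local cache or hosts file can affect the result.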
Back on the backup server, on the Local Virtualization tab, I clicked Export Image and chose the .VHD format (our VM host runs Microsoft Hyper-V). After a short while, the .VHD file was ready for me to pick up across the network. On the VM host server, I browsed to the file and saved it. Once the copy was complete, I updated the settings of the production VM to point to the .VHD file I’d just copied.
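Because an exported disk image is large and travels across the network, it can be worth confirming that the copy is intact before attaching it to the production VM. The sketch below shows one way to do that in Python by comparing SHA-256 hashes of the source and the copy. It is an illustration only: the function names and any paths you pass in are assumptions, and nothing here is a feature of the backup server or of Hyper-V.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so a large image never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy source to destination, then confirm both files hash identically.

    Returns the shared SHA-256 hex digest, or raises OSError on a mismatch
    so that a damaged copy is not attached to a VM by mistake.
    """
    source, destination = Path(source), Path(destination)
    shutil.copyfile(source, destination)
    source_hash = sha256_of(source)
    if sha256_of(destination) != source_hash:
        raise OSError(f"checksum mismatch copying {source} to {destination}")
    return source_hash
```

Note that this reads the source file twice (once to copy, once to hash), which doubles the network traffic when the source is on a remote share. For a one-off recovery that is usually an acceptable trade for the extra assurance.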
Now it was just a matter of shutting down the backup VM on the backup server and starting the newly recovered production VM. I was back where I started, thankfully with minimal data loss.
Several points to note:
- The backup server’s ability to act as a temporary host for server backup images could be very useful. It’s essential for us because at the moment we have only one host machine. In this case, we didn’t allow staff to make data changes while using the emergency VM; if that had been necessary, our backup service provider would have helped us to preserve those changes and make sure they were included in the exported image.
- Full recovery was made easier by the fact that the production server was virtual rather than physical. It’s really just a case of moving files around the network. Our backup system can also do "bare metal" restores to physical hardware, but I have not tried it out.
- I was glad not to be trying to do this from tape, which would have been much slower and possibly unreliable.
Finally, I need to confess that we got into some DNS confusion over the days that followed. That too is another story, but could have been avoided if I’d kept the "emergency VM" off the LAN altogether.
A disk-based backup system can act as an emergency virtual host as well as restoring complete server images to virtual or physical production servers. This improves business continuity by reducing downtime for users and shortening recovery time after a server failure.
Mark Pimperton BSc PhD has worked for a small UK electronics manufacturer for over 20 years in areas as diverse as engineering, technical sales, publications, and marketing. He's been involved in IT since 1999, when he project-managed implementation of a new ERP system, and has been IT Manager since 2008. The first major project he undertook in that role was a second ERP deployment. While still involved in operations, system management, and even a bit of development, Mark is now also responsible for IT risk management. He finds that risk assessment leads to many improvement initiatives, such as a current project to switch from tape backup to disk-based and online backup. Mark is fanatical about documentation, taking special care to record unfamiliar processes. His TechRepublic articles on SSL certificates and PCI DSS compliance are prime examples. Mark is married with two grown-up children.