Disaster recovery is the ability to restore your IT-supported critical business processes in the event of a disaster. It is the IT-specific portion of a more general process called business continuity planning, which ensures that the entire business can continue in the case of a disaster. Years ago, we pretended that a nuclear bomb might explode on the company headquarters. With the end of the Cold War, we now talk about the more mundane, but real, possibilities of fire, flood, and terrorism. In one company I worked for, the disaster scenario was that an airplane crashed into the data center. This might seem far-fetched, but that company’s data center was across the highway from the airport. So you never know.
Now, you might be thinking that it’s up to the data center and the operations staff to back up all the data that is needed to recover the application. That may be true, but it’s not necessarily the whole story. Being able to recover critical business applications in the event of a disaster requires a coordinated effort from operations, development, and the business users. The operations staff may be backing up the servers, but they do not understand the business data, what it means, and how it fits together. In fact, they don’t care—it’s all just bits to them. The developers responsible for supporting the applications need to care. You have been entrusted with the IT responsibility for the application, and you must ensure that it can be recovered in case of a disaster. Remember—operations will recover the data, but the developers need to recover the fully functioning application.
Seven actions you can take
From a development standpoint, here are the types of things you should do to ensure that you can recover the critical business applications to keep the business running:
- Make sure all the data is backed up and sent offsite. This may not be as obvious as it seems. Look at each application and think about what files it needs. The servers might be backed up, but is there data on the client machine? How is it recovered? If you utilize tape files, are the tapes backed up and sent offsite? I worked at a company where critical billing tapes were backed up nightly, weekly, and monthly. However, the backups were kept in the same data center as the originals. That would not have helped us in case of a disaster.
- Be sure you can synchronize all the files. This is a critical point to understand. Many applications have multiple databases, sometimes spread among various servers. In a distributed environment, you may also have databases at different locations that process transactions independently of one another. The question is whether you can recover the various parts to make an integrated whole. Do you have some servers backed up weekly and some daily? Check all of these backup frequencies to make sure that you can get all your files back in sync.
- Check your ability to receive, or recover, data from your trading partners. Do you process files from other applications or from other companies? If you have to recover, will you be in sync with the feeder systems? Can you request a duplicate file from a vendor if you have to recover to a point that’s a day or a week in the past?
- Understand the point in time you would recover to. The developers who are responsible for the recovery need to know the point in time that the recovered data will represent. For instance, if the data is backed up on the weekend, and a disaster occurs on a Friday, then you may have lost up to six days of processing. If all of your data is backed up nightly, you may lose only a day. Also check with the operations staff to ascertain which backup tapes are actually sent offsite. In some companies, the most recent backup tapes are kept onsite in case they are needed to recover a specific server. It may be the second-most-recent backup that is sent offsite. If, for example, your data is backed up nightly, you may find that your recovery tapes are two days old. The most recent backup may be onsite and destroyed in the disaster.
- Try to recover forward from the backup date. Once you understand the point in time that your recovered system represents, see if you can recover from there to the time of disaster. For instance, let’s say you back up your financial databases on a nightly basis. A fire in your data center occurs at 4:00 P.M. the next day. Can you recover the data and transactions from the current day? Hopefully, you can get close. This may require, for example, that the main databases be backed up nightly and the transaction log files be backed up hourly. Communicate to your business clients what their disaster risk is. It may not be practical to recover to the exact point in time of the disaster. In that case, the business will be responsible for manually recovering what was lost from the current day.
- Make sure you can fully recover in a reasonable time frame. Being able to recover an application up to the prior night is great, but it loses much of its value if it takes four days to recover the application. Again, answering this question requires collaboration with the operations folks. Where is your backup site and how quickly can you be up and running there in the event of a disaster? If you don’t have a backup site, how long will it take to acquire new hardware? How will communications be restored? How long will it take to retrieve the proper backups from the offsite facility and have them restored? You need to think through all this to make sure that you can recover in a reasonable time frame.
- Understand the priority of applications to be recovered. If a disaster occurs, the scene will be chaotic. If your company has dozens (or thousands) of applications, do you know the order in which they should be recovered? If not, you should work with the business to classify the applications. Those that are most critical to the business should be recovered first. Don’t spend your time recovering a monthly reporting system before you have your critical billing and accounts receivable systems up. In fact, you may determine that some applications are critical, while others do not need to be recovered at all.
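The recovery-point reasoning in the list above comes down to simple date arithmetic. Here is a minimal Python sketch of it, not a finished tool. The function names `recovery_point` and `days_lost`, the `copies_kept_onsite` parameter, and the sample dates are all hypothetical. The sketch assumes that the newest onsite backups are destroyed along with the data center and that everything older is safely offsite.

```python
from datetime import date, timedelta
from typing import Optional


def recovery_point(backup_dates: list[date], copies_kept_onsite: int,
                   disaster: date) -> Optional[date]:
    """Date of the newest backup that would survive offsite, or None."""
    taken = sorted(d for d in backup_dates if d <= disaster)
    # Assumption: the newest `copies_kept_onsite` backups are destroyed
    # together with the data center.
    surviving = taken[:max(0, len(taken) - copies_kept_onsite)]
    return surviving[-1] if surviving else None


def days_lost(backup_dates: list[date], copies_kept_onsite: int,
              disaster: date) -> Optional[int]:
    """Days of processing between the recovery point and the disaster."""
    point = recovery_point(backup_dates, copies_kept_onsite, disaster)
    return None if point is None else (disaster - point).days


friday = date(2024, 3, 8)  # hypothetical disaster date (a Friday)

# Weekly backups taken on Saturdays and sent offsite right away.
weekly = [date(2024, 2, 24), date(2024, 3, 2)]
print(days_lost(weekly, 0, friday))    # 6

# Nightly backups, but the most recent tape stays onsite.
nightly = [friday - timedelta(days=n) for n in range(1, 8)]
print(days_lost(nightly, 1, friday))   # 2

# Two data stores with different recovery points are out of sync.
print(recovery_point(weekly, 0, friday) == recovery_point(nightly, 1, friday))  # False
```

Running the same calculation for every database and file an application depends on is a quick way to spot the ones that would come back at different points in time.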
Plan and test for disaster
After thinking through these areas, you should create a disaster recovery plan for each important application. The plan is needed because five years from now, you may not be around, and the person who has to recover the application may have a lot less knowledge than you do. The plan should describe how to recover the application, key emergency contacts, minimum hardware requirements, etc.
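One way to keep such a plan from going stale is to store its outline in a structured form that can be checked for gaps. The following is only a sketch under that assumption. The `RecoveryPlan` class, its field names, and the sample entries are hypothetical placeholders, and a real plan would carry far more detail.

```python
from dataclasses import dataclass, field, fields


@dataclass
class RecoveryPlan:
    """Hypothetical outline of a disaster recovery plan for one application."""
    application: str
    recovery_steps: list[str] = field(default_factory=list)
    emergency_contacts: list[str] = field(default_factory=list)
    minimum_hardware: list[str] = field(default_factory=list)


def missing_sections(plan: RecoveryPlan) -> list[str]:
    """Names of the plan sections that are still empty."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name)]


# Placeholder content for illustration only.
plan = RecoveryPlan(
    application="billing",
    recovery_steps=[
        "Restore the database from the offsite backup",
        "Apply the transaction log backups",
    ],
)
print(missing_sections(plan))  # ['emergency_contacts', 'minimum_hardware']
```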
Lastly, you should recommend that your company plan a disaster recovery test at least once a year. During the mock exercise, you can work with the operations folks to retrieve the backups and try to recover the applications. When you are finished, have the business people work with the application to see if it works as it should. Is the data synced up? Are the reports accurate? If you can recover in a mock disaster, you should feel comfortable that you could do it in a real emergency.
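The results of a mock exercise are easier to act on if you record them the same way every time. The sketch below shows one possible scorecard, assuming you measure the hours each application took to recover and agree on a target with the business beforehand. The `DrillResult` class, its fields, and the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    application: str
    hours_to_recover: float   # measured during the mock exercise
    target_hours: float       # time frame agreed with the business
    business_verified: bool   # users confirmed the data and reports look right


def recovered(result: DrillResult) -> bool:
    """An application counts only if it came back in time and works."""
    return result.business_verified and result.hours_to_recover <= result.target_hours


def summarize(results: list[DrillResult]) -> tuple[list[str], list[str]]:
    """Split application names into (recovered, not recovered)."""
    ok = [r.application for r in results if recovered(r)]
    failed = [r.application for r in results if not recovered(r)]
    return ok, failed


# Made-up drill numbers for illustration only.
results = [
    DrillResult("billing", 20, 24, True),
    DrillResult("accounts receivable", 30, 24, True),
    DrillResult("monthly reporting", 10, 72, False),
]
ok, failed = summarize(results)
print(f"Recovered {len(ok)} of {len(results)}: {ok}")
print(f"Needs work: {failed}")
```

In this sample, only billing passes: accounts receivable missed its time target, and monthly reporting came back quickly but was not verified by the business users.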
Are you up to the task?
Here’s my challenge to you: If you have a set of critical applications with any kind of complexity, I’ll bet you will not be able to fully recover more than half of them in a reasonable time frame. Take the challenge. Then you’ll know what your business risk is. In my next column, I will describe a case study of how one large company planned and tested for potential disasters.