Different companies place different priorities on the ability to recover production applications in case of a disaster. Some companies are very diligent in their approach. Others are blissfully ignorant of the entire concept. In general, larger companies tend to have more formal processes in place because they have so much more at risk if disaster strikes.
I worked for a large company in the past that tended to be more diligent than most. It had a strong commitment to disaster recovery that had been embedded in the culture for over 20 years. This focus started with the mainframe and was later adapted to cover the midrange and server platforms as well. This article provides a general overview of the processes this company used to ensure that all business applications could be recovered in case of a disaster.
First, all production applications were classified by their criticality to the business and how soon they would need to be up and running. Class 0 applications had to be recovered within one day, class 1 within two days, and class 2 could wait up to two weeks. The last classification, class 3, carried no recovery time commitment at all. In practice, this simply meant that the IT staff was not going to worry about class 3 applications if a disaster struck, although they could technically be recovered at a later date. However, because there was no commitment in place, it was also possible that they would prove unrecoverable. A lot of applications fit this category.
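The classification scheme above is simple enough to sketch in code. This is a hypothetical illustration only: the `RECOVERY_OBJECTIVES` mapping and the `must_recover_by` helper are my own names, not anything the company actually implemented.

```python
from datetime import datetime, timedelta

# Illustrative mapping of recovery class to recovery time objective.
# None means no commitment (class 3).
RECOVERY_OBJECTIVES = {
    0: timedelta(days=1),    # class 0: back up within one day
    1: timedelta(days=2),    # class 1: within two days
    2: timedelta(weeks=2),   # class 2: within two weeks
    3: None,                 # class 3: no time commitment
}

def must_recover_by(recovery_class: int, disaster_time: datetime):
    """Return the recovery deadline, or None if there is no commitment."""
    objective = RECOVERY_OBJECTIVES[recovery_class]
    return None if objective is None else disaster_time + objective
```

Encoding the objectives as data rather than prose makes it easy to report, for any application inventory, which deadlines would apply after a given disaster date.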
Disaster recovery manuals
Class 0, 1, and 2 applications were each required to have a disaster recovery manual. These manuals had certain defined sections that walked through the process needed to properly back up and recover the application. A manual could be short for a simple application, but it grew correspondingly complex for a complicated one.
The point of the manual was to enable someone who was unfamiliar with the application to successfully recover it. Remember, if a disaster struck, we might have to recover the applications at a third-party hotsite. There would not be space or hardware for every developer. So, it was possible that a small number of operations and development staff might have to get everything back up and running, even applications they were not familiar with. The disaster recovery manual would allow them to do that.
The key focus of the process was a mock disaster exercise that was held twice per year.
- The participants. Class 0 applications had to participate each time. Class 1 applications had to be recovered once per year. Class 2 applications participated once every two years.
- Planning. The operations group led the exercise, and each time it was planned well in advance. Two days were set aside for the exercise. A date was picked for the mock disaster, and the various participants were allowed to make sure that they had the appropriate backups for that date available. In other companies, these exercises are held by surprise, but at this company that was seen as too disruptive. So it was planned so that people could build the time into their schedule.
- Recover the infrastructure. The operations people would come in first and start working at the hotsite. Some people would be physically at the hotsite, while others worked remotely. Since the hotsite was shared by other companies as well, the machines were basically blank slates. First the system software, operating systems, and databases needed to be recovered. For a two-day exercise, this might take more than a day.
- Recover the applications. Then, the developers had their turn. If the application was simple, everything might be in place to test the application and ensure that it was working correctly, with all databases synchronized. If the application was more complex, there might be additional items to recover. These could include manual input from the disaster date going forward, interface databases or files from other applications, files from vendors, etc. However, when all was said and done, the application had to be recovered by the time the mock exercise window closed.
- Shutdown. The operations staff was responsible for shutting down the exercise and getting everything back to normal.
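The participation cadence from the first bullet above can be expressed as a small rule: with two exercises per year, class 0 joins every exercise, class 1 every second one, and class 2 every fourth. This is a hypothetical sketch; the sequential exercise numbering and the modulus rule are my own framing, not the company's scheduling system.

```python
# Exercises run twice a year. Interval is measured in exercises:
# class 0 -> every exercise, class 1 -> once per year (every 2nd),
# class 2 -> once every two years (every 4th). Class 3 never participates.
PARTICIPATION_INTERVAL = {0: 1, 1: 2, 2: 4}

def participates(recovery_class: int, exercise_number: int) -> bool:
    """Does an application of this class take part in this exercise?"""
    interval = PARTICIPATION_INTERVAL.get(recovery_class)
    return interval is not None and exercise_number % interval == 0
```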
A week or so after the exercise, a key learnings session was held with all the participants of the mock exercise. Each person described whether he or she had been successful and discussed anything that should be done differently next time. From an application perspective, any problems were addressed by updating the disaster recovery manual and making appropriate changes to backup procedures. My experience was that roughly a third of the applications had recovery problems in any given exercise. Examples of problems that needed to be remedied were:
- Files that were not properly backed up. Yes, the operations staff was supposed to back up all production servers, but with hundreds of servers to manage, sometimes proper backups were not being made. This was especially true if a new server was in place or if files were moved from one server to another. For server-based applications, sometimes a critical file was still located on a test server and was not properly moved to the production box. Since the test servers were not a part of the exercise, the file would not be available.
- The data was not synchronized. Sometimes there would be conflicts between files that were not all backed up at the same frequency. If one file was backed up on a weekly tape and another was backed up daily, the application might not behave properly when recovered.
- Interface files were not recovered properly. Your application might be up fine, but it might require a database or file from another application. If that application was unable to recover properly, yours might not either. Remember, it’s an interconnected world we live in.
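The synchronization problem in the second bullet lends itself to a simple automated check: compare the most recent backup date of every file an application depends on, and flag any file whose backup lags the newest one. This is a hypothetical sketch of such a check; the function name and the sample file names are illustrative.

```python
from datetime import date

def find_sync_conflicts(last_backup: dict[str, date]) -> set[str]:
    """Return files whose latest backup predates the newest backup in the set.

    Restoring these files alongside the others would leave the
    application's data out of sync, as described above.
    """
    if not last_backup:
        return set()
    newest = max(last_backup.values())
    return {name for name, d in last_backup.items() if d < newest}
```

Running a check like this against the backup catalog before the exercise date would surface weekly-versus-daily mismatches before anyone tried to restore from them.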
The company found these exercises valuable. Given the range of problems and gotchas that were encountered every time the test was run, I think it would have been impossible to recover the environment and the critical applications if these exercises were not held at all.
You should also be concerned about your ability to recover your critical business applications. If you do not hold actual mock disasters, at least consider periodic tabletop exercises to ensure that you know what needs to be done when, and how to make sure that you are prepared when disaster strikes. The odds are that it will not happen to you. However, the odds are that it will happen to some of the thousands of people who read this column. Take a page from the Boy Scouts: Be prepared!