It’s hard to imagine a worse day for a server’s hard drive to crash than the last day of the fiscal year. When this happened to my company, we couldn’t close the books for month-end or year-end until the systems were fixed. Reaching for our disaster recovery plan, we found it lacking updates and adequate recovery programs. To make it worse, the plan was dependent on people who had left the company months earlier. It took us two days to get the system back up.
To prevent the next disaster from becoming a tragedy, we created a new set of procedures and introduced an online intranet that would allow all staff to access and update recovery procedures. Here’s how we did it.
Second of three parts
This is the second in a series on what one company did following the crash of a hard drive on a critical server. The first article, “Make sure your disaster recovery plan will work when it's needed,” detailed the company’s hard-learned lessons.
Develop an intranet
We knew from the crash that the following items needed to be tracked:
- Listings of all financial production system files
- Determinations of whether files were critical
- The update schedule
- The name of the person responsible for a file
- What was necessary to fix a file
- The estimated time of repair
With these points in mind, we started our new system by developing a very simple flowchart, as shown in Figure A.
This flowchart showed all of the important steps needed to get the production system and key files back online. From this flowchart, we derived a prototype of an online intranet that we would develop solely for users to access to get the system back online as quickly as possible. With the prototype, we placed the emphasis on how the system would function from the user perspective, not the back-end functionality that the users wouldn’t see.
The prototype consisted of an online system that all developers and managers who would be affected by a system crash could access. It was a single Web site that would contain all of the information needed to assess the severity of the problem—such as which files are corrupted, how long it will take to get them back online, what needs to be done to fix the files, etc. The prototype emphasized why each data element was chosen.
- File Name: The files in all of the systems
- Owner: The programmer responsible for the file
- Team: Which team would be responsible if the owner was not available
- Fix status: How bad the file was corrupted, whether it would be easier to restore it, or whether the file index was bad
- Time to completion: How long it takes to rebuild this file
- Critical: Whether the file critical is or can be ignored for now and fixed later, for example, a log file isn’t critical and would be fixed after the primary data files are restored
- Online update: Whether the file was being updated online when the system crashed, or was updated during a nightly batch run
- ISAM check: If an ISAM file, whether we should run an ISAM verification utility against it
- Batch update: Which programs update this file in batch mode
- Notes: Programmer notes about certain files and programs, for example, programA should be run prior to running programB every time this file is corrupted
We then laid out the prototype to give management an idea of what we were planning to develop. Figure B shows an example of the interface that the users will access online. It illustrates in a Web site format which data is captured based on what we discussed above.
By using a prototype such as this, the managers and users will know exactly what they will be getting and how it will work before the actual system has one line of code written.
From the prototype, you can see what critical elements will be accessed. The first column lists the filenames and where they're physically located on the drives. If a user clicks on this file, a window will pop up containing any notes or specific procedures that will need to be run to correct the file.
The Owner column has a drop-down box that lists all of the programmers in the department. The users of the system will choose which programmer is responsible for this specific file.
The Team column has a drop-down box showing which team is responsible for this system. In this case, the drop-down box has Order Entry team, Financial team, and Payroll team.
The next two columns—Fix Status and ETA—would be used by the programmer in the recovery mode to relay to managers the status of the files. For example: Do they have bad data or a bad index, or has the file correction been completed? All of these options will be in a drop-down box that the programmer can choose from.
ETA will be a guess, based on the programmers' experience, of when the file will be fixed and ready to go.
The last three columns, Critical, Online Update, and ISAM Check, are check boxes that will be checked if this specific file is critical, updated online, or needs to have a ISAM check run.
With this Web-based system, managers can see from their desks the status of the recovery process taking place by just watching the ETAs and file statuses. A programmer can use the system to see what needs to be done to correct the files and who should be correcting which file during a disaster. After we got management's approval, the serious part of the development would start.
The data definition
The feedback from the managers was very positive. They had no comments since this was so much more advanced than the outdated spreadsheet that had been used in previous years. We were given the go-ahead to proceed. We developed the table shown in Figure C, listing the data definitions for the various items that needed to be captured. This was done so that the DBA would know how we wanted the actual table to be set up.
This represents the shift in the development process from user mode to developer mode. We're now defining items in a way that a database administrator would use to set up a table in Microsoft SQL Server. This could be applied to any of the databases on the market today. This transition illustrates the move from what the users want to a picture of what database administrators need to do their job.
Each of the column headings describes what is being captured. You can see that Column Name is the name we would like to call the data element captured. Abbr_nme is the abbreviated name for this data element. Description is what this particular piece of information is. It's documented in a format that is end-user oriented. The Type field describes the type of data we want to capture. Size is the size of the field. Null represents whether a field can be blank. The last column, Values/default, tells what the table will come up with if nothing is entered.
We originally developed this database to work with Microsoft Access. However, our company standards don’t allow Access to be a primary database for intranet systems. After discussion with the DBA, we decided to use SQL Server. (This is the company standard that was recently implemented.) The DBA took this layout and modified it to work with SQL Server. His comments were incorporated to optimize SQL Server interaction.
Time to organize
Now that we have a table generated, it is time to step back and organize the next steps to get this system online. With the spreadsheet, system flowchart, prototype, data definition, and the table that was created, the coding of the actual intranet development can begin. Even though this is for the in-house user community, the users will be developers who will quickly pass judgment on the usefulness and flow of the site.