Today, the demand for more information, delivered faster, has pushed even traditionally non-mission-critical systems into a mission-critical role. Data has become invaluable to every company. One important job of the database administrator (DBA) is to ensure the safety, integrity, and recoverability of the system's data. In the past, a DBA would typically set up a backup and recovery plan for the databases, ensuring that a complete image was taken on a regular schedule and that the transaction logs were backed up so that, in combination, these backups could recover the database to a specific image copy, a specific point in time, or as close to current as possible.
This approach was relatively simple, taking into account the applications it was supporting, the frequency of transactions, and the customer's tolerance for failure. The backup and recovery plan, excluding disaster recovery, was to recover the data in place, if possible. Back then, the term 24x7 was usually a misnomer: it really meant there was time built into the year for system maintenance, such as reorganization or software and hardware upgrades.
Backups in a 24x7 environment
Today when we speak of a 24x7 system, it is closer to the literal truth. Even software and hardware upgrades must be done without interruption to the applications. In response to this demand, integrators are designing systems with high levels of fault tolerance. Systems are configured with redundant servers for applications, databases, communication servers, power, networks, and so on—right down to the last bolt. The only missing piece is the duplicate manpower to implement it and maintain it.
So how does the DBA, faced with all these levels of fault-tolerant hardware and software, develop and implement a backup and recovery plan that leverages and integrates with these fault-tolerant and fail-over processes? Is there still a need to create traditional backups for the database systems running in these environments? The answer is yes! However, there are many more variables to take into consideration.
New fault-tolerance methods
Take the example of a system running an operational (transactional) database server, such as an accounting system. The database server logs the transactional information to provide a means of recovery in the event of a system or database failure. For the purposes of this discussion, let's say our system is configured with two database servers connected to the same RAID array; in other words, both database servers can see the same disk. Since two database servers cannot manage the same database objects at the same time, we will have designed the system so that only one database server is online at a time. This database server is known as the primary server. The second database server sits idle, waiting to be activated in a failure; it is known as the secondary server. Naturally, we have configured this second database server to be identical in its hardware, software, and configuration to our primary server. In this way, the secondary server will be able to quickly come online and run within our system in the event the primary database server fails.
This type of configuration is not new, and in the past some systems have used a form of database replication to keep the two systems in sync. The issue with this type of synchronization has always been the difficulty in implementing and maintaining it. In addition, it usually applied to a database within a database server (rather than the complete database server itself) or was based on the primary server's transaction logs. In the latter case, a transaction log was not shipped to the secondary server until it was complete, leaving a failure window equal to the completion time of the transaction log. This time could vary with the transaction load on the system. Although this approach provides a better measure of fault tolerance and the promise of shorter downtime, the secondary database server is still only as good as the synchronization of its data with the primary.
To address this issue and minimize the synchronization lag of database data between the two servers, some systems today are opting to use disk replication software such as Legato. These software solutions ensure that any changes to the primary disk are written to the secondary disk. In addition, they can be configured to watch processes on a specified server, such as a database server, and to fail over to the designated secondary server if the primary fails to respond. In the event of a failure, the watching process has to detect it and execute the steps needed to repoint the applications and user processes to the secondary server. All this fail-over is supposed to take mere seconds and show no noticeable failure to the user, so that, theoretically, the system is never interrupted.
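As a rough illustration, the watching process can be sketched as a polling loop: probe the primary, and after a run of consecutive misses, promote the secondary. Everything here is hypothetical; the check and promote callables stand in for the real health probe and fail-over steps that a product such as Legato would perform:

```python
from typing import Callable, List

def watch(check: Callable[[], bool], promote: Callable[[], None],
          max_misses: int = 3) -> int:
    """Poll the primary via check(); after max_misses consecutive failed
    probes, call promote() to activate the secondary server.
    Returns the number of probes made before failing over."""
    misses = 0
    probes = 0
    while misses < max_misses:
        probes += 1
        if check():
            misses = 0        # primary answered; reset the miss counter
        else:
            misses += 1       # another consecutive miss
    promote()                 # repoint applications at the secondary
    return probes

# Simulated probe results: two healthy responses, then the primary dies.
results = iter([True, True, False, False, False])
events: List[str] = []
probes = watch(lambda: next(results), lambda: events.append("failover"))
print(probes, events)  # → 5 ['failover']
```

Requiring several consecutive misses avoids failing over on a single dropped probe; in a real deployment the probe would also need a timeout so that a hung primary counts as a miss.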
The bottom line is that the system fails over to a secondary server if the primary server fails, the disks are fault tolerant with RAID arrays protecting them, and network access is typically duplicated and redundant. Put all these factors together and a failure is usually invisible to the end user. This type of fault-tolerant environment can also provide a way to complete software or hardware upgrades without system interruption.
So why go through the steps to analyze and determine a sound backup and recovery strategy in such a perfect fault-tolerant world? Well, backups are still important for several reasons.
Backups become more complex
A fault-tolerant system is wonderful for helping to ensure the system is available to users continually; however, it does not protect against disasters, user error, or a multiple-component failure.
The traditional backup and recovery strategies are still required; they simply must be modified to integrate with the other fault-tolerant configurations such as hot or cold standby servers, disk replication software, and network address hiding and automatic IP switching. These are just some of the variables mentioned in the 24x7 sample configuration that we've been talking about.
Database image copies, which are the foundation of a backup strategy, should still be run. A DBA must still weigh the impact to the system and the customer's failure tolerance when determining how often to run a complete database image copy. It is even more important that a copy of the database image be written to transferable media, such as tape, and sent offsite to facilitate disaster recovery.
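One way to reason about the image-copy frequency is to work backward from the customer's failure tolerance: worst-case recovery time is roughly the time to restore the last image copy plus the time to replay every log written since it. The sketch below assumes that simplified model; the replay rate is workload-dependent and would have to be measured, not assumed:

```python
def max_image_interval_hours(restore_hours: float,
                             replay_rate: float,
                             tolerance_hours: float) -> float:
    """Longest image-copy interval that keeps worst-case recovery
    within the failure tolerance.  replay_rate is hours of log replay
    needed per hour of logged activity (a workload-dependent figure)."""
    replay_budget = tolerance_hours - restore_hours  # time left for log replay
    if replay_budget <= 0:
        raise ValueError("restoring the image alone exceeds the tolerance")
    return replay_budget / replay_rate

# Restoring the image takes 1 hour, replay runs at 0.25 hours of replay
# per hour of activity, and the customer tolerates 4 hours of downtime:
print(max_image_interval_hours(1.0, 0.25, 4.0))  # → 12.0
```

Under these assumptions, an image copy roughly every 12 hours keeps worst-case recovery within tolerance; a heavier transaction load (a higher replay rate) shortens the interval.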
Transaction log backups make up the second layer of the backup and recovery strategy. In a system designed with a secondary database server, transaction log backup intervals may be influenced by the method of synchronizing the primary and secondary systems. Take the example of a system that keeps a secondary server in sync using the kind of log shipping available in SQL Server. To configure this type of fail-over strategy, the DBA has to consider some of the following questions:
- If the secondary system is being synchronized by the shipment of transaction log files, what is the best interval?
- How will the amount and frequency of the shipment of this data affect the network?
- What is the delay in backing up and shipping the files?
- How will this delay affect the ability to recover the system?
- What is the purpose of the secondary system? (Consider fail-over, system maintenance, and workload balancing.)
- If the network or server fails, how will the available logs be accessed in order to complete the database synchronization on the secondary?
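The mechanics behind these questions can be sketched as a simple shipping loop: back up the active transaction log, copy the file to the secondary, and wait out the chosen interval. The backup_log and send callables below are placeholders for the engine-specific commands. Note that transactions in a log not yet shipped exist only on the primary, so the failure window is roughly the interval plus the backup-and-transfer delay:

```python
import threading
from typing import Callable

def ship_logs(backup_log: Callable[[], str],
              send: Callable[[str], None],
              interval_s: float,
              stop: threading.Event) -> None:
    """Repeatedly back up the transaction log and ship the resulting
    file to the secondary, pausing interval_s between cycles."""
    while not stop.is_set():
        log_file = backup_log()  # engine-specific transaction log backup
        send(log_file)           # copy to the secondary for restore
        stop.wait(interval_s)    # the interval drives the failure window

# Simulated run: stop after three log backups.
shipped = []
stop = threading.Event()
count = [0]

def fake_backup() -> str:
    count[0] += 1
    if count[0] >= 3:
        stop.set()
    return f"log-{count[0]}.trn"

ship_logs(fake_backup, shipped.append, 0, stop)
print(shipped)  # → ['log-1.trn', 'log-2.trn', 'log-3.trn']
```

Shortening the interval narrows the failure window but increases network traffic and the load of frequent log backups, which is exactly the trade-off the questions above ask the DBA to weigh.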
The questions above show that although fault-tolerant environments have increased the uptime of database servers, they have also added a level of complexity to the traditional database backup and recovery strategy. It is even more critical that the DBA fully understand the architecture of the database environment in order to ensure all possible failure scenarios are addressed. The answer is not to pick one strategy over another, but to marry the fault-tolerant architecture and strategy with traditional backup and recovery. Find an image-copy interval that meets the recovery requirements in the case of a multiple-component failure or disaster. Choose a transaction log backup interval that satisfies both the log shipment requirement and the recovery objectives. The image copies and backed-up logs should not only be shipped to the secondary system, but also backed up to another system and then copied to tape, so that there are copies both on-site (for quick recovery) and off-site (for disaster recovery).
The bottom line: even when a system is fault tolerant and protected against component failure and maintenance outages, a solid backup and recovery strategy is still a vital piece of protecting data. There are simply more variables influencing the backup and recovery plans and decisions that a DBA has to make.