Oracle touts Oracle9i as the “unbreakable” database and claims that their Real Application Clusters (RAC) product can provide an architecture with guaranteed continuous availability.
The RAC product supersedes the older cluster product, Oracle Parallel Server (OPS). In this article, we’ll explore the architectural differences between RAC and OPS. We’ll begin by outlining the evolution of Oracle clustering solutions.
A recovery timeline
Since Oracle introduced recovery products 12 years ago, their technologies have evolved significantly:
- Traditional recovery (1990-1995): This recovery method required restoring the failed database files and rolling forward using Oracle’s Enterprise Backup Utility (EBU) or the Oracle8 Recovery Manager (RMAN) utility. This type of recovery could take several hours.
- Standby databases (1993-present): Oracle7 introduced mechanisms that allow a standby database to stay constantly in recovery mode and to be refreshed from Oracle’s archived redo logs. In case of failure, the last redo log could be applied to the standby database, and the database could be started in just a few minutes.
- Oracle Parallel Server (1996-2001): The OPS architecture allowed several Oracle instances to share a common set of database files. In case of instance failure, the surviving instances could take over processing. There was a significant performance issue with OPS because shared RAM blocks had to be “pinged” between instances, imposing an additional processing burden on the cluster.
- Real Application Clusters (2001-present): The RAC architecture allows many instances to share a single database, but it avoids the overhead of RAM block pinging. RAC has also been enhanced to work with Oracle’s Transparent Application Failover (TAF) to automatically re-establish connections when an instance fails.
In practice, companies usually choose an Oracle availability strategy based upon costs and their tolerance for unplanned database downtime, as shown in Figure A.
Move over, OPS
It came as no surprise to many Oracle database developers and architects when Oracle chose to discontinue OPS. The OPS product allowed a single database to be shared by many Oracle instances. This architecture was quite good for massively parallel types of applications where the data could be segregated onto multiple Oracle instances. However, the OPS architecture suffered a serious shortcoming: It required that all data blocks be available to all instances. A cumbersome mechanism known as the Integrated Distributed Lock Manager (IDLM) constantly had to “ping” data blocks back and forth between the many instances in an OPS configuration.
To overcome the IDLM problem, Oracle overhauled the architecture of the OPS product and reintroduced it under a new name: RAC. RAC employs a new technology called Cache Fusion, whereby the data block buffers of all instances in the cluster are managed as a single logical cache. When one instance needs a block held by another, the block is shipped directly between the instances' memory over the cluster interconnect rather than being written to disk and reread. Because blocks are quickly available to every database instance, the problem of IDLM pinging is overcome, allowing the systems to run faster and with greater reliability than with OPS.
A peek into the future
Oracle promotes RAC as a generic online transaction processing (OLTP) solution for highly available systems. This is an important departure from their recommendation for OPS, which was mostly used by organizations with massively parallel systems that required continuous availability.
It remains to be seen whether Oracle will get the widespread adoption of RAC that they've been hoping for in the marketplace. Betsy Burton of Gartner noted that adoption of Oracle9i RAC has been quite slow, and she predicts that by the year 2006, only about 10 percent of Oracle users will be utilizing RAC within their production applications.
Does this mean that only 10 percent of Oracle customers require continuous availability? Clearly, the answer is no. Rather, many companies are choosing alternatives to RAC for continuous availability because the installation and configuration costs of RAC are high. Even after installation, you need to have a DBA or database architect on staff for maintenance and support. These positions are difficult to fill, in addition to being costly. A common alternative is for a company to build its own replicated databases and come up with methods that automatically redirect all transactions from a failed database to a backup, where they can be restarted.
Both OPS and RAC are designed to protect only against instance failure. Should any one instance (or the hardware associated with that instance) fail, Oracle's TAF will take over and then redirect any in-flight transactions to a surviving instance. Of course, you can achieve the same objective by using distributed, replicated databases and customized Web server code to redirect failed transactions.
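As a rough illustration of the do-it-yourself approach, the sketch below tries each database in priority order and reruns the whole unit of work on the next one when a database is down. Every name in it (DatabaseDown, FakeConnection, run_with_failover, place_order) is hypothetical and stands in for whatever driver and replication setup a site actually uses; it is a minimal sketch, not a production design.

```python
class DatabaseDown(Exception):
    """Raised by the (hypothetical) driver when a database is unreachable."""


class FakeConnection:
    """In-memory stand-in for a real database connection."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.statements = []
        self.committed = False

    def execute(self, statement):
        if not self.healthy:
            raise DatabaseDown(self.name)
        self.statements.append(statement)

    def commit(self):
        if not self.healthy:
            raise DatabaseDown(self.name)
        self.committed = True


def run_with_failover(connect_functions, transaction):
    """Run `transaction` on the first database that completes it.

    connect_functions: callables in priority order, each returning a connection.
    transaction: callable that takes a connection and performs the whole unit
                 of work; it is rerun from the start on each database tried.
    """
    last_error = None
    for connect in connect_functions:
        try:
            conn = connect()
            result = transaction(conn)
            conn.commit()
            return result
        except DatabaseDown as err:
            # Nothing was committed on this database; move on to the next one.
            last_error = err
    raise RuntimeError("all databases are unavailable") from last_error


def place_order(conn):
    conn.execute("INSERT INTO orders VALUES (1, 'widget')")
    conn.execute("UPDATE stock SET qty = qty - 1 WHERE item = 'widget'")
    return conn.name


primary = FakeConnection("primary", healthy=False)
backup = FakeConnection("backup")
served_by = run_with_failover([lambda: primary, lambda: backup], place_order)
print(served_by)  # prints "backup"
```

The key design choice is that the transaction is passed in as a single callable, so the redirect code can replay it in full rather than trying to resume it halfway through.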
It's important to note that Oracle's TAF tool has serious limitations. The most significant is that TAF does not restart Data Manipulation Language (DML) statements, including inserts, updates, and deletes. For customers using Oracle PL/SQL packages, all package state is lost when a database fails, requiring all PL/SQL stored procedures to be restarted from the beginning. TAF also does not support ALTER SESSION statements across a failover, nor does it support failover of global temporary tables.
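One way an application might cope with lost session settings is to remember every session-level statement it issues and replay them after reconnecting. The sketch below assumes a hypothetical connection object with an execute method; ReplayingSession, LoggingConnection, and connect are illustrative names, not part of any Oracle API.

```python
class LoggingConnection:
    """Stand-in connection that simply records the statements it receives."""

    def __init__(self):
        self.log = []

    def execute(self, statement):
        self.log.append(statement)


class ReplayingSession:
    """Remembers session-level settings so they can be reapplied after a reconnect."""

    def __init__(self, connect):
        self._connect = connect        # callable returning a fresh connection
        self._setup_statements = []    # session settings issued so far
        self._conn = connect()

    def alter_session(self, statement):
        self._conn.execute(statement)
        self._setup_statements.append(statement)

    def reconnect(self):
        """Open a new connection and replay every recorded session setting."""
        self._conn = self._connect()
        for statement in self._setup_statements:
            self._conn.execute(statement)
        return self._conn


connections = []


def connect():
    conn = LoggingConnection()
    connections.append(conn)
    return conn


session = ReplayingSession(connect)
session.alter_session("ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD'")
new_conn = session.reconnect()  # simulate recovery after an instance failure
```

The same bookkeeping idea would apply to package state or temporary-table contents, although rebuilding those is usually more involved than replaying a few settings.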
Note in Figure B that you can select one of the two Oracle failover modes and that you can set a retry parameter.
The fact that the continuously available solution employs a retry parameter is very disturbing to many Oracle database architects because it implies that the failover may not work on the first attempt. Consumers are demanding systems that will automatically and reliably restart any in-flight transactions that might be running during the system failure, and the idea of delayed retries is onerous to anyone counting on continuous availability.
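To see why a retry setting worries people, it helps to work out how long a client could be left waiting. The sketch below models a generic reconnect loop with a retry count and a fixed delay between attempts; it does not reproduce TAF's internals, and the function names and the values 5 and 15 are assumptions chosen only for illustration.

```python
import time


def reconnect_with_retries(connect, retries, delay_seconds, sleep=time.sleep):
    """Call connect(); after a failure, wait and retry up to `retries` times.

    Returns the connection and the number of retries that were needed.
    """
    retries_used = 0
    while True:
        try:
            return connect(), retries_used
        except ConnectionError:
            if retries_used >= retries:
                raise  # retry budget exhausted; the failover has failed
            retries_used += 1
            sleep(delay_seconds)


def worst_case_wait(retries, delay_seconds):
    """Longest time spent sleeping between attempts before giving up."""
    return retries * delay_seconds


# Simulated database that refuses two connections before accepting one.
outcomes = iter([ConnectionError("down"), ConnectionError("down"), "connected"])


def flaky_connect():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


waits = []  # record the delays instead of actually sleeping
conn, retries_used = reconnect_with_retries(
    flaky_connect, retries=5, delay_seconds=15, sleep=waits.append
)
print(conn, retries_used, sum(waits))  # prints "connected 2 30"
print(worst_case_wait(5, 15))          # prints 75
```

With these illustrative settings a client could spend up to 75 seconds just waiting between attempts, which is a long way from an instantaneous failover.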
One final limitation: The RAC solution requires downtime in order to upgrade the Oracle software. Oracle is currently working to create a rolling upgrade technology, but for now, you must take down RAC systems when you upgrade.
While RAC addresses the Oracle “ping” problem, it is an expensive solution to implement. Before you shell out the money for RAC, see if you can build your own replicated databases and use Web servers to direct the failover.