In the past, computer clustering technology was confined to universities and NASA. That was then, and this is now. Thanks to Linux, large-scale clusters that are simple and inexpensive to install are now available to almost every IT shop that wants them. But despite clustering’s expansion into the general IT market, many administrators who are faced with creating a cluster still don’t understand the technology behind these invaluable systems.
In this case (as with most of IT), ignorance isn’t bliss. To rectify the situation, I will present the history of this technology, giving you a solid base of knowledge from which to create and maintain an effective computer cluster. Through an examination of the trailblazing path of the scientific Beowulf cluster, you’ll see a clear picture of where this technology came from, what you can do with it, and how it can best work in your organization. If you are just going through the motions with clustering, it’s high time to get to know it a little better.
Different types of clusters
Beowulf, shared storage, parallel systems, and NOW clusters (switch-based, high-bandwidth clustered workstations) are just some of the clustering concepts that are currently in use and growing more popular every day. Each concept has strengths that make it more appropriate than the others for certain applications. Here is a brief description of each of the main Linux clustering concepts.
1. Scientific clusters
Scientific clusters (like the Beowulf), which are created to solve specific computing problems, mainly consist of off-the-shelf systems and components. The goal of these clusters is to provide as much computational power as a supercomputer, while remaining expandable so that more power can be added when required. In addition to the Beowulf-type cluster, other scientific clusters such as Cplant (used in the study of fluid dynamics) are also used to throw raw computational power at a problem.
2. High-availability clusters
A high-availability cluster is built to do only one job: provide a service and make sure it stays available, even in the event of a failure. A high-availability cluster generally consists of two or more master nodes plus a number of slave nodes that provide the underlying services, such as Web or database query services. If the active master node fails, one or more secondary master nodes are informed (by means of a heartbeat process) and take over the service in a matter of seconds. Because every master node is configured to deliver the same services from the slave nodes, the services remain available as long as at least one master node does. Projects such as Ultra Monkey and the Linux-HA project help to create these types of clusters.
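The heartbeat-and-failover idea can be reduced to a few lines. The sketch below is my own minimal illustration, not code from Linux-HA or Ultra Monkey: each master records when it last heard from its peers, and a simple election picks the highest-priority node whose heartbeat is still fresh.

```python
# Hypothetical sketch of heartbeat-based failover (not Linux-HA code):
# peers that stay silent longer than the timeout are declared dead, and
# the next surviving node in the priority list takes over the service.

HEARTBEAT_TIMEOUT = 3.0  # seconds of silence before a node is presumed failed

def record_heartbeat(last_seen, node, now):
    """Note that `node` was alive at time `now`."""
    last_seen[node] = now

def elect_active_master(last_seen, priority, now):
    """Return the highest-priority node whose heartbeat is still fresh."""
    for node in priority:
        if now - last_seen.get(node, float("-inf")) <= HEARTBEAT_TIMEOUT:
            return node
    return None  # no master alive at all

# Usage: the primary fails, and the secondary takes over.
last_seen = {}
priority = ["master-1", "master-2"]
record_heartbeat(last_seen, "master-1", now=0.0)
record_heartbeat(last_seen, "master-2", now=0.0)
assert elect_active_master(last_seen, priority, now=1.0) == "master-1"
# master-1 goes silent; master-2 keeps beating.
record_heartbeat(last_seen, "master-2", now=4.0)
assert elect_active_master(last_seen, priority, now=5.0) == "master-2"
```

Real heartbeat daemons exchange these keepalives over redundant links (serial plus Ethernet) precisely so that a failed network card is not mistaken for a failed node.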
3. Load-balancing clusters
This third type of cluster makes sure that no one machine in the cluster is overworked. Most commonly, a load-balancing cluster is used in Web hosting, where a service must support many connections while appearing as a single system to the outside world. In this scenario, a master node uses an algorithm to determine which machine in the cluster is least used and dispatches each job to that machine. Less advanced load-balancing clusters may use a round-robin approach instead, whereby each machine is assigned a number and each incoming request is simply handed to the next machine in the numerical lineup. Note that load balancing by itself does not address high availability. Projects that make use of load balancing include the Linux Virtual Server project and MOSIX; the previously mentioned Ultra Monkey project also builds on the Linux Virtual Server.
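The two dispatch policies just described can be sketched in a few lines. This is an illustrative example of the algorithms only (the server names and `least_used` helper are my own invention, not part of Linux Virtual Server or MOSIX):

```python
import itertools

# Two dispatch policies a load-balancing master node might use
# (illustrative sketch, not code from any real balancer).

def round_robin(servers):
    """Cycle through the servers in numerical order, ignoring actual load."""
    return itertools.cycle(servers)

def least_used(load):
    """Pick the server currently handling the fewest connections."""
    return min(load, key=load.get)

servers = ["node1", "node2", "node3"]
rr = round_robin(servers)
assert [next(rr) for _ in range(4)] == ["node1", "node2", "node3", "node1"]

# Unlike round-robin, the least-used policy reacts to the cluster's state.
load = {"node1": 12, "node2": 3, "node3": 7}
assert least_used(load) == "node2"
```

The difference matters when jobs vary in size: round-robin will happily hand a new request to a node already buried in long-running connections, while the least-used policy routes around it.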
Beowulf and its background
When people in the know think of clustering, the first thing that comes to mind is Beowulf. In today’s IT world, “Beowulf” doesn’t so much refer to the Old English poem featuring a hero by the same name but instead is equated with NASA’s technique for pushing the limits on raw computing power. The fact is that Beowulf is clustering in its truest form. Beowulf clusters are relatively easy to set up, inexpensive, and can make very good use of commodity PC parts. But more than price, ease of use, or any of its other advantages, Beowulf’s main appeal comes back to the fact that it is built for raw power—nothing more, nothing less.
In 1994, NASA built the first Beowulf cluster on Slackware Linux. This cluster provided a computing environment capable of solving the exceedingly complex calculations required by the Earth and space sciences research and development labs. Landing probes on comets, sending probes to other planets, computing their orbits, and controlling them from millions of miles away is simply too great a job for one system running Microsoft Excel (even if it has 32 processors and 2 GB of RAM). So NASA created Beowulf.
Since its inception and “proof of concept” at NASA’s Goddard Space Flight Center, Beowulf clusters have popped up in many labs and universities to solve ever larger problems. MAGI, for example, is an eight-node Beowulf cluster that serves as a multiuser machine for large speech-recognition experiments.
Beowulf clusters are very attractive financially, especially when their cost is compared to that of a mainframe. For example, the IBM zSeries 800 (one of IBM’s midlevel Linux mainframes) starts at approximately $400,000. According to IBM, this mainframe replaces “hundreds” of servers. With Beowulf, those “hundreds” of servers equate to a single cluster of 296 IBM xSeries 220 servers. For this example, I’m using IBM’s performance version of the xSeries 220, which starts at $1,349 per server; at that price, the Beowulf setup totals just under $400,000.
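The arithmetic behind that comparison is worth spelling out:

```python
# Checking the cost comparison: 296 xSeries 220 servers at $1,349 each
# versus the zSeries 800's quoted $400,000 starting price.
unit_price = 1349
nodes = 296
cluster_cost = nodes * unit_price

assert cluster_cost == 399_304   # just under the mainframe's price
assert cluster_cost < 400_000
```

Note that this back-of-the-envelope figure covers servers only; switches, racks, cabling, and administration time would add to the real total.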
To put this into perspective, one of the largest Beowulf clusters to date is the ASGARD cluster at the Institute for Theoretical Physics at ETH Zürich in Switzerland. This machine includes 502 processors, 251 GB of main memory, and more than 2 TB of disk space, which enable it to work at speeds of up to 266 billion floating-point operations per second—all at a cost of approximately $905,000. The cluster in my example is roughly half the size of this mega-Beowulf and would cost the same as one of IBM’s midlevel mainframes. Of course, only institutions such as university science departments and aerospace programs typically use a cluster that size.
Beowulf clusters are built on free UNIX-like operating systems such as Linux and FreeBSD. These operating systems cost nothing to acquire (if you are willing to take the time to download an ISO image and burn it to CD). In addition, a Beowulf cluster is well suited to run on everyday, typical PC hardware. In fact, as of 1999, one node of NASA’s “theHive” Beowulf cluster was assembled with the following hardware:
- 64 Dual Pentium Pro PCs (128 processors)
- A 72-port Foundry Networks Fast Ethernet switch
- 8 GB of RAM
- 1.2 terabytes of disk
Off the shelf
All newer nodes on theHive run on off-the-shelf hardware based on Gateway and Dell file servers.
In addition to the hardware, setting up a Beowulf cluster requires some specialized software components. But like the hardware in a Beowulf cluster, the software can change from cluster to cluster. Because there isn’t room here for a comprehensive list, I will present only some of the most common software needed for Beowulf clusters.
- ID management
Centralized ID management ensures that user and process IDs are handled consistently across all machines in the cluster.
- Ethernet bonding
Ethernet bonding provides a way to bond together the throughput of multiple Ethernet adapters in a Linux machine to provide additional bandwidth.
- Operating System
Both the Linux and BSD operating systems provide the base for each node in a Beowulf cluster.
- A distributed computing environment
A Beowulf cluster must have a mechanism by which its component PCs can communicate with one another. This is handled by interfaces such as the Parallel Virtual Machine (PVM) or the Message Passing Interface (MPI).
- Shared memory
Distributed shared memory (also known as Network Virtual Memory) lets the nodes of a Beowulf cluster present their combined memory as a single address space to applications that need it.
In order to function properly, clustered systems need to be able to communicate with each other. Multiple standards have been developed to address this issue, but they all share this bottom line: They were created to facilitate communication between clustered nodes in an effort to maintain order and efficiency. The differences in the technologies for each standard relate mostly to their platform dependencies (from both a hardware and a software perspective). To clarify these differences, let’s take a look at the three most common standards.
Parallel Virtual Machine (PVM)
PVM is an interface that allows for a heterogeneous mix of Windows and Linux/UNIX machines to be networked together while appearing as a single large parallel system. Due to its heterogeneous nature, PVM can be used in almost any type of clustering environment and is common in Beowulf clusters.
Distributed Inter-process Communication (DIPC)
DIPC allows for message passing via almost any type of connection, be it network, modem, serial, and so on. While DIPC works on a variety of hardware architectures, including Intel’s i386, PowerPC, MIPS, Alpha, and SPARC, it runs only under Linux and does not support Windows nodes. DIPC requires minimal changes to the node’s kernel and runs mostly in the user space. The primary goal of DIPC is to become the UNIX Inter-process Communication (IPC) of the cluster, which is an ambitious aim since IPC has been a part of UNIX for such a long time.
Message Passing Interface (MPI)
MPI is available in a number of separate implementations, including the freely available MPICH. The standard is currently in its second generation and has picked up some of PVM’s features, such as fault tolerance and process control, along with greater portability.
PVM and MPI are the two technologies you are most likely to run across when setting up a cluster, and choosing between them is similar to choosing between Linux and Windows. As in that debate, both PVM and MPI have their strengths and weaknesses. For example, while the MPI2 specification calls for a method of fault tolerance, the MPI1 standard does not. PVM, on the other hand, has a mechanism whereby tasks can be registered so that PVM notifies them when another task fails or the status of the virtual machine changes. Part of the reason that MPI and PVM approach their tasks differently is that, while they have similar aims, their design goals were very different. PVM was developed because developers needed a framework for heterogeneous distributed computing. MPI was developed to provide a standard message-passing interface for vendors building massively parallel processor machines, so that their machines could interoperate. This difference in design philosophy also explains why PVM is very well suited to a heterogeneous computing environment, while MPI is better suited to a homogeneous one.
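Whichever library you choose, the underlying programming model is the same: processes that share nothing and cooperate purely by sending and receiving messages. The toy sketch below imitates that model with Python threads and queues in one process; the rank numbering and the `send`/`recv` helpers are my own illustration of the pattern, not PVM or MPI API calls, and a real cluster would use the library’s primitives over the network.

```python
import threading
import queue

# Toy illustration of the message-passing model behind PVM and MPI:
# each "node" is a thread with an inbox, and nodes cooperate only by
# exchanging messages, never by touching each other's memory directly.

NUM_WORKERS = 4
inboxes = {rank: queue.Queue() for rank in range(NUM_WORKERS + 1)}  # rank 0 = master

def send(dest_rank, message):
    inboxes[dest_rank].put(message)

def recv(rank):
    return inboxes[rank].get()   # blocks until a message arrives

def worker(rank):
    chunk = recv(rank)           # receive this node's share of the data
    send(0, sum(chunk))          # send the partial result back to the master

# Sum the numbers 1..100 across four workers.
data = list(range(1, 101))
threads = [threading.Thread(target=worker, args=(r,)) for r in range(1, NUM_WORKERS + 1)]
for t in threads:
    t.start()
for r in range(1, NUM_WORKERS + 1):
    send(r, data[(r - 1) * 25 : r * 25])          # scatter the work
total = sum(recv(0) for _ in range(NUM_WORKERS))  # gather the partial sums
for t in threads:
    t.join()
assert total == 5050
```

The scatter/compute/gather shape shown here is exactly the kind of job a Beowulf cluster excels at: the master splits the problem, the workers crunch in parallel, and only small results travel over the interconnect.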
The world of Linux clustering is an ever-changing technological frontier. Not only are clusters becoming larger, but their cost is also dropping significantly. Add to this newer, simpler clustering software (such as Ultra Monkey) that makes creating a number-crunching monster nearly as easy as installing an operating system, and you can easily see that the Linux clustering horizon is a bright one.