Many companies have large-scale computational needs that demand large-scale processing systems and their accompanying large-scale price tags. Or do they?
Through distributed computing, a company can avoid costly cluster setups by harnessing the idle processor cycles of any number of client and server machines. Combining all these leftover processing cycles can create the number-crunching power needed for almost any job.
In this Daily Drill Down, I’ll discuss the benefits of distributed computing and show you how to install and use Condor, an open source, distributed computing batch system.
What is distributed computing?
A distributed computing system provides a specific set of services (applications, compilers, rendering functions, benchmarking) with certain properties (system names, security, user identification, access to similar functions, and centralized management) throughout a network at times when available processing resources are idle.
Every distributed computing platform provides some type of service. In most cases, a distributed system's goal is to harness the computing power of a great number of machines to perform calculations in science, biology, finance, or other statistical analysis.
Why not just use a cluster?
Many distributed computing applications don’t need the kind of resources required by such number-crunching systems as the Internet-based SETI or protein folding initiatives. In fact, a distributed computing environment can be set up in almost any organization by making use of the spare CPU power on users’ desktop machines.
This approach is much different from setting up a cluster, which is a series of networked computers devoted solely to cluster activities. A distributed node, on the other hand, can handle other tasks, such as e-mail and office productivity software, and tackles distributed tasks only when a set threshold of CPU resources is idle.
The most obvious benefit of this distributed approach is cost savings. You’ve already got all those Pentium-grade desktops sitting around, so why not use them? Of course, nothing’s that simple, and the benefits of using a distributed system instead of a cluster come with corresponding drawbacks.
While a distributed system can save you a bunch of money, such systems are not as powerful as dedicated clusters with the same number of machines. And while the risk of failure is especially low in the distributed computing model (the system can handle machines coming and going at any time), administrators have less control, sometimes none, over the individual machines.
How much power?
Distributed computing systems have the potential for enormous computational power. To illustrate the scale of this untapped resource, I used a calculator offered by distributed computing system vendor Entropia that roughly determines the computing potential of idle CPUs and compares it to the power of several machine configurations.
I calculated the distributed computing power of 15 1-GHz Pentium III processors and 15 1.5-GHz Pentium 4 processors. Assuming these systems were set up in a distributed fashion and using most of their CPU cycles for large-scale computation, they would provide 31.785 gigaflops of computing power, more than three times the 9.6 gigaflops provided by an SGI Origin 2000 server with 16 processors.
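As a rough sanity check on that figure, the calculator's total works out to an effective rate of about 0.85 floating-point operations per clock cycle. The 0.8476 factor below is my own back-calculation from the quoted numbers, not a published Entropia constant:

```shell
# Back-of-envelope check: 15 CPUs at 1.0 GHz plus 15 at 1.5 GHz is
# 37.5 GHz of aggregate clock; at ~0.8476 flops per cycle (assumed
# from the quoted totals) that yields the 31.785-gigaflop figure.
awk 'BEGIN { total_ghz = 15 * 1.0 + 15 * 1.5; print total_ghz * 0.8476 }'
```

The point isn't the exact constant; it's that aggregate idle clock speed, discounted for real-world efficiency, adds up quickly.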
Knowing that you can get so much power from idle processors makes looking into a distributed system that much more appealing. And fortunately, there’s an open source alternative to products from vendors such as Entropia and United Devices.
Installing Condor
Now let's take a look at actually setting up a distributed system with the open source solution, Condor, which you can download from the University of Wisconsin's Web site. I'm going to install the 6.3.1 development release, since it supports the 2.4 Linux kernel in my Red Hat Linux 7.2 installation. I'll save this file, condor-6.3.1-linux-x86-glibc22.tar.gz, in /usr/src/ on my Linux server.
Condor offers several key features:
- Flocking technology that allows multiple Condor sites to work together, extending the distributed model to almost unlimited scalability.
- An option to use a machine's idle resources only when the keyboard and mouse are idle, so as not to take CPU power away from an active user.
- "Checkpoints" that let Condor seamlessly move distributed processes from machines that become unavailable to other machines.
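The keyboard-and-mouse behavior is governed by policy expressions in the condor_config file. Here's a minimal sketch using the machine ClassAd attributes KeyboardIdle (seconds since last keyboard/mouse activity) and LoadAvg; the threshold values are illustrative choices, not Condor's shipped defaults:

```
# Start jobs only after 15 minutes without keyboard/mouse activity
# and while the machine's load average is low
START   = KeyboardIdle > (15 * 60) && LoadAvg < 0.3
# Suspend running jobs as soon as the user comes back
SUSPEND = KeyboardIdle < 60
```

Tightening or loosening these expressions is how you trade user comfort against pool throughput.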
Remember that the more machines join the distributed network, the more powerful the network becomes. You should take careful stock of how you intend to use Condor: if your plans involve large-scale number crunching, make sure you install Condor on as many machines as possible.
For this sample installation, I’m not going to take into consideration any specific needs. The process I’ll illustrate on one test server will be essentially the same on as many machines as you need to add to the distributed system.
My server, named dhcppc4, will act as the central manager for my distributed environment. In addition, I’ll create a user and group named condor on all machines in my environment to make administration a little easier. To add the user and group condor, I run the following commands (as root) on every machine within the distributed network:
/usr/sbin/groupadd condor
/usr/sbin/useradd condor -g condor
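A quick way to confirm the account landed correctly on each machine (getent is standard on Red Hat-era systems):

```shell
# Verify that the condor group and user now exist on this machine;
# each command prints the matching database entry, or nothing if absent
getent group condor
getent passwd condor
```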
Next, I'll unpack the Condor distribution with the following command:
gunzip -dc condor-6.3.1-linux-x86-glibc22.tar.gz | tar xvf -
Now I'll switch to the distribution directory (cd /usr/src/condor-6.3.1) and run the Condor installation script with the command ./condor_install.
The installation script asks a number of questions, shown in Listing A, that you should carefully consider. For my installation, I simply accepted the defaults and named my machine pool home when prompted. Don't assume that the default options will work for every situation, because every environment is different; read the installation questions carefully or you'll probably have to rerun the installation.
I’ll start my installed Condor system by entering the /usr/local/condor/sbin/condor_master command at the server’s terminal prompt. To make sure Condor processes are running, I can use the ps command as follows:
ps -ef | grep cond
The above command results in the output shown in Listing B, which in this example shows that five Condor processes are running. Now that the Condor master is running, I need to start the system on all other machines on the distributed network. I'll start all client machines in the same way I started the master, with the /usr/local/condor/sbin/condor_master command.
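Once condor_master is running on each machine, Condor's condor_status tool will confirm that the nodes have actually joined the pool (the path assumes the default /usr/local/condor install location used above):

```shell
# List every machine advertising itself to the central manager,
# along with its architecture, OS, and current activity state
/usr/local/condor/bin/condor_status
```

New machines can take a minute or two to appear, since each node periodically reports its state to the central manager.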
Sending a job to Condor
My next step is to configure a job and send it to the Condor distributed system. I’ll use the condor_submit command, but before I run this command I must create a submit description file for the job. The submit description file contains everything Condor needs to know about the job, such as the executable to run, the initial working directory, and command-line arguments for the executable.
Let's say I need to see how well a particular Web server can take a pounding from 25 or more clients, but I can't expect 25 end users to understand how to submit the httperf command to the Web server. Condor is a perfect tool for this chore. I'll use the httperf benchmarking tool to hit a Web server at IP address 192.168.1.1 (a machine that all Condor clients can reach) with at least 25 and no more than 30 machines running on Intel architecture and the Linux operating system. The submit description file (named condor_httperf for this job) would look like:
Executable = /usr/bin/httperf
Arguments = --hog --server 192.168.1.1 --num-conns=100 --rate=10 --timeout=5
Output = /tmp/condor_output
Error = /tmp/condor_errors
machine_count = 25..30
Requirements = (Arch == "INTEL" && OpSys == "LINUX")
This particular job requires that the httperf application be installed on all available machines. With the submit description file in place, I’ll run the command condor_submit condor_httperf and the job will be submitted to Condor’s job pool.
The output from the httperf job will be saved in /tmp/condor_output, and any errors generated by the job will be saved in /tmp/condor_errors.
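While the job runs, you can keep an eye on it from the submit machine. The cluster ID 1.0 below is just an example; condor_submit prints the real cluster number when you submit:

```shell
# Show the queue of submitted jobs and their status (idle/running/held)
/usr/local/condor/bin/condor_q
# Remove a job by its cluster.process ID if it needs to be withdrawn
/usr/local/condor/bin/condor_rm 1.0
```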
That's not all
Condor is an amazingly powerful, elegant solution for distributed systems. But its sheer scope brings a complexity that can be confusing. I strongly suggest that to fully understand Condor, you comb through the online manual, making sure to focus on the sections Condor Matchmaking with ClassAds, Road-map for running jobs with Condor, Submitting a Job to Condor, and Managing Your Condor Pool. Understanding these particular sections will make managing processing assets with Condor a snap.