Hadoop is a project that I’ve been keeping track of lately. This excerpt from my previous column about Hadoop basics explains what it is in a nutshell:

Hadoop is a platform for performing distributed computing. That’s easy enough to understand, right? There are some add-ons for things such as distributed file storage and distributed database access, but at the heart of it, Hadoop is a processing platform that partitions the work across multiple machines in a cluster.

I’ve been meaning to give Hadoop a shot and do some basic tutorials with it. Here’s how to take the first step, which is to configure a development environment.

To get Hadoop, you need to download the “Core” package. The Core package includes the “Common” package (the base clustering technology), “HDFS” (the distributed file system), “MapReduce” (the data processing component), and Web front ends to it all.
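To make the MapReduce idea a little more concrete, here is a minimal sketch of the map, shuffle, and reduce steps as a word count. It is plain Python running in one process and does not use Hadoop or its Java API; the function names (map_phase, shuffle, reduce_phase, word_count) are mine, chosen for illustration, and each input chunk simply stands in for the slice of data a separate machine would handle.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk stands in for the part of the input one machine would process.
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling word_count with a list of strings returns a dictionary that maps each word to its total count. In a real cluster the map calls would run on different machines, which is the part Hadoop takes care of.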

It starts to get tricky from the moment you unpack the tarball. Hadoop is a Java application, which means that, in theory, you can run it on a Windows PC just fine. For better or for worse, most of us have Windows PCs on our desktops, but the Hadoop package assumes you are using Linux: all of the documentation is about running the various bash scripts that come in the package. While I do have a FreeBSD server that I could try this on (which would give me the bash end of things, at least), it does not and will not have Java installed on it; installing Java on FreeBSD is a huge hassle because of Java’s bizarre licensing.

After reviewing the options for running Hadoop on Windows as a development platform (Windows is not recommended for production Hadoop systems), here is what I suggest. If you have a *Nix platform with Java installed, install Hadoop there and run it according to the instructions. It’s not difficult at all: you just unpack the tarball and run bin/start-dfs.sh and bin/start-mapred.sh, which gets the daemons up and running, and from there you can connect to your local Hadoop cluster. If you are a Windows user, your best bet is not to try to install Hadoop yourself but to use Karmasphere Studio (there is a free Karmasphere Studio Community Edition).
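For the *Nix route, the two start scripts can be wrapped in a small helper that checks they exist before running them. This is only a sketch: the script names come from the paragraph above, but the helper names (daemon_commands, start_daemons) and the hadoop_home argument are my own, and I am assuming the scripts sit under the directory where you unpacked the tarball and are executable.

```python
import os
import subprocess

# Script names as given in the text, run in the order they are listed there.
START_SCRIPTS = ["bin/start-dfs.sh", "bin/start-mapred.sh"]


def daemon_commands(hadoop_home):
    """Build the full path of each start script under the unpacked directory."""
    return [os.path.join(hadoop_home, *script.split("/")) for script in START_SCRIPTS]


def start_daemons(hadoop_home):
    """Run each start script in turn, failing early if one is missing."""
    commands = daemon_commands(hadoop_home)
    missing = [command for command in commands if not os.path.isfile(command)]
    if missing:
        raise FileNotFoundError("Start script not found: " + ", ".join(missing))
    for command in commands:
        # check=True raises CalledProcessError if a script exits with an error.
        subprocess.run([command], cwd=hadoop_home, check=True)
```

You would call start_daemons with the path to your unpacked Hadoop directory. Running the two scripts by hand from a shell does exactly the same thing; the wrapper only adds the missing-file check.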

Warning: The Karmasphere Studio installation is not trivial, but it is a lot easier than installing Hadoop without it. You will need to install the prerequisites if you do not have them already: Eclipse 3.6 (Helios) and Java 1.6 update 16 (or better). I don’t know enough about the Java ecosystem to make well-informed decisions about the minimum tech needed, so I got the big Java package (Java EE SDK) and the big Eclipse package (Eclipse IDE for Java EE Developers). The install process went something like this:

  1. Install the Java EE SDK.
  2. Find the bin directory from the Java install and add it to your PATH environment variable. To do this, right-click Computer in the Windows Start menu, choose Properties, click Advanced System Settings, go to the Advanced tab, and click Environment Variables.
  3. Unzip the Eclipse ZIP file and put it in the appropriate Program Files directory.
  4. Right-drag the eclipse.exe file to your Start menu to create a shortcut to it.
  5. Edit the eclipse.ini file as per the Karmasphere instructions.
  6. Register for Karmasphere Studio Community Edition and then follow the rest of the installation steps.
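Before moving on, it can be worth confirming that the PATH change from step 2 took effect and that the Java it finds meets the Java 1.6 update 16 prerequisite. The sketch below is one way to do that in Python; the function names are mine, and the parsing assumes the usual banner format that `java -version` prints (for example, java version "1.6.0_16"), which may differ on some Java distributions.

```python
import re
import shutil
import subprocess

MINIMUM = (1, 6, 0, 16)  # Java 1.6 update 16, the prerequisite named above


def parse_java_version(banner):
    """Extract a comparable tuple from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?(?:\.(\d+))?(?:_(\d+))?', banner)
    if match is None:
        return None
    # Missing parts (such as the update number) are treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(banner, minimum=MINIMUM):
    """Return True if the banner reports a version at or above the minimum."""
    version = parse_java_version(banner)
    return version is not None and version >= minimum


def check_java_on_path():
    """Return the banner of the `java` found on PATH, or None if there is none."""
    if shutil.which("java") is None:
        return None
    # `java -version` writes its banner to stderr rather than stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return result.stderr or result.stdout
```

If check_java_on_path() returns None, the bin directory is not on PATH yet (on Windows, open a new command prompt after changing the variable). Otherwise, pass the returned text to meets_minimum to see whether the installed version is recent enough.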

Once I was able to use Karmasphere Studio, I started Eclipse and followed Karmasphere’s tutorial for local development.

Compared to all of the other information I saw out there about getting Hadoop installed for local use, Karmasphere Studio was definitely the easy route. Kudos to Karmasphere for packaging it up nicely and making Hadoop accessible to Windows developers.