Big Data

Setting up a Hadoop development environment

Windows developers can use Karmasphere Studio to get the Hadoop platform installed for local use. Justin James offers tips on the Karmasphere Studio installation.

Hadoop is a project that I've been keeping track of lately. This excerpt from my previous column about Hadoop basics explains what it is in a nutshell:

Hadoop is a platform for performing distributed computing. That's easy enough to understand, right? There are some add-ons for things such as distributed file storage and distributed database access, but at the heart of it, Hadoop is a processing platform that partitions the work across multiple machines in a cluster.

I've been meaning to give Hadoop a shot and do some basic tutorials with it. Here's how to take the first step, which is to configure a development environment.

You need the "Core" package to download Hadoop. The Core package includes the "Common" package (the base clustering technology), "HDFS" (the distributed file system), MapReduce (the data processing component), and Web front ends to it all.

It starts to get tricky from the moment you unpack the tarball. Hadoop is a Java application, which means that, in theory, you can run it on a Windows PC just fine. For better or for worse, most of us have Windows PCs on our desktops, but the Hadoop package assumes you are using Linux. All of the documentation is about running various bash scripts that come in the package. While I do have a FreeBSD server that I could try this on (which would give me the bash end of things, at least), it does not and will not have Java installed on it -- installing Java on FreeBSD is a huge hassle due to Java's bizarre licensing.

After reviewing some choices for getting Hadoop to run on Windows as a development platform (it's not recommended as a production system on Windows), here is my suggestion. If you're on a *Nix platform that has Java installed, install Hadoop there and run it according to the instructions. It's not difficult at all: you just unpack the tarball and run bin/start-dfs.sh and bin/start-mapred.sh, which will get the daemons up and running, and from there you can connect to your local Hadoop cluster. If you're a Windows user, your best bet is not to try to install Hadoop yourself but to use Karmasphere Studio (there is a free Karmasphere Studio Community Edition).
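For what it's worth, here is roughly what a job driver looks like once those daemons are running, reusing the word-count classes sketched above. The localhost addresses are assumptions based on the usual single-node quick-start configuration; in practice they come from whatever your conf/core-site.xml and conf/mapred-site.xml say.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the local daemons. Use whatever addresses your
        // core-site.xml and mapred-site.xml actually declare; these are the
        // ports most single-node quick-start configurations use.
        conf.set("fs.default.name", "hdfs://localhost:9000");
        conf.set("mapred.job.tracker", "localhost:9001");

        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In this single-node setup "the cluster" is just localhost, but the driver code does not change when you later point it at a real cluster.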

Warning: The Karmasphere Studio installation is not trivial, but it is a lot easier than installing Hadoop without it. You will need to install the prerequisites if you do not have them already: Eclipse 3.6 (Helios) and Java 1.6 update 16 (or better). I don't know enough about the Java ecosystem to make well-informed decisions about the minimum tech needed, so I got the big Java package (Java EE SDK) and the big Eclipse package (Eclipse IDE for Java EE Developers). The install process went something like this:

  1. Install the Java EE SDK.
  2. Find the bin directory from the Java install and add it to the PATH environment variable. To do this, right-click Computer in the Windows Start menu, choose Properties, click Advanced System Settings, go to the Advanced tab, and click Environment Variables.
  3. Unzip/tar the Eclipse ZIP file and put it in the appropriate Program Files directory.
  4. Right-drag the eclipse.exe file to your Start menu to create a shortcut to it.
  5. Edit the eclipse.ini file as per the Karmasphere instructions.
  6. Register for Karmasphere Studio Community Edition and then follow the rest of the installation steps.

Once Karmasphere Studio was installed, I started Eclipse and followed Karmasphere's tutorial for local development.

Compared to all of the other information I saw out there about getting Hadoop installed for local use, Karmasphere Studio was definitely the easy route. Kudos to Karmasphere for packaging it up nicely and making Hadoop accessible to Windows developers.

J.Ja

About

Justin James is the Lead Architect for Conigent.

2 comments
daboochmeister

I haven't done it myself, but from talking to others who have: if you install Cygwin and make sure the Cygwin OpenSSH package is installed, then you just unpack Hadoop, adjust one setup file, and you're all set. That wasn't your experience? I don't know if that adjusts the environment for use with Karmasphere ... maybe that was your end goal ... but certainly you have a usable Hadoop development environment at that point. Worth pointing out that use of Hadoop on Windows is supported, but only as a development environment, not (as you say) a production (execution) one.

Justin James

I know that Cygwin is a big part of it, but the directions I saw made it seem like there was a lot more to it than that. There's something about Cygwin that I've always distrusted. Maybe it's because a long time ago, I did some stuff with Cygwin and it burned me. At the same time, going the Karmasphere Studio route gives me a LOT more than just a place to deploy Hadoop jobs against locally; it gives me a full development system, which is a big benefit. :) J.Ja