For the last year or so, I have been hearing a lot about Hadoop. I wasn't sure what it was or how it worked, but a lot of very smart people seemed to be using it to accomplish a good number of "big iron" computational tasks.
Around the time when I decided to learn more about Hadoop, the folks from Karmasphere reached out to me and asked if I was interested in learning about Hadoop or their Karmasphere product for developing against it. It sounded like a good idea, so I had two in-depth conversations with Martin Hall (co-founder and CEO of Karmasphere), and Abe Taha (the VP of Engineering at Karmasphere) joined us for the second conversation. I'll share what I learned about Hadoop, and explain how to develop against it.
Hadoop in a nutshell
Part of the reason why I had a difficult time getting a handle on Hadoop is that the Hadoop ecosystem is filled with buzzwords and phrases that only make sense to those who are already familiar with it. Martin really cut through the junk, and now I understand it.
So, to put it in plain English: Hadoop is a platform for performing distributed computing. That's easy enough to understand, right? There are some add-ons for things such as distributed file storage and distributed database access, but at the heart of it, Hadoop is a processing platform that partitions the work across multiple machines in a cluster.
What problems does Hadoop address?
Hadoop is currently aimed at "big data" problems (say, processing Census Bureau data). The nice thing about it is that a Hadoop cluster scales out easily, and there are a number of providers who will let you add and remove instances from a Hadoop cluster as your needs change to save you money. It is the kind of system that lends itself perfectly to cloud computing, although you could definitely have a Hadoop cluster in-house.
While the focus is on number crunching, I think that Hadoop can easily be used in any situation where a massively parallelized architecture is needed.
How Hadoop works
What makes Hadoop easy on the developer is that your code plugs into the Hadoop architecture. In some similar systems (e.g., the .NET Parallel Extensions Library), the system provides methods for your class to call, which means that you have to figure out where the system fits into your application.
With Hadoop, you typically write Java classes that implement particular interfaces for each stage of the Hadoop processing cycle. I say typically because you do not have to write Java classes if you do not want; you can also pass the data to a command line or an application written in Ruby, Python, C#, etc., so long as it can parse Hadoop's input format and produce its output to Hadoop's requirements. This gives you supreme flexibility when designing your Hadoop applications, and it is something that I consider critical.
Hadoop is based on Google's MapReduce concept. With MapReduce, the problem is split into various discrete stages. At the end of each stage, the results are collected and passed onto the next stage. The "Map" part is when the input is split up and "mapped" to functions to process it, and the "Reduce" half is when the results are aggregated and turned into output. Each stage receives input in a standardized fashion, and produces output in a key/value pair style.
It's a very factory line process, and the code in each part of the process knows nothing about the other stages or the other instances within that stage. As a result, the code is very simple to write, without any of the traditional complications of distributed computing or parallel processes, such as communications between threads or processes, data locking, and so on. It also means that there are a good number of available classes to work with already that address common needs, and code is easily reused. For example, there are standard classes for parsing CSV files in the map stage (to split the records into fields) or to provide simple value summing in the reduce stage.
Using the Karmasphere tools
While it was easy to understand Hadoop up front, when Abe and Martin showed me the workflow using the Karmasphere Studio development environment, I was very, very impressed.
With the tools, you have an extremely simple workflow: There is a tab for each processing stage, and you choose the processing parameters (like the Java class or external processing component to work with) for each stage. Now this is what really blew me away: As you make changes, a preview pane in the tab below shows you exactly what that stage's output looks like for a sample number of rows. Another neat thing was that you did not need to recompile the classes — simply saving the Java files would update the preview. This means that you get instant feedback from the tool.
Once you are happy with the results, there is a simple deployment system. You provide the information to connect to a Hadoop cluster (regardless of version, vendor, etc.), and it uploads to the server very easily. Once it is on the server, the Karmasphere Studio tool (especially in the upscale Professional edition) provides additional diagnostics and profiling. Karmasphere Studio is built on Eclipse, so you can also develop your Hadoop plugins within it as well if they are written in Java or another language supported by Eclipse.
Karmasphere is also working hard on a product aimed at SQL folks (for instance, business analysts) called Karmasphere Analyst. They briefly demoed this tool for me, and it looks like a good option for people who need to easily query data.
Between the easy-to-understand nature of Hadoop and the Karmasphere tool, I am definitely going to give this technology a shot. Longtime readers know that parallel processing and distributed computing are special interests of mine. With the Rat Catcher project, I have been looking to the future and wondering how to scale it well and hopefully move the backend processing to a non-Windows platform to save money. For my needs, I think that Hadoop is on the table as an option.
I plan to experiment with Hadoop over the next several weeks as a proof-of-concept, and I will share my experiences and show the workflow and hands-on that I went through.
J.JaDisclosure of Justin's industry affiliations: Justin James has a contract with Spiceworks to write product buying guides; he has a contract with OpenAmplify, which is owned by Hapax, to write a series of blogs, tutorials, and articles; and he has a contract with OutSystems to write articles, sample code, etc.
Justin James is the Lead Architect for Conigent.