Hadoop: Cheat Sheet

An elephant-themed, open-source way to tackle big data...

So tell me, what's a Hadoop when it's at home? Some kind of dance?
Not so much. Think of it as a framework for distributed computing and storage, built around its own distributed file system. Because that's what it is.

What's a framework for distributed computing and storage when it's at home?
Let me take you back to big data.

What's big data?
You know all that stuff you've got that fits in nice relational databases?

Well, it's that and a whole lot more. It's that and the other stuff - the unstructured bumph, like bits and pieces posted on blogs or on social media, the data gathered from sensors, or from CCTV cameras, or log files. In short, it's everything you collect, but don't know what to do with.

[Image: Footage gathered from CCTV cameras is one example of unstructured data. Photo: Shutterstock]

And, as the name big data would imply, there's a lot of it. Thanks to all these new systems and services that need monitoring and the decreasing cost of storage, businesses are retaining lots more data than they have in the past.

Hadoop is a system designed to help organisations get to grips with all that data and turn it into information they can understand and use.

So what does it actually do?
Well, previously, if you needed to crunch the data in a relational database, you might have turned to a centralised platform with a load of shared storage and CPU.

Nowadays, to process a lot of unstructured data, you need a lot of compute resource. One way to get that is to use a distributed system - for example, a load of commodity servers, each with its own local storage and CPU.

That's where Hadoop comes in, letting all that distributed commodity stuff come together to work on the same problem.

One key Hadoop component, the Hadoop Distributed File System (HDFS), ensures that each piece of data is stored on more than one server - handy if one part of your storage goes down, as the cluster can continue to work and no data is lost so long as another copy survives elsewhere.
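To make the replication idea a bit more concrete, here is a toy sketch in Python. It is not the HDFS API and not HDFS's real placement policy - the function names, node names and block names are all made up for illustration. It simply copies each block onto several servers (three here, which is HDFS's default replication factor) and checks what is still readable after a failure.

```python
# Toy model of HDFS-style replication. Illustrative only: these helpers,
# node names and block names are hypothetical, not part of Hadoop.

def place_blocks(blocks, servers, replication=3):
    """Assign each block to `replication` distinct servers, round-robin."""
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    for i, block in enumerate(blocks):
        # Start at a different server for each block, then take the next
        # `replication` servers in order, wrapping around the list.
        placement[block] = [
            servers[(i + k) % len(servers)] for k in range(replication)
        ]
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one copy on a live server."""
    return [
        block
        for block, holders in placement.items()
        if any(server not in failed for server in holders)
    ]


servers = ["node1", "node2", "node3", "node4"]
placement = place_blocks(["blk_a", "blk_b", "blk_c"], servers, replication=3)

# Lose one server: every block still has live copies elsewhere.
survivors = readable_blocks(placement, failed={"node2"})

# Lose every server holding blk_a: only then does that block become unreadable.
after_big_outage = readable_blocks(placement, failed={"node1", "node2", "node3"})
```

In this sketch a single failed node costs nothing, whereas a block only disappears once every server holding a copy of it has gone down.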

Another of its core components, the framework MapReduce, allows applications to split up the processing work that needs doing into lots of different bits and parcel those bits out to all the nodes in the cluster. It then collects up all their answers and combines them back into a single answer.
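The split-up-and-combine pattern described above can be sketched in a few lines of plain Python. This is a single-machine illustration of the MapReduce idea using the classic word-count example, not Hadoop's own API - the function names and sample text are assumptions made for the sake of the sketch, and in a real cluster each chunk would be handed to a different node.

```python
# Single-process sketch of the MapReduce pattern (word count).
# Illustrative only: these function names are hypothetical, not Hadoop's API.
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, so each word's counts sit together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single answer."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    mapped = []
    for chunk in chunks:  # on a cluster, each chunk would be mapped on its own node
        mapped.extend(map_phase(chunk))
    return reduce_phase(shuffle(mapped))


chunks = ["big data big cluster", "data on the cluster", "big elephant"]
counts = word_count(chunks)
```

The point of the pattern is that the map step needs to see only its own chunk, which is what lets the work be parcelled out across lots of machines before the results are combined.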

Right, so what's this all being used for at the moment?
The list of Hadoop users reads like a who's who of tech's big names: Amazon, eBay, Facebook, LinkedIn, Twitter and Yahoo all make use of Hadoop. These companies have huge volumes of data on their users that they regularly need to analyse. Think of those 'People you may know' or 'People who liked X also bought Y' features on Facebook and Amazon, for example - companies need to scour through vast logs of their users' details and behaviour for relevant results, which is where Hadoop comes in.

Who owns Hadoop then?
Hadoop is an open-source product, so no one owns it as such. There are several different distributions, as you would expect, but the most popular - and the one that vendors such as IBM and Oracle are rolling up into their big data offerings - is Apache Hadoop.

However, the nature of the open-source beast is that various distributions of a product can appear. Yahoo, for example, made its own version of Hadoop - unimaginatively named the Yahoo Distribution of Hadoop - but canned it earlier this year in favour of putting its weight behind Apache Hadoop, and has been a...