So tell me, what’s a Hadoop when it’s at home? Some kind of dance?
Not so much. Think of it as a framework for distributed storage and computing. Because that’s what it is.
What’s a framework for distributed storage and computing when it’s at home?
Let me take you back to big data.
What’s big data?
You know all that stuff you’ve got that fits in nice relational databases?
Well, it’s that and a whole lot more. It’s that and the other stuff – the unstructured bumph, like bits and pieces posted on blogs or on social media, the data gathered from sensors, or from CCTV cameras, or log files. In short, it’s everything you collect, but don’t know what to do with.
And, as the name big data would imply, there’s a lot of it. Thanks to all these new systems and services that need monitoring and the decreasing cost of storage, businesses are retaining lots more data than they have in the past.
Hadoop is a system designed to help organisations get to grips with all that data and turn it into information they can understand and use.
So what does it actually do?
Well, previously, if you needed to crunch through a big relational database, you might have turned to a centralised platform with a load of shared storage and CPU.
Nowadays, to process a lot of unstructured data, you need a lot of compute resource. One way to get that is to use a distributed system – for example, a load of commodity servers, each with its own local storage and CPU.
That’s where Hadoop comes in, letting all that distributed commodity stuff come together to work on the same problem.
One of Hadoop’s core components, the Hadoop Distributed File System (HDFS), ensures that each piece of data is stored on more than one server – handy if part of your storage goes down, as the cluster can carry on working and no data is lost.
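That redundancy comes from block replication: HDFS chops files into blocks and copies each block to several machines. The number of copies is controlled by the `dfs.replication` property in the cluster’s `hdfs-site.xml` configuration file – a minimal sketch, with three copies (the usual default):

```
<!-- hdfs-site.xml: keep three copies of every block,
     so losing a node doesn't mean losing data -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```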
Another of its core components, the MapReduce framework, allows applications to split the processing work that needs doing into lots of smaller pieces and parcel those pieces out to the nodes in the cluster. It then collects all their answers and combines them into a single result.
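The split-process-combine idea can be sketched in miniature with plain Python. This is a toy word count illustrating the map, shuffle and reduce steps, not Hadoop’s actual (Java) API – the chunks stand in for blocks of data held on different nodes:

```python
from collections import defaultdict
from itertools import chain

# Map step: each "node" turns its chunk of text into (word, 1) pairs.
def map_words(chunk):
    return [(word, 1) for word in chunk.split()]

# Shuffle step: group the pairs by key so each word's counts end up together.
def shuffle(pairs):
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

# Reduce step: combine each word's partial counts into a single total.
def reduce_counts(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

# Pretend each string is a block of data stored on a different server.
chunks = ["big data big", "data everywhere", "big answers"]
mapped = chain.from_iterable(map_words(c) for c in chunks)
totals = reduce_counts(shuffle(mapped))
print(totals)  # {'big': 3, 'data': 2, 'everywhere': 1, 'answers': 1}
```

In a real cluster the map and reduce steps run on many machines at once, and the shuffle moves data across the network between them – but the shape of the computation is the same.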
Right, so what’s this all being used for at the moment?
The list of Hadoop users reads like a who’s who of tech’s big names: Amazon, eBay, Facebook, LinkedIn, Twitter and Yahoo all make use of Hadoop. These companies have huge volumes of data on their users that they regularly need to analyse. Think of those ‘People you may know’ or ‘People who liked X also bought Y’ features on Facebook and Amazon, for example – companies need to scour through vast logs of their users’ details and behaviour for relevant results, which is where Hadoop comes in.
Who owns Hadoop then?
Hadoop is an open-source product, so no one owns it as such. There are several different distributions, as you would expect, but the most popular – and the one that vendors such as IBM and Oracle are rolling up into their big data offerings – is Apache Hadoop.
However, the nature of the open-source beast is that various distributions of a product can appear. Yahoo, for example, made its own version of Hadoop – unimaginatively named the Yahoo Distribution of Hadoop – but canned it earlier this year in favour of putting its weight behind Apache Hadoop, and has been a major backer of the open-source project since.
Earlier this year, Yahoo spun out its Hadoop efforts to form Hortonworks, a company that works on Hadoop development as well as providing services for companies wanting to install Hadoop.
Do companies need help installing Hadoop then?
Well, one of the criticisms levelled at Hadoop is indeed that it isn’t too easy to manage and use – it’s more of a job for the technically minded than the average end user.
“Installing, configuring and administering a production-scale Hadoop cluster requires considerable system administration expertise. Interacting with Hadoop requires a detailed knowledge of programming languages,” a recent report by analyst house Gartner said.
It’s worth noting that a number of companies are working on solving the installation problem, including Dell, whose recently announced Crowbar tool automates the installation of Hadoop onto commodity servers and has been generating a bit of buzz.
Other business problems that need solving before Hadoop can see more widespread uptake, according to Gartner, are better integration with existing business intelligence tools and the development of a user interface for non-technical end users, perhaps focusing on data visualisation.
That shouldn’t put companies off deploying it, mind. Organisations wanting to jump on the Hadoop bandwagon could lose the first-mover advantage if they’re put off by technical considerations, Gartner said.
So where did Hadoop come from?
An interesting question. The inspiration for Hadoop was a couple of papers published by Google, describing its Google File System and its MapReduce processing framework.
Think about it – when it comes to big data, there are few companies gathering quite as much as Google. After all, it’s trying to index the entire web and more besides.
This inspired technologist Doug Cutting – who’d been involved in two open-source search projects, the software library Lucene and web crawler Nutch – to create Hadoop as a way of enabling these projects to take advantage of distributed computing.
Hadoop itself is named after a toy elephant owned by Cutting’s son.
ZDNet UK’s Jack Clark contributed to this report.