Does open source matter to Hadoop?

Find out why Matt Asay doesn't think it matters if Hadoop is open source, at least not as much as we may think.



Hadoop is an open-source software framework for storage and large-scale processing of datasets on clusters of commodity hardware. It has become an increasingly essential tool for data scientists looking to crack complex questions (e.g., “Will this person click on this funny cat ad?”), and a slew of companies has embraced it as their own. Most of these companies treat Hadoop as a cheap complement to their proprietary products rather than contributing meaningfully to its development.

Is this wrong? More importantly, is it effective?

A new open-source holy war

After all, Hadoop is an Apache Software Foundation project, carrying a license that essentially says, “Do whatever you want with this software, but don’t blame me if it doesn’t work.” There is no requirement -- moral or otherwise -- that developers contribute back.

Gartner analyst Merv Adrian captures this nicely:

"Having some components of your solution stack provided by the open source community is a fact of life and a benefit for all. So are roads, but nobody accuses Fedex or your pizza delivery guy of being evil for using them without contributing some asphalt. Commercial entities (including software and IT services providers) provide needed products and services, employ people and pay taxes. We might want them to do more charitable work or make more open source contributions, and some do, but they are not morally obligated to do so."

True enough. But as Red Hat has long held, code is currency in open source. She who contributes the most code to a given project has the most influence on that project and is best able to steer it in a way that's advantageous to her customers. This thought was echoed by Hortonworks executive Shaun Connolly in the comment section of Adrian’s post:

"It is difficult to drive a real enterprise-focused roadmap or fix/patch major issues if you don’t have engineers working to make that happen within the community projects. And if you’re doing your work off to the side of the community, then there’s no clear path for those changes to work their way into the upstream community efforts."

Three years ago, in keeping with Connolly’s thinking, the Hadoop market was mostly concerned with who contributed most to its development. Today, that’s still an issue, but given Hadoop’s complexity, more attention is being paid to those who contribute most to making it usable by mainstream enterprises.

Contributing convenience

According to a new KPMG survey, 96% of CIOs and CFOs surveyed say that they could do a better job deriving value from data through analytics, and 56% say at least some of the resulting benefits "left on the table" could be significant. These C-level executives perhaps should care about a vendor’s ability to get code into the Hadoop kernel -- but arguably, they don’t. Nearly 50% of attendees at a recent Gartner webinar cited Hadoop's lack of a clear value proposition as its biggest barrier to adoption.

In other words, they just want someone to make sense of Hadoop.

Cloudera, for its part, has been pitching its “enterprise data hub” strategy as a way to make Hadoop consumable by mainstream enterprises. While Hortonworks has stuck to its strategy of ensuring that all innovations around Hadoop are open source, Matt Brandwein, director of Product Marketing at Cloudera, notes that Cloudera is “building out CDH [Cloudera’s Hadoop distribution] -- the open source foundation of our enterprise data hub platform -- along with the management tools, certifications, partner integrations, and support that our customers require to deploy Hadoop for real production use cases.”

By some measures, Cloudera’s strategy has been more successful. Based on general interest (measured by Google search traffic) or jobs (measured by job postings), Cloudera is in the lead.

And yet, over the past year, I’ve heard from many sources that Hortonworks has been on a tear, winning new customer accounts and growing revenues at a torrid pace.

Even so, it’s very possible that neither will win.

The return of the incumbents

Cloudera and Hortonworks aren’t the only two Hadoop vendors in the market. And according to some recent survey data from Gartner, they may not be the vendors that enterprises prefer when looking to leverage Hadoop. Instead, a majority of respondents (Figure A) want their tried-and-true BI vendors to deliver Hadoop value.

Figure A


Results of a recent Gartner survey.


This isn't heartening to the pure-play Hadoop vendors, but it’s not surprising. Such customers don’t really care about open-source bragging rights. They just want Hadoop tied into their existing data infrastructure.

As Scott Gnau, president of Teradata Labs, opines:

"[Hadoop is] not so interesting that it’s open source. What’s really interesting is that it’s a way to store data without making any change to the data, and store it in a detailed fashion and process it in a massively parallel way."

But even this innovation won’t be interesting unless Hadoop vendors can close “the gap between the analysts and the data,” as Gary Nakamura, CEO of Concurrent, puts it. Nakamura goes on to argue, “The way to address this [gap] is to hide the complexity of Hadoop so that analysts can get work done without having to become Hadoop experts.”

Does it matter if this is open source? Not as much as we may think. The first priority is to get something that works and makes life easier for mainstream analysts. Only once this core problem is solved will anyone care about how open the software is.