With Forrester projecting that "100% of all large enterprises will adopt [Hadoop and related technologies such as Spark] for big data analytics within the next two years," the chances are pretty high that your enterprise is in the midst of a decision, or has already made it: Which Hadoop vendor do I pick? Though this will change over time, "currently there is no absolute winner in the market," Forrester pointed out, and it's easy to get confused trying to parse the differences between the stacks.
The Hadoop vendors themselves, however, give us clues as to who they think is winning, as Ovum analyst Tony Baer highlighted. All you have to do is look at who they position themselves against in their marketing literature.
Who hates whom?
Picking apart a variety of "objective" benchmarking studies, Baer dismissed them as "self-serving exercises that vendors typically stack in their own favor." He's right, and he's also correct to suggest that, though the data in these studies is not to be relied upon, there is "metadata" in them that tells us much:
[L]ooking at the benchmarking press releases, you get a sense of who's afraid of whom. For Cloudera, it's Amazon. Competitive benchmarks pitted Impala 2.6, Cloudera's SQL-on-Hadoop MPP engine, against Amazon Redshift columnar analytic database....For its part, Hortonworks just released results this week aiming to (not surprisingly) one-up Cloudera...[though] Hortonworks has been playing catch-up.
These are the frontrunners, according to Forrester's methodology. Judging from Hortonworks' marketing literature, the real frontrunner between the two pure-play Hadoop vendors must be Cloudera, with Forrester acknowledging that "Cloudera's scope and pace of innovation is astounding," while "Hortonworks is a rock when it comes to its promise to offer a 100% open source distribution."
Longer term, however, it is Amazon that looms large as the biggest, most significant competitor.
AWS eats the world
Cloudera may be putting the most near-term pressure on Hortonworks, but the 800,000-pound gorilla that promises to upend the entire enterprise software market is Amazon Web Services. Amazon, after all, completely changes the way software is delivered and consumed, and is particularly appealing to data scientists who are trying to pick apart their data to uncover insights.
AWS product strategy chief Matt Wood said as much to me in an earlier interview, wherein he stressed the importance of building analytics projects on elastic infrastructure:
Those that go out and buy expensive infrastructure find that the problem scope and domain shift really quickly. By the time they get around to answering the original question, the business has moved on. You need an environment that is flexible and allows you to quickly respond to changing big data requirements. Your resource mix is continually evolving—if you buy infrastructure, it's almost immediately irrelevant to your business because it's frozen in time. It's solving a problem you may not have or care about any more.
But that's hardware, right? Well, no. Yes, Wood's dictum applies to hardware infrastructure, but it also applies to the software components that give life to that hardware. In addition, services like Amazon's Elastic MapReduce have significantly reduced the complexity inherent in running a Hadoop stack, making Hadoop (and its sister projects) more approachable. With more data living in the cloud, it will become ever more sensible to also run analytics there.
Two years ago Amazon CTO Werner Vogels outlined Amazon's long-term product strategy: "We're in the business of pain management for enterprises. Tell me what your pain points are and I'll help you make them feel better."
This broad view means that AWS will continue to expand its service offerings in data analytics, given just how much pain enterprises feel there. The Hadoop vendors, then, need to follow Cloudera's lead and look to the real threat to their businesses: Amazon and its cloud.
Matt Asay is a veteran technology columnist who has written for CNET, ReadWrite, and other tech media. Asay has also held a variety of executive roles with leading mobile and big data software companies.