There's nothing so constant as change in big data land, and nowhere is this more true than in Hadoop.
It took 10 years for Hadoop's MapReduce to really hit its stride as The Way We Do Big Data. It will take fewer than five more years to completely displace it.
And while Apache Spark is clearly the short-term heir to the MapReduce throne, what's less clear is how long Spark can maintain its leadership. Even as technology becomes cheap, the people powering that technology are brutally expensive, which means the next quantum leap in big data computing may come from technology that makes developers more productive.
When Hadoop met Spark
Hadoop used to be the "it" technology within the big data stack. As Derrick Harris wrote just two years ago, over the past decade "Hadoop has gone from being the hopeful answer to Yahoo's search-engine woes to a general-purpose computing platform that's poised to be the foundation for the next generation of data-based applications."
At the heart of Hadoop is distributed storage and compute, with MapReduce dominating discussion around compute.
Or did. Until Spark came along.
Though hatched in a UC Berkeley lab only in 2009, Apache Spark is rapidly rising to displace MapReduce. A quick glance at Google Search trends shows a pronounced rise in Spark interest, even as MapReduce interest flattens.
Why? Well, largely, it's a matter of speed.
Spark can work with data in memory, running queries up to 100x faster than MapReduce, and even Spark queries on disk can run up to 10x faster than their MapReduce equivalents. For a world addicted to the volume, variety, and velocity of big data, that kind of speed boost is huge.
Small wonder, then, that even MapReduce's erstwhile allies are now practicing their eulogies.
The nasty, brutish, and short world of big data
Cloudera community lead Justin Kestelyn argues that, 10 years from now, Hadoop MapReduce will be a distant memory. Indeed, MapReduce, which has dominated big data for the past 10 years, may not last another five, he speculates, because Apache Spark has killed it, for three central reasons:
- Spark has rich, expressive, identical APIs for Scala, Java, and Python, reducing code volume over MapReduce generally by 2-5x
- Spark applications are an order of magnitude faster than those based on MapReduce
- Spark offers a unified API for both batch and stream processing (one fewer API to learn)
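To make the code-volume point concrete, here is a minimal sketch in plain Python (not actual Hadoop or Spark code) of the three phases a MapReduce-style word count has to spell out. The function names and sample data are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle step: group emitted values by key, the work a framework
    performs between the map and reduce stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Wire the three phases together in order."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical sample input, for illustration only.
sample = ["big data big change", "spark and hadoop", "big spark"]
counts = word_count(sample)
```

In Spark's Python API, the same job is typically written as a single chain of `flatMap`, `map`, and `reduceByKey` calls on an RDD, which gives a sense of where the claimed reduction in code volume comes from.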
And so, well before the poster child of Hadoop hit its teenage years, MapReduce is on its way out.
Fortunately for Hadoop, this is part of the plan.
The problem with Hadoop, or rather "opportunity," as Cloudera executive Charles Zedlewski told me in an interview, is that Hadoop is in a perpetual state of flux:
"Hadoop will always be a thing that acquires, stores, processes, analyzes, and serves data. That's been true throughout its existence and hasn't changed much at all. Essential and non-essential components get improved, upgraded, and swapped out over time, but that doesn't change Hadoop's identity. Implying otherwise confuses technical design choices with users and market. The former changes all the time, the latter doesn't."
Today, Apache Spark dominates our attention for big data computing. But not long ago, it was all MapReduce.
So, what will displace Spark?
Kill or be killed
Let's face it, Spark is doomed.
While I have no clue what will replace Spark at the heart of Hadoop compute, I have enough history with big data to know that something will... and probably soon.
IBM can still ship mainframes, because legacy enterprise IT can't quite quit the past. But big data moves at a torrid pace and, as MemSQL CMO Gary Orenstein reminded me, tends to "simplify" the Hadoop stack "by adding new" components rather than upgrading old ones.
But even this question may be off the mark. As MongoDB vice president Kelly Stirman suggests, the last big leap in data had everything to do with optimizing for expensive compute and storage, and then taking advantage of ever-cheaper resources. Now storage and compute are cheap, but the people who power them are expensive. (Just ask anyone who has tried to recruit engineers in the past few years.)
So, Stirman posits, the next quantum leap in big data is less about technology than it is about people ("the most precious resource of all").
In this view, the big data technology that may come to matter most is that which is most accessible to developers. We're already seeing some of this with the massive uptake of technologies like MongoDB and Apache Cassandra (supported by DataStax), but even Spark itself reflects this shift toward ease of use.
As Kestelyn's comments above suggest, and as Databricks co-founder Ion Stoica told me in an interview, one big reason for Spark's success is its relative ease of use. As Stoica expressed it, Spark workloads are supported by libraries, letting a user deploy the same execution engine to interact with data that is shared among the different libraries.
Given the short shelf life of MapReduce, it's hard to imagine that Spark will fare much better. And while it's hard to pick a winner among competing big data options, the most likely to succeed are those that are easiest to use.
Matt Asay is a veteran technology columnist who has written for CNET, ReadWrite, and other tech media. Asay has also held a variety of executive roles with leading mobile and big data software companies.