Hadoop, once defined by MapReduce, is now much more, thanks to Spark.
Just as Hadoop was getting settled, the open-source world is already poised to kill it.
Not all of it, mind you. But MapReduce, that genius insight from Google that developer Doug Cutting turned into Hadoop, is already being retired. Without even hitting its teenage years, Hadoop is being replaced by Apache Spark, a superior data processing engine that overcomes some of Hadoop's core MapReduce limitations.
Like, for example, Hadoop's complexity.
To get more insight into the past, present, and future of big data, and Spark's role in it, I talked with Ion Stoica, co-founder of Databricks, the company behind Spark.
Making big data easy
One of the biggest problems with big data is that the technology is either insanely expensive, insanely complicated, or both.
Hadoop, being open source, is free (as in beer). But it's also complicated (as in you'll need to drink a lot of beers to dull the pain). As an O'Reilly survey found, "the field of big data has ushered in the arrival of new, complex tools that relatively few people understand or have even heard of," Hadoop being the chief antagonist in this cast of complexity.
Indeed, Hadoop's complexity is just one reason why, even as the media talks about Hadoop constantly, enterprises have been relatively slow to deploy it.
Spark, as Stoica told me, is much easier, largely because of how it interacts with other systems.
In Hadoop, if you want to support different workloads, you have to deploy different systems. You might use Impala, for example, for real-time, ad hoc queries. But you'd have to learn Mahout, a separate system, for machine learning. Learning one is hard enough. Learning several...? Not pleasant.
But wait! It gets worse, according to Stoica. If you want an application to span these different systems, you'll find it challenging, to say the least, if not impossible. And if you want to run interactive queries on streaming data through Storm, you might not be able to achieve low enough latency for it to work.
In contrast, Spark supports all of these workloads through libraries. Each library runs on the same execution engine and works with data that is shared among the other libraries.
Stoica analogizes this library-based approach to the modern smartphone. People used to have to carry separate phones, PDAs, cameras, etc. to do different tasks. But today's smartphone incorporates all this disparate functionality, allowing you to carry one smartphone to handle multiple workloads.
The end of Hadoop?
Nor do the benefits of Spark stop with its relative ease of use. As Stoica told me, Spark is much faster than MapReduce, too. Not only can Spark work with data in-memory, making queries 100x faster than MapReduce, but Spark queries on disk also run 10x faster. Plus, while MapReduce is relegated to batch-oriented applications, Spark supports streaming, interactive queries, graph processing, and machine learning.
Given these three benefits of simplicity, performance, and flexibility, why are we still talking about Hadoop at all?
Well, Hadoop consists of three layers: storage (HDFS), resource management (YARN), and computation (MapReduce). Spark plays at the third layer. Stoica feels strongly that Spark will replace MapReduce as Hadoop's default execution engine but is equally sure that it will continue to complement the rest of the Hadoop ecosystem.
It's not an either/or, in other words. Except for MapReduce.
Some assembly required
So should you learn Spark? Definitely maybe.
Spark may be easier than Hadoop to use, but that's not to say it's easy, per se. Given that you're still going to be writing distributed applications, which are inherently harder than writing an application on a single processor, some expertise is required to be proficient with Spark. This is especially true as developers start tuning the application through caching or other techniques.
But the great thing about Spark is that it will already be familiar to many people. After all, as Stoica noted, "Spark has a richness of interfaces. So, if you know SQL, you can be up and running in no time. If you know Hive, you'll feel right at home with Spark SQL."
For other things, like machine learning, "developers will call their machine learning library through a function call." Again, relatively easy.
Finally, Spark comes with interfaces for Java, Python, and Scala, making it highly accessible to those who know these languages, with support for R and more coming.
Stoica sees two primary areas in which Databricks and the Spark community will continue to foster Spark's development:
- Improving the core scalability and performance of Spark
- Extending and improving the libraries ("We want Spark to evolve into a big data platform, and a platform is only as powerful as the libraries it has.")
With one of the biggest, most active big data communities, Spark looks certain to kill MapReduce, even as it makes the larger Hadoop ecosystem more popular. This is why traditional Hadoop powerhouses like Cloudera have maintained some commitment to MapReduce as they've dramatically increased their commitment to Spark.
It's one of the great benefits of open-source development: the best code, or community, ultimately wins. Hadoop, once defined by MapReduce, can now be much, much more, with Spark an essential element of that "more" and a key to keeping it relevant for years to come.