The meteoric rise of Spark and the evolution of Hadoop

This article is courtesy of TechRepublic Premium.

Spark is a massive upgrade on Hadoop's MapReduce, but can you bet your company's future on it?

Hadoop is dead. Long live Hadoop.

Though Gartner characterizes Hadoop adoption as "fairly anemic," investment in Hadoop remains robust, both in terms of customer dollars and developer code.

That is, if we define Hadoop in terms of its overarching ecosystem and not in terms of the once essential MapReduce component. Because once we peel away the MapReduce veneer, it becomes clear that Hadoop, the ecosystem that "keeps experimenting," as Cloudera co-founder Mike Olson describes it, is booming. Particularly if we focus on Spark, the most active Apache Software Foundation project ever.

And yet Spark, despite its apparently unstoppable rise, won't be the end of Hadoop's evolution. Indeed, "it is incredibly early for Spark," says MongoDB vice president Kelly Stirman, with much to do to make it easier for enterprises to consume.

Given such constant innovation in Hadoop land, when can enterprises safely jump into Hadoop?

Yesterday's big data

In some ways Hadoop is a relic of a bygone age, with "bygone" being another way of saying "five years ago."

Until recently, when we talked about "big data" we mostly referred to big as in volume. In such a world, Hadoop's batch-oriented MapReduce made a lot of sense. At that time, as Cloudera's Justin Kestelyn reminded me in an interview, "the terms 'MapReduce' and 'Hadoop' were interchangeable because Hadoop was just a kernel [composed of HDFS and MapReduce]."

Today Hadoop is much more, comprising an ecosystem of 25 to 30 components, including projects for data processing (MapReduce, Spark), storage (HDFS, Parquet), scheduling (Oozie), ingestion (Kafka, Flume), and resource management (YARN, Mesos), among others.

I suspect that one of the reasons Gartner uncovered "fairly anemic" interest in deploying Hadoop comes down to how enterprises define it. More than 50% of organizations surveyed have no plans to run Hadoop.

But do they mean MapReduce or do they mean Spark? Or both? Or neither?

[Figure: image1.png. Source: Gartner]

When asked about big data adoption, 69% of enterprises said they're working toward deployment. Asked about Hadoop, however, only 46% admitted to planning or deploying. Given that Hadoop and its various offspring play a central role in big data, I suspect we have a definitional problem with "Hadoop" more than an interest problem.

So is Hadoop (think: MapReduce) a bust?

Not at all. Shaun Connolly, Hortonworks vice president of Corporate Strategy, told me:

"A modern data architecture enables businesses to analyze all available data for rich historical insights, to analyze real-time streams of data for immediate actionable insights, and to blend both for closed-loop predictive analytic applications. This requires being able to deal with the complete lifecycle of data-in-motion and data-at-rest created by Internet of Anything data and traditional data sources."

In other words, as much as MapReduce is useful for those "historical insights," Spark (or Storm or other real-time processing engines) is useful for real-time data streaming. As such, Stirman says, "MapReduce isn't going anywhere, and it is still the best option for some use cases."

Based on current interest in Spark, however, one could be forgiven for believing the whole world of big data had gone real-time and was centered on Spark.

Spark becomes a wildfire

Some of the excitement over Spark stems from the disappointment in MapReduce. As Stirman notes, "For many people, Hadoop never lived up to all the hype, and the anticipation is that Spark brings people closer to what they hoped for."

That "hope" translates into frenetic activity of various kinds, including code contributions:

[Figure: image2.png. Source: O'Reilly Radar]

O'Reilly's Ben Lorica posted this data in mid-2014, but if anything, Spark's lead in activity has only grown since then. A quick glance at job postings tells much the same story: Spark has already outpaced Hadoop in terms of the absolute number of jobs requiring that skill.

The question is why: Why has Spark so quickly replaced MapReduce in our affections, to the point that Cloudera, the company that employs Hadoop's creator, is now replacing MapReduce with Spark in its big data platform?

For starters, as DataStax CTO Jonathan Ellis told me, "Spark is faster, easier to use, and more flexible than MapReduce."

On this last point, Connolly says, "Spark is on the rise because it's useful and embeddable with a range of technologies."

Earlier this year, I spoke with Databricks co-founder Ion Stoica. He explained that in Hadoop, if you want to support different workloads, you have to deploy different systems. You might use Impala, for example, for real-time ad hoc queries. But you'd have to learn Mahout, a separate system, for machine learning. Learning one is hard enough. Learning several...? Not pleasant.

But it gets even worse, Stoica says. For example, if you want an application to span these different systems, you'll find it challenging, if not impossible. Also, if you want to run interactive queries on streaming data through Storm, you might not be able to achieve low enough latency for it to work.

In contrast, he says, all workloads with Spark are supported by libraries. You use the same execution engine to interact with data that is shared among the different libraries. This means life is much easier with Spark.

Still, the question looms: For whom? That is, for whom is Spark easier than MapReduce?

Developers love Spark

As Kestelyn styles it, Spark adoption is all about pleasing developers: "For developers, working with Spark is simply much, much more productive than working with MapReduce," he told me, "and that advantage is translating into adoption."

Patrick McFadin, DataStax's chief Cassandra evangelist—and a developer himself—concurs:

"I've run large Hadoop deployments and have written enough MapReduce to say I didn't want to do it again. Spark is a next generation cluster computing framework that has the benefit of hindsight after MapReduce was released in Hadoop. Writing useful analytics with only a map and reduce command is a challenge and time consuming. Not only is the job writing slow, the framework requires a lot of servers to be performant. Writing similar Spark jobs is amazingly easier and reduces the time needed from development to execution."

Kestelyn says it certainly helps that "Spark is also an order of magnitude more performant than MapReduce," but ultimately, "Spark's developer API is the real key to its steadily increasing popularity."
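To make the API point concrete, here is a minimal sketch in plain Python, not actual Hadoop or Spark code. It computes the same small analytic (words that appear more than once) two ways: first as two back-to-back jobs that each offer only a map step and a reduce step, then as a single chain of operators. The `map_reduce` helper and the `Dataset` class are hypothetical stand-ins written for this illustration; the operator names are only loosely modeled on the style of Spark's API.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """One job: map records to (key, value) pairs, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in groups.items()]


# Job 1: count every word.
def count_mapper(line):
    return [(word, 1) for word in line.split()]


def count_reducer(word, ones):
    return (word, sum(ones))


# Job 2: a second full job just to drop words seen only once.
def filter_mapper(pair):
    word, count = pair
    return [(word, count)] if count > 1 else []


def identity_reducer(word, counts):
    return (word, counts[0])


def frequent_words_two_jobs(lines):
    counts = map_reduce(lines, count_mapper, count_reducer)
    return dict(map_reduce(counts, filter_mapper, identity_reducer))


class Dataset:
    """Hypothetical toy collection with chainable operators (illustration only)."""

    def __init__(self, items):
        self.items = list(items)

    def flat_map(self, func):
        return Dataset(out for item in self.items for out in func(item))

    def map(self, func):
        return Dataset(func(item) for item in self.items)

    def filter(self, predicate):
        return Dataset(item for item in self.items if predicate(item))

    def reduce_by_key(self, func):
        merged = {}
        for key, value in self.items:
            merged[key] = func(merged[key], value) if key in merged else value
        return Dataset(merged.items())

    def collect(self):
        return self.items


def frequent_words_chained(lines):
    return dict(
        Dataset(lines)
        .flat_map(str.split)
        .map(lambda word: (word, 1))
        .reduce_by_key(lambda a, b: a + b)
        .filter(lambda pair: pair[1] > 1)
        .collect()
    )
```

Both functions return the same dictionary for the same input. The difference is in how much scaffolding the developer writes: the two-job version needs four named functions and an intermediate hand-off, while the chained version reads as one expression.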

Ellis elaborates on this comparative ease:

"From a technical standpoint, the main win for Spark is that it's an optimistic framework instead of pessimistic. With MapReduce, every result in your pipeline is written to distributed storage, then read back off disk by the next stage. This means that if you have a failure part-way through, you don't need to recompute those intermediate steps and you can resume the calculation where it left off.

"Spark instead records just the instructions needed to rebuild a pipeline from its inputs. If a failure does happen, it needs to start over from the beginning, but since failure mid-pipeline is relatively rare, it comes out way ahead on average."
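The trade-off Ellis describes can be sketched as a toy simulation in plain Python. This is a simplified model that follows his description, not how either engine is actually implemented: the function names, the in-memory `store` dictionary standing in for distributed storage, and the single injected failure are all assumptions made for illustration.

```python
def run_checkpointed(inputs, stages, fail_once_at=None):
    """Pessimistic style: persist every stage's result, resume after a failure."""
    store = {}  # stands in for distributed storage (assumption)
    stage_runs = 0
    storage_writes = 0
    failed = False
    current = inputs
    i = 0
    while i < len(stages):
        if i == fail_once_at and not failed:
            failed = True
            # Resume from the last persisted result; nothing is recomputed.
            current = store[i - 1] if i > 0 else inputs
            continue
        current = stages[i](current)
        stage_runs += 1
        store[i] = current
        storage_writes += 1
        i += 1
    return current, stage_runs, storage_writes


def run_lineage(inputs, stages, fail_once_at=None):
    """Optimistic style: keep only the recipe (the stages), restart on failure."""
    stage_runs = 0
    failed = False
    while True:
        current = inputs
        restarted = False
        for i, stage in enumerate(stages):
            if i == fail_once_at and not failed:
                failed = True
                restarted = True
                break  # start over from the original inputs
            current = stage(current)
            stage_runs += 1
        if not restarted:
            return current, stage_runs, 0  # no intermediate writes at all


# A three-stage pipeline used only for demonstration.
STAGES = [
    lambda xs: [x * 2 for x in xs],
    lambda xs: [x + 1 for x in xs],
    lambda xs: [sum(xs)],
]
```

With no failure, both approaches run three stages, but the checkpointed run pays for three storage writes and the lineage run pays for none. With one failure injected at the last stage, the lineage run executes five stages in total instead of three. That is the bet: pay nothing in the common case and recompute in the rare one.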

Such advances put Spark well ahead of MapReduce in the hearts of developers, as well as the enterprises that employ them. But again, this doesn't mean that MapReduce has been relegated to the dustbin of history. And it certainly doesn't mean that Hadoop is doomed because, as mentioned, Spark is an essential component of Hadoop, not a replacement thereof.

Kestelyn is right that "Spark certainly supplants MapReduce, but it does not supplant Hadoop as a whole." However, McFadin is equally correct when he says that "Spark doesn't need Hadoop to be successful but the future of Hadoop depends on Spark."

All Hadoop's chips in the Spark basket

So what does the future of big data look like? All Spark, all the time?

No, not really.

First of all, there are already direct competitors to Spark. One, for example, is Apache Flink. (A good comparison of Spark and Flink can be found here.) As McFadin told me, "Both [Spark and Flink] are in the 'We make Hadoop suck less' camp." While that is a nice start, he believes any successor to Spark will "get rid of all the Hadoop ecosystem, including storage."

In other words, a Monty Python-esque, "And now for something completely different" moment.

Stirman agrees in part, but argues that instead of seeing a replacement for Spark, we will see "a new option that will emerge that relegates Spark to a more narrow, specialized domain of use, just as Spark has done to MapReduce."

One reason this might happen is that "Spark is trying to do a lot," according to Ellis. "There's the basic compute engine, then there's SQL and streaming and machine learning and graph. I'm not sure all of these can be best of breed."

And as Stirman points out, "There is still plenty of opportunity for improvement of Spark in terms of usability, reliability, scalability and security." There's also ample room to push Spark into a more narrowly defined role. And rather than improve Spark, the big data community tends to prefer starting from scratch and building something new.

A world beyond Spark

Despite our current fetish for all-things-Spark, and companies like Cloudera doubling down on it, it's important to remember that there's a wider universe of big data, as Stirman details:

"Let's not overlook that Hadoop and Spark remain focused on analytical use cases. There are other technologies such as MongoDB, Postgres, Cassandra, and Redis that are focused on running the operational applications that power the business, where data is born. These databases are absolutely complementary to Hadoop and Spark, much in the same way relational databases have been to data warehouses in previous generations of data architectures."

Even within the broader Hadoop ecosystem, Connolly says, it's worth keeping an open mind—and open platform—for building big data applications:

"There will always be a new technology that comes onto the scene. This is why we make such a big deal of YARN in the Hortonworks Data Platform (HDP). YARN provides that Data Operating System for Hadoop that can enable any new data processing technology to plug in and benefit from a centralized architecture for resource management, security, operations, etc. YARN also enables partner solutions like Pivotal HAWQ, HP Vertica, and SAS LASR to plug in as first-class tenants of HDP."

Among all this waxing and waning of different technologies, there is one constant: innovation. But that innovation isn't purely about making technology do bigger and better things. It's also about making big data more approachable.

Stirman explains:

"As for the future, vendors will move up the stack, lower the barriers to entry, simplify, and become more solution focused and service oriented. Buyers of most technologies value convenience and predictability. Look for new offerings that are specialized, aimed at specific workflows and their users, GUI, and metered billing. And look for more offerings that are free in exchange for access to your data."

Ultimately, Kestelyn says, "Nothing is sacred in the Hadoop ecosystem." Not its initial poster child, MapReduce, and not Spark. The Hadoop ecosystem is completely focused on pragmatism, and will continue to innovate and replace innovations as necessary to solve enterprise big data needs.

How soon is now?

So when should your enterprise get started with big data and, specifically, Spark? The answer is "now."

There will be no point at which the Hadoop ecosystem will stop, pause on innovation, and wait for everyone to catch up. And for the foreseeable future, it's simply not going to be "consumerized."

"Hadoop and Spark are enabling technologies—they are raw, powerful, but very low level," Stirman says. This means that "For organizations that are engineering-led with the necessary budget and talent to build and manage large, complex infrastructure, it can be advantageous to build applications with these technologies."

It also means that organizations that are less centered on diving deep into Hadoop may struggle. But given what's at stake—the future of your data-driven business—waiting is not an acceptable response.
