Remember when Spark was just a speedy replacement for MapReduce? When only uber-geeks cared about it or could use it (or any of the other colorfully named big data projects)?
Those days are gone.
Sure, not for everyone. But according to a new Databricks survey covering 842 different enterprises, Spark is outgrowing its Hadoop roots, even as it finds fertile soil beyond propellerhead data engineers.
Goodbye, cruel Hadoop
The big data ecosystem has been on an innovation tear, and nowhere has that been more true than in the Hadoop ecosystem. In what Hadoop creator Doug Cutting has dubbed a "Cambrian explosion," no project has been sacrosanct.
Including Hadoop. Well, not Hadoop, exactly. Hadoop, sometimes defined by its original core (MapReduce for data processing and HDFS for data storage), has long been much, much bigger than its original core.
As such, swapping out MapReduce for Spark isn't a big deal. It's just a natural evolution for the Hadoop community.
Or, as Cloudera executive Charles Zedlewski told me in an interview, "Essential and non-essential components get improved, upgraded, and swapped out over time, but that doesn't change Hadoop's identity."
But that's not exactly what's happening, according to the Databricks survey. No, something more fundamental is happening, with a rising number of companies looking to use Spark outside Hadoop (using YARN deployments as the measure of Spark running on Hadoop).
As big a shift as this is, there's an even more important trend in Spark land....
Big data made simple(r)
One of Spark's advantages over MapReduce has always been simplicity. As Ion Stoica, a Databricks executive, analogized in a TechRepublic interview, Spark's simplicity is similar to that of the modern smartphone. While we used to carry different devices (cameras, PDAs, phones), the modern smartphone incorporates all this functionality.
One smartphone, many workloads.
Between its natural simplicity and the ease with which it can be run in the cloud (where 51% of Databricks survey respondents acknowledge running Spark), Spark is opening up data science to a new class of data professional.
In terms of platforms, Windows has seen a 283% explosion in Spark adoption (rising from 6% to 23% of all users in the last year). By contrast, the Linux/UNIX crowd only climbed from 51% to 75%.
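For readers who want to check the math, here is a minimal sketch of how those growth figures compare when both are expressed as relative increases. The helper name `percent_increase` is just an illustrative choice, and the inputs are the survey shares quoted above.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative growth from `before` to `after`, expressed in percent."""
    return (after - before) / before * 100.0


# Share of Spark users on each platform, last year vs. this year,
# taken from the figures quoted in the paragraph above.
windows = percent_increase(6, 23)
linux = percent_increase(51, 75)

print(f"Windows growth: {windows:.0f}%")
print(f"Linux/UNIX growth: {linux:.0f}%")
```

Measured the same way, the Linux/UNIX share grew by roughly 47%, which is why the Windows jump stands out even though Linux/UNIX remains far more common in absolute terms.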
But what about job titles?
While data engineers claim the biggest percentage (41%), Spark is opening big data to a broader audience:
- Data scientist - 22.2% (7.5% in 2014, according to a separate Typesafe survey)
- Architect - 17.2%
- Management - 10.6%
- Academic - 6.2%
All of this is good for Spark, obviously, but also for big data, generally. Spark boasts the largest big data community, a community that continues to make the project easier and faster. While this shouldn't translate into "Hadoop is doomed!" eulogies for all the reasons stated above, it does suggest that the center of gravity in big data is shifting to Spark.
Matt is currently head of the developer ecosystem at Adobe. The views expressed are his own, not those of his employer.
Matt Asay is a veteran technology columnist who has written for CNET, ReadWrite, and other tech media. Asay has also held a variety of executive roles with leading mobile and big data software companies.